Results 1 - 5 of 5
Action-Conditional Video Prediction using Deep Networks in Atari Games
"... Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Aracade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image-frames de-pend on control variables or actions as well as previous frames. While ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Motivated by vision-based reinforcement learning (RL) problems, in particular Atari games from the recent benchmark Arcade Learning Environment (ALE), we consider spatio-temporal prediction problems where future image frames depend on control variables or actions as well as previous frames. While not composed of natural scenes, frames in Atari games are high-dimensional, can involve tens of objects with one or more objects being controlled directly by the actions and many other objects being influenced indirectly, can involve entry and departure of objects, and can involve deep partial observability. We propose and evaluate two deep neural network architectures that consist of encoding, action-conditional transformation, and decoding layers based on convolutional neural networks and recurrent neural networks. Experimental results show that the proposed architectures are able to generate visually realistic frames that are also useful for control over approximately 100-step action-conditional futures in some games. To the best of our knowledge, this paper is the first to make and evaluate long-term predictions on high-dimensional video conditioned by control inputs.
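The abstract describes an encoding / action-conditional transformation / decoding pipeline. Below is a minimal PyTorch sketch of that pattern, not the authors' implementation: the frame size, layer widths, and the multiplicative action interaction are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionConditionalPredictor(nn.Module):
    """Encode frames -> action-conditional transform -> decode the next frame."""
    def __init__(self, n_actions, feat_dim=1024, factor_dim=512):
        super().__init__()
        # Encoding: convolutions over 4 stacked 64x64 grayscale frames (assumed size).
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 32, 4, stride=2, padding=1), nn.ReLU(),   # -> 32 x 32 x 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # -> 64 x 16 x 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, feat_dim), nn.ReLU(),
        )
        # Action-conditional transformation: a multiplicative interaction
        # between the frame encoding and an embedding of the one-hot action.
        self.enc_factor = nn.Linear(feat_dim, factor_dim)
        self.act_factor = nn.Linear(n_actions, factor_dim)
        self.merge = nn.Linear(factor_dim, feat_dim)
        # Decoding: transposed convolutions back to one predicted frame.
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # -> 32 x 32
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),              # -> 64 x 64
        )

    def forward(self, frames, action_onehot):
        h = self.encoder(frames)                                 # (B, feat_dim)
        f = self.enc_factor(h) * self.act_factor(action_onehot)  # elementwise gating
        return self.decoder(self.merge(f))                       # (B, 1, 64, 64)

model = ActionConditionalPredictor(n_actions=18)  # 18 = full Atari action set
frames = torch.randn(2, 4, 64, 64)
actions = F.one_hot(torch.tensor([3, 7]), num_classes=18).float()
pred = model(frames, actions)  # predicted next frame, shape (2, 1, 64, 64)
```

Long-horizon rollouts would feed each predicted frame back in as input, which is how the 100-step evaluations in the abstract are produced.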
Jointly Modeling Embedding and Translation to Bridge Video and Language, 2016
"... Abstract Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs) ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Automatically describing video content with natural language is a fundamental challenge of computer vision. Recurrent Neural Networks (RNNs) ...
Spatiotemporal Residual Networks for Video Action Recognition
"... Abstract Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination ..."
Abstract
- Add to MetaCart
(Show Context)
Two-stream Convolutional Networks (ConvNets) have shown strong performance for human action recognition in videos. Recently, Residual Networks (ResNets) have arisen as a new technique to train extremely deep architectures. In this paper, we introduce spatiotemporal ResNets as a combination of these two approaches. Our novel architecture generalizes ResNets for the spatiotemporal domain by introducing residual connections in two ways. First, we inject residual connections between the appearance and motion pathways of a two-stream architecture to allow spatiotemporal interaction between the two streams. Second, we transform pretrained image ConvNets into spatiotemporal networks by equipping them with learnable convolutional filters that are initialized as temporal residual connections and operate on adjacent feature maps in time. This approach slowly increases the spatiotemporal receptive field as the depth of the model increases and naturally integrates image ConvNet design principles. The whole model is trained end-to-end to allow hierarchical learning of complex spatiotemporal features. We evaluate our novel spatiotemporal ResNet using two widely used action recognition benchmarks where it exceeds the previous state-of-the-art.
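The abstract names two residual mechanisms. The first, cross-stream injection, amounts to adding motion-stream features into the appearance stream at matching layers. The sketch below illustrates the second one, a learnable filter over adjacent feature maps in time added as a residual; the depthwise kernel shape, zero initialization, and tensor layout are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TemporalResidual(nn.Module):
    """Per-frame features plus a learned combination of adjacent frames in time."""
    def __init__(self, channels):
        super().__init__()
        # Depthwise 3D conv: kernel 3 in time, 1x1 in space, one filter per channel.
        self.conv_time = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1),
                                   padding=(1, 0, 0), groups=channels, bias=False)
        # Zero init (an assumed scheme) so the module starts as an identity
        # mapping and the temporal term grows only as training demands.
        nn.init.zeros_(self.conv_time.weight)

    def forward(self, x):  # x: (batch, channels, time, height, width)
        return x + self.conv_time(x)

feats = torch.randn(2, 64, 8, 14, 14)   # e.g. 8 frames of 64-channel feature maps
out = TemporalResidual(64)(feats)        # same shape, now temporally mixed
```

Stacking such blocks with depth is what slowly widens the temporal receptive field, matching the abstract's description.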
Early Embedding and Late Reranking for Video Captioning, 2016
"... ABSTRACT This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to LSTM by tag embeddings. The other is late reranking, for ..."
Abstract
- Add to MetaCart
(Show Context)
This paper describes our solution for the MSR Video to Language Challenge. We start from the popular ConvNet + LSTM model, which we extend with two novel modules. One is early embedding, which enriches the current low-level input to the LSTM with tag embeddings. The other is late reranking, which re-scores generated sentences in terms of their relevance to a specific video. The modules are inspired by recent work on image captioning, repurposed and redesigned for video. As experiments on the MSR-VTT validation set show, the joint use of these two modules adds a clear improvement over a non-trivial ConvNet + LSTM baseline under four performance metrics. The viability of the proposed solution is further confirmed by the organizers' blind test. Our system is ranked 4th in overall performance, while scoring the best CIDEr-D, which measures the human-likeness of generated captions.
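As a rough illustration of the late-reranking step described above, and not the authors' code: candidate captions from the LSTM are re-scored by mixing the language-model log-probability with a video-sentence relevance score. The `relevance` function and the mixing weight `alpha` are assumptions.

```python
def rerank(candidates, video_feat, relevance, alpha=0.3):
    """Pick the caption maximizing a mix of LM log-probability and relevance.

    candidates: list of (sentence, lm_logprob) pairs from the caption model.
    relevance:  any video-sentence scoring function (assumed here; e.g. cosine
                similarity in a joint video-text embedding space).
    alpha:      assumed mixing weight between the two scores.
    """
    best = max(candidates,
               key=lambda c: (1 - alpha) * c[1] + alpha * relevance(video_feat, c[0]))
    return best[0]
```

The design point is that the language model alone favors fluent but generic sentences; the relevance term pulls the final choice toward captions that actually match the video.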
RUC at MediaEval 2016 Emotional Impact of Movies Task: Fusion of Multimodal Features
"... ABSTRACT In this paper, we present our approaches for the Mediaeval Emotional Impact of Movies Task. We extract features from multiple modalities including audio, image and motion modalities. SVR and Random Forest are used as our regression models and late fusion is applied to fuse different modali ..."
Abstract
- Add to MetaCart
(Show Context)
In this paper, we present our approaches for the MediaEval Emotional Impact of Movies Task. We extract features from multiple modalities, including audio, image, and motion. SVR and Random Forest are used as our regression models, and late fusion is applied to combine the different modalities. Experimental results show that multimodal late fusion is beneficial for predicting global affect and continuous arousal, and that using CNN features can further boost performance. For continuous valence prediction, however, the acoustic features are superior to the other features.
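A minimal scikit-learn sketch of the per-modality regression with late fusion described above, assuming SVR for every modality and equal fusion weights (the paper also uses Random Forest, and its actual fusion weights are not given here).

```python
import numpy as np
from sklearn.svm import SVR

def late_fusion_predict(train_feats, y_train, test_feats, weights=None):
    """Train one regressor per modality, then average their predictions.

    train_feats / test_feats: dicts mapping a modality name (e.g. 'audio',
    'image', 'motion') to its feature matrix for train / test clips.
    y_train: target affect scores (e.g. arousal or valence annotations).
    """
    preds = [SVR(kernel='rbf').fit(X, y_train).predict(test_feats[name])
             for name, X in train_feats.items()]
    # Late fusion: combine per-modality predictions, not per-modality features.
    w = weights if weights is not None else np.ones(len(preds)) / len(preds)
    return np.average(np.stack(preds), axis=0, weights=w)
```

Late fusion of this form keeps each modality's model independent, which is why a weak modality (here, non-acoustic features for valence) can simply be down-weighted or dropped without retraining the rest.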