Results 1 - 10 of 109
Two-stream convolutional networks for action recognition in videos
- CoRR
"... We investigate architectures of discriminatively trained deep Convolutional Net-works (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design as ..."
Cited by 43 (3 self)
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
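The two-stream design described in this abstract maps naturally to code. Below is a minimal PyTorch sketch (my own illustration, not the authors' implementation: the layer sizes, the 10-frame flow stack, and the averaging fusion are assumptions standing in for the paper's CNN-M-style streams and its softmax/SVM fusion variants):

```python
# Minimal two-stream sketch: spatial stream on one RGB frame,
# temporal stream on stacked optical-flow fields, late score fusion.
import torch
import torch.nn as nn

def make_stream(in_channels, num_classes):
    # A small ConvNet standing in for the paper's deeper streams.
    return nn.Sequential(
        nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(256, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, num_flow_frames=10):
        super().__init__()
        self.spatial = make_stream(3, num_classes)
        # Horizontal and vertical flow are stacked per frame, hence 2x.
        self.temporal = make_stream(2 * num_flow_frames, num_classes)

    def forward(self, rgb, flow):
        # Late fusion by averaging per-stream class probabilities.
        return (self.spatial(rgb).softmax(-1)
                + self.temporal(flow).softmax(-1)) / 2

scores = TwoStream()(torch.randn(1, 3, 224, 224), torch.randn(1, 20, 224, 224))
```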
Action and Event Recognition with Fisher Vectors on a Compact Feature Set
2014
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Cited by 39 (7 self)
Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice
2014
"... ar ..."
(Show Context)
Video action detection with relational dynamic-poselets
- In ECCV, 2014
"... •Problem: We aim to not only recognize on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation). •Key insights: ..."
Cited by 12 (3 self)
• Problem: We aim not only to recognize the on-going action class (action recognition), but also to localize its spatiotemporal extent (action detection), and even to estimate the pose of the actor (pose estimation).
• Key insights:
Spatio-Temporal Object Detection Proposals
"... Abstract. Spatio-temporal detection of actions and events in video is a challeng-ing problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the fra ..."
Cited by 12 (1 self)
Spatio-temporal detection of actions and events in video is a challenging problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Recently, methods that generate unsupervised detection proposals have proven to be very effective for object detection in still images. These methods open the possibility to use strong but computationally expensive features, since only a relatively small number of detection hypotheses need to be assessed. In this paper we make two contributions towards exploiting detection proposals for spatio-temporal detection problems. First, we extend a recent 2D object proposal method to produce spatio-temporal proposals by a randomized supervoxel merging process. We introduce spatial, temporal, and spatio-temporal pairwise supervoxel features that are used to guide the merging process. Second, we propose a new efficient supervoxel method. We experimentally evaluate our detection proposals in combination with our new supervoxel method as well as existing ones. This evaluation shows that our supervoxels lead to more accurate proposals than existing state-of-the-art supervoxel methods.
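To make the randomized-merging idea concrete, here is a heavily simplified Python sketch (my own reading of the abstract, not the paper's algorithm: the similarity function, the sampling rule, and the stopping condition are placeholders). Pairs of adjacent regions are repeatedly merged with probability proportional to a pairwise similarity, and every intermediate region is kept as a spatio-temporal proposal:

```python
import random

def spatiotemporal_proposals(adjacency, similarity, seed=0):
    """Randomized supervoxel merging, simplified for illustration.
    adjacency: dict mapping supervoxel id -> set of neighbour ids (symmetric).
    similarity: function (frozenset, frozenset) -> positive float on regions.
    Returns every intermediate merged region as a proposal."""
    rng = random.Random(seed)
    regions = {v: frozenset([v]) for v in adjacency}   # id -> member supervoxels
    nbrs = {v: set(ns) for v, ns in adjacency.items()}
    proposals = []
    while any(nbrs.values()):
        # Pick a region with neighbours, then a neighbour biased by similarity;
        # re-running with another seed yields a different merging hierarchy.
        a = rng.choice([v for v in nbrs if nbrs[v]])
        cand = list(nbrs[a])
        w = [similarity(regions[a], regions[b]) for b in cand]
        b = rng.choices(cand, weights=w)[0]
        # Contract the edge: merge b into a, rewire b's neighbours to a.
        regions[a] = regions[a] | regions[b]
        nbrs[a] = (nbrs[a] | nbrs[b]) - {a, b}
        for c in nbrs[b] - {a}:
            nbrs[c].discard(b)
            nbrs[c].add(a)
        del regions[b], nbrs[b]
        proposals.append(regions[a])   # every merged region is a proposal
    return proposals

# Toy usage on a 3-node chain with uniform similarity.
props = spatiotemporal_proposals({0: {1}, 1: {0, 2}, 2: {1}}, lambda r, s: 1.0)
```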
Action and Event Recognition with Fisher Vectors on a Compact Feature Set
"... Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex ..."
Cited by 9 (5 self)
Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex models, we focus on the low-level features and their encoding. We evaluate the use of Fisher vectors as an alternative to bag-of-words histograms to aggregate a small set of state-of-the-art low-level descriptors, in combination with linear classifiers. We present a large and varied set of evaluations, considering (i) classification of short actions in five datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that for basic action recognition and localization, MBH features alone are enough for state-of-the-art performance. For complex events we find that SIFT and MFCC features provide complementary cues. On all three problems we obtain state-of-the-art results, while using fewer features and less complex models.
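For context, the Fisher vector encoding this abstract contrasts with bag-of-words aggregates gradients of a GMM log-likelihood rather than visual-word counts. Below is a bare-bones NumPy sketch of the standard first-order (mean-gradient) component, my own rendering rather than the paper's pipeline:

```python
import numpy as np

def fisher_vector_means(X, w, mu, var):
    """First-order Fisher vector w.r.t. GMM means (diagonal covariances).
    X: (N, D) local descriptors; w: (K,) mixture weights;
    mu: (K, D) component means; var: (K, D) per-dimension variances."""
    N, D = X.shape
    # Soft assignments gamma[n, k] = p(k | x_n) under the GMM.
    log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var).sum(-1)
             - 0.5 * np.log(2 * np.pi * var).sum(-1) + np.log(w))
    gamma = np.exp(log_p - log_p.max(1, keepdims=True))
    gamma /= gamma.sum(1, keepdims=True)
    # Mean gradient: (1 / (N sqrt(w_k))) * sum_n gamma_nk (x_n - mu_k) / sigma_k
    diff = (X[:, None, :] - mu) / np.sqrt(var)          # (N, K, D)
    fv = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    return fv.ravel()                                    # length K * D
```

In practice the result is then power- and ℓ2-normalized before being fed to a linear classifier.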
Weakly Supervised Action Labeling in Videos Under Ordering Constraints
"... Abstract. We are given a set of video clips, each one annotated with an ordered list of actions, such as “walk ” then “sit ” then “answer phone” extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discrimin ..."
Cited by 9 (1 self)
We are given a set of video clips, each one annotated with an ordered list of actions, such as “walk”, then “sit”, then “answer phone”, extracted from, for example, the associated text script. We seek to temporally localize the individual actions in each clip as well as to learn a discriminative classifier for each action. We formulate the problem as a weakly supervised temporal assignment with ordering constraints. Each video clip is divided into small time intervals, and each time interval of each video clip is assigned one action label, while respecting the order in which the action labels appear in the given annotations. We show that the action label assignment can be determined together with learning a classifier for each action in a discriminative manner. We evaluate the proposed model on a new and challenging dataset of 937 video clips with a total of 787,720 frames containing sequences of 16 different actions from 69 Hollywood movies.
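The interval-to-label assignment under ordering constraints can be viewed as a monotone alignment and solved by dynamic programming. The sketch below is illustrative only: the paper learns the classifiers jointly with the assignment, whereas here the per-interval scores are assumed given.

```python
import numpy as np

def ordered_assignment(scores):
    """scores[t, j]: score of labelling interval t with the j-th action of
    the ordered annotation. Returns the best monotone label sequence that
    starts at action 0 and ends at the last action (requires T >= J)."""
    T, J = scores.shape
    dp = np.full((T, J), -np.inf)
    back = np.zeros((T, J), dtype=int)
    dp[0, 0] = scores[0, 0]            # must start on the first action
    for t in range(1, T):
        for j in range(J):
            stay = dp[t - 1, j]
            advance = dp[t - 1, j - 1] if j else -np.inf
            back[t, j] = int(advance > stay)   # 1 means we advanced to action j
            dp[t, j] = max(stay, advance) + scores[t, j]
    # Backtrack from the final action of the ordered list.
    labels, j = [J - 1], J - 1
    for t in range(T - 1, 0, -1):
        j -= back[t, j]
        labels.append(j)
    return labels[::-1], dp[T - 1, J - 1]
```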
Action recognition with trajectory-pooled deep-convolutional descriptors
- In CVPR, 2015
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Cited by 8 (5 self)
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform the convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features are twofold: (i) TDDs are automatically learned and have higher discriminative capacity than hand-crafted features; (ii) TDDs take into account the intrinsic characteristics of the temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets.
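The two normalizations are simple per-map rescalings. Here is a short NumPy sketch of one plausible reading of them (assuming feature maps of shape (T, H, W, C); the epsilon guard is my addition):

```python
import numpy as np

def spatiotemporal_norm(fmap, eps=1e-8):
    # Scale each channel by that channel's max over all of space-time,
    # so no single channel dominates the pooled descriptor.
    return fmap / (fmap.max(axis=(0, 1, 2), keepdims=True) + eps)

def channel_norm(fmap, eps=1e-8):
    # Scale each spatio-temporal position by its max across channels,
    # equalizing responses across locations instead.
    return fmap / (fmap.max(axis=3, keepdims=True) + eps)
```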
Efficient Action Localization with Approximately Normalized Fisher Vectors
"... The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-word represen-tation. Transformation of the FV by power and ℓ2 normal-izations has shown to significantly improve its performance, and led to state-of-the-art results for a range of image and video classifica ..."
Cited by 7 (1 self)
The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-words representation. Transforming the FV by power and ℓ2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks. These normalizations, however, render the representation non-additive over local descriptors. Combined with its high dimensionality, this makes the FV computationally expensive for the purpose of localization tasks. In this paper we present approximations to both of these normalizations, which yield significant improvements in the memory and computational costs of the FV when used for localization. Second, we show how these approximations can be used to define upper bounds on the score function that can be efficiently evaluated, which enables the use of branch-and-bound search as an alternative to exhaustive sliding-window search. We present experimental evaluation results on classification and temporal localization of actions in videos. These show that our approximations lead to a speedup of at least one order of magnitude, while maintaining state-of-the-art action recognition and localization performance.
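For reference, the exact power and ℓ2 normalizations being approximated are the standard ones below (NumPy sketch; the paper's additive approximations themselves are not reproduced here):

```python
import numpy as np

def normalize_fv(fv, rho=0.5):
    # Signed power normalization followed by l2 normalization. Both steps
    # are non-additive over local descriptors, which is what makes exact
    # evaluation expensive inside a sliding-window localization loop.
    fv = np.sign(fv) * np.abs(fv) ** rho
    return fv / (np.linalg.norm(fv) + 1e-12)
```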
Instructional videos for unsupervised harvesting and learning of action examples
- In ACM MM, 2014
"... ABSTRACT Online instructional videos have become a popular way for people to learn new skills encompassing art, cooking and sports. As watching instructional videos is a natural way for humans to learn, analogously, machines can also gain knowledge from these videos. We propose to utilize the large ..."
Cited by 5 (1 self)
Online instructional videos have become a popular way for people to learn new skills encompassing art, cooking and sports. As watching instructional videos is a natural way for humans to learn, analogously, machines can also gain knowledge from these videos. We propose to utilize the large amount of instructional videos available online to harvest examples of various actions in an unsupervised fashion. The key observation is that in instructional videos, the instructor's action is highly correlated with the instructor's narration. By leveraging this correlation, we can exploit the timing of action-related terms in the speech transcript to temporally localize actions in the video and harvest action examples. The proposed method is scalable as it requires no human intervention. Experiments show that the harvested examples are of reasonably good quality, and action detectors trained on data collected by our unsupervised method yield performance comparable to detectors trained on manually collected data on the TRECVID Multimedia Event Detection task.
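At its core, the harvesting step amounts to matching action terms against time-stamped transcript words and cutting a window around each hit. A toy Python sketch under that reading (the fixed window size and exact-match rule are my assumptions; the real system additionally filters and scores candidates):

```python
def harvest_examples(transcript, action_terms, window=3.0):
    """transcript: list of (word, time_sec) pairs from speech recognition.
    Returns (term, start_sec, end_sec) candidate clips, one per occurrence
    of an action term, clipped to start no earlier than time 0."""
    terms = {t.lower() for t in action_terms}
    return [(w.lower(), max(0.0, t - window), t + window)
            for w, t in transcript if w.lower() in terms]

# Example: one "slice" hit yields a 6-second candidate clip around t=12.4s.
clips = harvest_examples([("slice", 12.4), ("the", 12.6), ("onion", 12.9)],
                         ["slice", "chop"])
```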