Results 1 - 10 of 43
Action Recognition with Improved Trajectories
- in IEEE International Conference on Computer Vision (ICCV), 2013
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
-
Cited by 109 (11 self)
- Add to MetaCart
(Show Context)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Two-stream convolutional networks for action recognition in videos
- CoRR
"... We investigate architectures of discriminatively trained deep Convolutional Net-works (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design as ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
(Show Context)
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
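As a concrete illustration, here is a minimal PyTorch sketch of the two-stream idea; the layer sizes, class count, and flow-stack length below are illustrative assumptions, not the paper's actual (much deeper) architecture:

```python
import torch
import torch.nn as nn

class StreamNet(nn.Module):
    """One small ConvNet stream; in_channels is 3 for RGB, 2*L for L stacked flow fields."""
    def __init__(self, in_channels, num_classes):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_len=10):
        super().__init__()
        self.spatial = StreamNet(3, num_classes)              # appearance: one RGB frame
        self.temporal = StreamNet(2 * flow_len, num_classes)  # motion: stacked optical flow

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream softmax scores.
        return (self.spatial(rgb).softmax(1) + self.temporal(flow).softmax(1)) / 2

net = TwoStream()
scores = net(torch.randn(2, 3, 224, 224), torch.randn(2, 20, 224, 224))
```

The key design point the abstract describes is that appearance and motion are handled by separate networks and combined only at the score level.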
Action and Event Recognition with Fisher Vectors on a Compact Feature Set
- 2014
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
-
Cited by 39 (7 self)
- Add to MetaCart
(Show Context)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Spatio-Temporal Object Detection Proposals
"... Abstract. Spatio-temporal detection of actions and events in video is a challeng-ing problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the fra ..."
Abstract
-
Cited by 12 (1 self)
- Add to MetaCart
(Show Context)
Spatio-temporal detection of actions and events in video is a challenging problem. Besides the difficulties related to recognition, a major challenge for detection in video is the size of the search space defined by spatio-temporal tubes formed by sequences of bounding boxes along the frames. Recently, methods that generate unsupervised detection proposals have proven to be very effective for object detection in still images. These methods open the possibility to use strong but computationally expensive features, since only a relatively small number of detection hypotheses need to be assessed. In this paper we make two contributions towards exploiting detection proposals for spatio-temporal detection problems. First, we extend a recent 2D object proposal method to produce spatio-temporal proposals by a randomized supervoxel merging process. We introduce spatial, temporal, and spatio-temporal pairwise supervoxel features that are used to guide the merging process. Second, we propose a new efficient supervoxel method. We experimentally evaluate our detection proposals in combination with our new supervoxel method as well as existing ones. This evaluation shows that our supervoxels lead to more accurate proposals when compared to using existing state-of-the-art supervoxel methods.
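To make the merging process concrete, here is a small Python sketch of randomized supervoxel merging under simplifying assumptions (neighbours picked uniformly at random; each supervoxel given as a per-frame bounding box dict), whereas the paper guides merges with learned pairwise features:

```python
import random

def tube_union(a, b):
    """Union of two tubes, each a dict {frame: (x0, y0, x1, y1)}."""
    tube = dict(a)
    for f, box in b.items():
        if f in tube:
            u0, v0, u1, v1 = tube[f]
            x0, y0, x1, y1 = box
            tube[f] = (min(u0, x0), min(v0, y0), max(u1, x1), max(v1, y1))
        else:
            tube[f] = box
    return tube

def merge_proposals(supervoxels, adjacency, seed=0):
    """Randomised agglomerative merging; every intermediate region is a proposal."""
    rng = random.Random(seed)
    regions = dict(enumerate(supervoxels))
    edges = {tuple(sorted(e)) for e in adjacency}
    proposals = [dict(sv) for sv in supervoxels]
    nxt = len(supervoxels)
    while edges:
        i, j = rng.choice(sorted(edges))          # uniform here; feature-weighted in the paper
        regions[nxt] = tube_union(regions.pop(i), regions.pop(j))
        proposals.append(regions[nxt])            # each merge yields a new tube proposal
        remap = lambda a: nxt if a in (i, j) else a
        edges = {tuple(sorted((remap(a), remap(b))))
                 for a, b in edges if {a, b} != {i, j} and remap(a) != remap(b)}
        nxt += 1
    return proposals

svs = [{0: (0, 0, 4, 4)}, {0: (4, 0, 8, 4), 1: (4, 0, 8, 4)}, {1: (0, 0, 4, 4)}]
props = merge_proposals(svs, [(0, 1), (1, 2)])
```

Randomising the merge order is what lets a single hierarchy yield a diverse pool of proposals rather than one segmentation.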
Action and Event Recognition with Fisher Vectors on a Compact Feature Set
"... Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex ..."
Abstract
-
Cited by 9 (5 self)
- Add to MetaCart
(Show Context)
Action recognition in uncontrolled video is an important and challenging computer vision problem. Recent progress in this area is due to new local features and models that capture spatio-temporal structure between local features, or human-object interactions. Instead of working towards more complex models, we focus on the low-level features and their encoding. We evaluate the use of Fisher vectors as an alternative to bag-of-word histograms to aggregate a small set of state-of-the-art low-level descriptors, in combination with linear classifiers. We present a large and varied set of evaluations, considering (i) classification of short actions in five datasets, (ii) localization of such actions in feature-length movies, and (iii) large-scale recognition of complex events. We find that for basic action recognition and localization MBH features alone are enough for state-of-the-art performance. For complex events we find that SIFT and MFCC features provide complementary cues. On all three problems we obtain state-of-the-art results, while using fewer features and less complex models.
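For reference, a minimal Fisher-vector aggregation sketch in Python (mean-gradient terms of a diagonal-covariance GMM only, with the usual power and ℓ2 normalizations; the descriptor dimension and component count are placeholders, not the paper's settings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(descriptors, gmm):
    """Aggregate local descriptors of one video into a single Fisher vector."""
    q = gmm.predict_proba(descriptors)                 # soft assignments, (N, K)
    sigmas = np.sqrt(gmm.covariances_)                 # (K, D), diagonal covariances
    n = descriptors.shape[0]
    fv = []
    for k in range(gmm.n_components):
        diff = (descriptors - gmm.means_[k]) / sigmas[k]   # whitened residuals
        g_mu = (q[:, k:k + 1] * diff).sum(0) / (n * np.sqrt(gmm.weights_[k]))
        fv.append(g_mu)
    fv = np.concatenate(fv)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))             # power normalisation
    return fv / (np.linalg.norm(fv) + 1e-12)           # l2 normalisation

X = np.random.randn(500, 64)                           # e.g. PCA-reduced MBH descriptors
gmm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(X)
fv = fisher_vector(np.random.randn(300, 64), gmm)      # one video -> one K*D vector
```

Unlike a bag-of-words histogram, the encoding keeps first-order statistics per component, which is why a small descriptor set can still be discriminative with linear classifiers.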
Multi-view super vector for action recognition
- in Proc. IEEE Conf. CVPR, 2014
"... Images and videos are often characterized by multiple types of local descriptors such as SIFT, HOG and HOF, each of which describes certain aspects of object feature. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor conca ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
(Show Context)
Images and videos are often characterized by multiple types of local descriptors such as SIFT, HOG and HOF, each of which describes certain aspects of an object's features. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor concatenation and kernel average. The first one is effective when different descriptors are strongly correlated, while the second one is probably better when descriptors are relatively independent. In practice, however, different descriptors are neither fully independent nor fully correlated, and previous fusion methods may not be satisfying. In this paper, we propose a new global representation, Multi-View Super Vector (MVSV), which is composed of relatively independent components derived from a pair of descriptors. Kernel average is then applied on these components to produce the recognition result. To obtain MVSV, we develop a generative mixture model of probabilistic canonical correlation analyzers (M-PCCA), and utilize the hidden factors and gradient vectors of M-PCCA to construct MVSV for video representation. Experiments on video-based action recognition tasks show that MVSV achieves promising results, and outperforms FV and VLAD with either the descriptor concatenation or kernel average fusion strategy.
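As a rough illustration of the decomposition idea, here is a sketch using plain CCA from scikit-learn; the paper's M-PCCA mixture model is not implemented here, and the least-squares residual computation below is our simplification of "shared part plus view-specific parts":

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def split_views(X, Y, dim=16):
    """Decompose a descriptor pair into shared components and view-specific residuals."""
    cca = CCA(n_components=dim).fit(X, Y)
    Zx, Zy = cca.transform(X, Y)                  # correlated (shared) components
    # View-specific residuals: remove what the shared components explain.
    Rx = X - Zx @ np.linalg.lstsq(Zx, X, rcond=None)[0]
    Ry = Y - Zy @ np.linalg.lstsq(Zy, Y, rcond=None)[0]
    Z = (Zx + Zy) / 2                             # keep one copy of the shared part
    return np.hstack([Z, Rx, Ry])                 # relatively independent blocks

hog = np.random.randn(200, 96)                    # e.g. per-video HOG encoding
hof = np.random.randn(200, 108)                   # e.g. per-video HOF encoding
fused = split_views(hog, hof)
```

The point of separating shared from view-specific parts is that kernel averaging over the resulting blocks no longer double-counts the correlated information, which is the failure mode of naive concatenation.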
Action localization with tubelets from motion
"... This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the n ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
This paper considers the problem of action localization, where the objective is to determine when and where certain actions appear. We introduce a sampling strategy to produce 2D+t sequences of bounding boxes, called tubelets. Compared to state-of-the-art alternatives, this drastically reduces the number of hypotheses that are likely to include the action of interest. Our method is inspired by a recent technique introduced in the context of image localization. Beyond considering this technique for the first time for videos, we revisit this strategy for 2D+t sequences obtained from super-voxels. Our sampling strategy advantageously exploits a criterion that reflects how action-related motion deviates from background motion. We demonstrate the interest of our approach by extensive experiments on two public datasets: UCF Sports and MSR-II. Our approach significantly outperforms the state-of-the-art on both datasets, while restricting the search of actions to a fraction of possible bounding box sequences.
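The motion criterion can be sketched as follows; the specific score used here (distance of a region's mean flow from the median frame flow) is an illustrative stand-in for the paper's measure of deviation from background motion:

```python
import numpy as np

def motion_deviation(region_flow, frame_flow):
    """region_flow, frame_flow: (N, 2) arrays of optical-flow vectors."""
    background = np.median(frame_flow, axis=0)     # robust camera/background motion
    return float(np.linalg.norm(region_flow.mean(axis=0) - background))

def rank_tubelets(tubelets, frame_flows):
    """tubelets: list of lists of (frame_index, region_flow) pairs.
    Returns tubelets sorted so that action-like motion comes first."""
    scores = [np.mean([motion_deviation(rf, frame_flows[f]) for f, rf in t])
              for t in tubelets]
    order = np.argsort(scores)[::-1]
    return [tubelets[i] for i in order]

rng = np.random.default_rng(0)
flows = [rng.standard_normal((100, 2)) for _ in range(2)]  # per-frame flow fields
still = [(0, flows[0][:10]), (1, flows[1][:10])]           # background-like tubelet
moving = [(0, flows[0][:10] + 3.0)]                        # tubelet deviating from background
ranked = rank_tubelets([still, moving], flows)
```

Ranking box sequences by such a criterion is what lets the method examine only a small fraction of all possible tubes.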
Efficient Action Localization with Approximately Normalized Fisher Vectors
"... The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-word represen-tation. Transformation of the FV by power and ℓ2 normal-izations has shown to significantly improve its performance, and led to state-of-the-art results for a range of image and video classifica ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
The Fisher vector (FV) representation is a high-dimensional extension of the popular bag-of-word representation. Transformation of the FV by power and ℓ2 normalizations has been shown to significantly improve its performance, and has led to state-of-the-art results for a range of image and video classification and retrieval tasks. These normalizations, however, render the representation non-additive over local descriptors. Combined with its high dimensionality, this makes the FV computationally expensive for the purpose of localization tasks. In this paper we first present approximations to both these normalizations, which yield significant improvements in the memory and computational costs of the FV when used for localization. Second, we show how these approximations can be used to define upper bounds on the score function that can be efficiently evaluated, which enables the use of branch-and-bound search as an alternative to exhaustive sliding window search. We present experimental evaluation results on classification and temporal localization of actions in videos. These show that our approximations lead to a speedup of at least one order of magnitude, while maintaining state-of-the-art action recognition and localization performance.
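The additivity problem, and one simple approximation (ours, for illustration; the paper derives its own approximations and bounds), can be sketched as follows: unnormalized FVs add over temporal cells, so approximating the ℓ2 norm of a window by a sum of per-cell squared norms makes window scores cheap to update:

```python
import numpy as np

def exact_score(w, cells):
    """Exact l2-normalised score; must re-normalise for every candidate window."""
    u = cells.sum(axis=0)                          # unnormalised FVs are additive
    return w @ (u / np.linalg.norm(u))

def approx_score(w, cells):
    """Approximate ||sum_c u_c||^2 by sum_c ||u_c||^2 (cross terms assumed small),
    so both numerator and denominator are additive over cells."""
    u = cells.sum(axis=0)
    approx_norm = np.sqrt((np.linalg.norm(cells, axis=1) ** 2).sum())
    return (w @ u) / approx_norm

rng = np.random.default_rng(0)
w = rng.standard_normal(256)                       # linear classifier weights
cells = rng.standard_normal((8, 256))              # per-cell unnormalised FVs
print(exact_score(w, cells), approx_score(w, cells))
```

Once the score is additive over cells, per-window evaluation reduces to cached sums, which is what makes branch-and-bound bounds cheap to compute.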
Action Recognition with Actons
"... With the improved accessibility to an exploding amoun-t of video data and growing demands in a wide range of video analysis applications, video-based action recogni-tion/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for acti ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
With improved access to an exploding amount of video data and growing demands in a wide range of video analysis applications, video-based action recognition/classification becomes an increasingly important task in computer vision. In this paper, we propose a two-layer structure for action recognition to automatically exploit a mid-level “acton” representation. The weakly-supervised actons are learned via a new max-margin multi-channel multiple instance learning framework, which can capture multiple mid-level action concepts simultaneously. The learned actons (with no requirement for detailed manual annotations) have the properties of being compact, informative, discriminative, and easy to scale. The experimental results demonstrate the effectiveness of applying the learned actons in our two-layer structure, and show state-of-the-art recognition performance on two challenging action datasets, i.e., Youtube and HMDB51.
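As a rough sketch of the weak supervision involved, here is a generic mi-SVM-style multiple instance learning loop; the paper's max-margin multi-channel MIL formulation for actons is more elaborate than this:

```python
import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(bags, bag_labels, iters=5):
    """Only bag labels are known; instance labels in positive bags are latent."""
    X = np.vstack(bags)
    y = np.concatenate([np.full(len(b), l) for b, l in zip(bags, bag_labels)])
    for _ in range(iters):
        clf = LinearSVC(C=1.0, max_iter=5000).fit(X, y)
        scores, start = clf.decision_function(X), 0
        for b, l in zip(bags, bag_labels):
            s = scores[start:start + len(b)]
            if l == 1:                              # relabel latent instances,
                y[start:start + len(b)] = s > 0     # keeping at least one
                y[start + int(np.argmax(s))] = 1    # positive per positive bag
            start += len(b)
    return clf

rng = np.random.default_rng(0)
bags = [rng.standard_normal((10, 8)) + off for off in (0.0, 0.0, 2.0, 2.0)]
clf = mi_svm(bags, [0, 0, 1, 1])                    # bag-level labels only
```

The alternation between classifier training and latent relabeling is what lets mid-level concepts emerge without per-instance annotation.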
Towards good practices for action video encoding
- in ICCV, 2013
"... High dimensional representations such as VLAD or FV have shown excellent accuracy in action recognition. This paper shows that a proper encoding built upon VLAD can achieve further accuracy boost with only negligible com-putational cost. We empirically evaluated various VLAD improvement technologies ..."
Abstract
-
Cited by 6 (1 self)
- Add to MetaCart
(Show Context)
High-dimensional representations such as VLAD or FV have shown excellent accuracy in action recognition. This paper shows that a proper encoding built upon VLAD can achieve a further accuracy boost with only negligible computational cost. We empirically evaluated various VLAD improvement techniques to determine good practices in VLAD-based video encoding. Furthermore, we propose an interpretation of VLAD as a maximum entropy linear feature learning process. Combining this new perspective with observed VLAD data distribution properties, we propose a simple, lightweight, but powerful bimodal encoding method. Evaluated on 3 benchmark action recognition datasets (UCF101, HMDB51 and Youtube), the bimodal encoding improves VLAD by large margins in action recognition.
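For context, a minimal VLAD encoder with the common power and ℓ2 normalizations is sketched below; the paper's bimodal encoding itself is not reproduced here, and all sizes are placeholders:

```python
import numpy as np
from sklearn.cluster import KMeans

def vlad(descriptors, kmeans):
    """Aggregate local descriptors into one VLAD vector of size k * d."""
    k, d = kmeans.n_clusters, descriptors.shape[1]
    assign = kmeans.predict(descriptors)
    v = np.zeros((k, d))
    for j in range(k):
        members = descriptors[assign == j]
        if len(members):
            v[j] = (members - kmeans.cluster_centers_[j]).sum(axis=0)  # residuals
    v = v.ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))            # power normalisation
    return v / (np.linalg.norm(v) + 1e-12)         # global l2 normalisation

X = np.random.randn(1000, 64)                      # training descriptors
km = KMeans(n_clusters=16, n_init=10, random_state=0).fit(X)
video_vlad = vlad(np.random.randn(400, 64), km)    # one video -> 16 * 64 dims
```

The improvements the paper evaluates are variations on exactly these steps (assignment, residual aggregation, and normalization), which is why better encodings come at negligible extra cost.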