Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice. 2014.
"... ar ..."
(Show Context)
Mining motion atoms and phrases for complex action recognition. In ICCV, 2013.
"... This paper proposes motion atom and phrase as a mid-level temporal “part ” for representing and classifying com-plex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of mul ..."
Abstract
-
Cited by 16 (11 self)
- Add to MetaCart
(Show Context)
This paper proposes motion atoms and phrases as mid-level temporal "parts" for representing and classifying complex actions. A motion atom is defined as an atomic part of an action, capturing the motion information of an action video at a short temporal scale. A motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints at a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power, introducing a bottom-up phrase-construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
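The AND/OR structure described above reduces to a simple max/min scoring rule. Below is a minimal sketch under stated assumptions: each atom classifier has already been evaluated on every short temporal segment of a video, the max implements the OR over a temporal window around an anchor time, and the min implements the AND over a phrase's units. All names are hypothetical, not taken from the authors' code.

```python
def atom_response(segment_scores, t, window):
    """OR node: best response of one atom within a temporal window
    centered on anchor time t."""
    lo, hi = max(0, t - window), min(len(segment_scores), t + window + 1)
    return max(segment_scores[lo:hi])

def phrase_response(per_atom_scores, phrase, window=2):
    """AND node: a phrase fires only if every (atom, anchor-time) unit
    fires somewhere in its window, hence the min over units."""
    # phrase: list of (atom_id, anchor_t) pairs; per_atom_scores[atom_id]
    # holds that atom classifier's score on each temporal segment.
    return min(
        atom_response(per_atom_scores[atom_id], t, window)
        for atom_id, t in phrase
    )
```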
Video action detection with relational dynamic-poselets. In ECCV, 2014.
"... •Problem: We aim to not only recognize on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation). •Key insights: ..."
Abstract
-
Cited by 12 (3 self)
- Add to MetaCart
(Show Context)
Problem: we aim not only to recognize the on-going action class (action recognition), but also to localize its spatiotemporal extent (action detection) and even estimate the pose of the actor (pose estimation).
Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015.
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
(Show Context)
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called the trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features are twofold: (i) TDDs are automatically learned and have higher discriminative capacity than hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce trajectory-constrained sampling and pooling strategies for aggregating deep-learned features. We conduct experiments on two challenging datasets, HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves performance superior to the state of the art on these datasets.
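The abstract names the two normalizations only at a high level; the sketch below shows one plausible reading, assuming a video's convolutional responses are stacked into a (T, H, W, C) array and trajectory points have already been mapped into feature-map coordinates. Function names and the sum-pooling choice are assumptions, not the authors' published code.

```python
import numpy as np

def spatiotemporal_normalize(fmap, eps=1e-8):
    """Divide each channel by its own max over (T, H, W), so all
    channels' responses share a comparable range."""
    return fmap / (fmap.max(axis=(0, 1, 2), keepdims=True) + eps)

def channel_normalize(fmap, eps=1e-8):
    """Divide each (t, h, w) position by its max over channels,
    emphasizing the dominant channel at every position."""
    return fmap / (fmap.max(axis=3, keepdims=True) + eps)

def trajectory_pool(fmap, trajectory):
    """Trajectory-constrained pooling: aggregate the normalized
    feature vectors along one trajectory's (t, x, y) points."""
    return sum(fmap[t, y, x] for t, x, y in trajectory)
```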
Multi-view super vector for action recognition. In CVPR, 2014.
"... Images and videos are often characterized by multiple types of local descriptors such as SIFT, HOG and HOF, each of which describes certain aspects of object feature. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor conca ..."
Abstract
-
Cited by 8 (4 self)
- Add to MetaCart
(Show Context)
Images and videos are often characterized by multiple types of local descriptors, such as SIFT, HOG, and HOF, each of which describes certain aspects of the object. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor concatenation and kernel average. The first is effective when different descriptors are strongly correlated, while the second is probably better when descriptors are relatively independent. In practice, however, different descriptors are neither fully independent nor fully correlated, and previous fusion methods may not be satisfactory. In this paper, we propose a new global representation, the Multi-View Super Vector (MVSV), which is composed of relatively independent components derived from a pair of descriptors. Kernel average is then applied to these components to produce the recognition result. To obtain MVSV, we develop a generative mixture model of probabilistic canonical correlation analyzers (M-PCCA), and utilize the hidden factors and gradient vectors of M-PCCA to construct MVSV for video representation. Experiments on video-based action recognition tasks show that MVSV achieves promising results, outperforming FV and VLAD with either the descriptor concatenation or the kernel average fusion strategy.
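The two baseline pipelines the abstract contrasts are easy to pin down concretely. The sketch below shows both, assuming each descriptor type has already been encoded into one vector per video (e.g., a Fisher vector); MVSV itself, built on M-PCCA, is not reproduced here.

```python
import numpy as np

def linear_kernel(X):
    """Gram matrix of one descriptor type's per-video encodings."""
    return X @ X.T  # X: (n_videos, d_i)

def concat_fusion(encodings):
    """Representation-level fusion: concatenate encodings of all
    descriptor types (suits strongly correlated descriptors)."""
    return np.concatenate(encodings, axis=1)

def kernel_average_fusion(encodings):
    """Kernel-level fusion: average one kernel per descriptor type
    (suits nearly independent descriptors)."""
    return sum(linear_kernel(X) for X in encodings) / len(encodings)
```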
Motion part regularization: improving action recognition via trajectory group selection. In CVPR, 2015.
"... Dense local trajectories have been successfully used in action recognition. However, for most actions only a few local motion features (e.g., critical movement of hand, arm, leg etc.) are responsible for the action label. Therefore, highlighting the local features which are associated with important ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Dense local trajectories have been used successfully in action recognition. However, for most actions only a few local motion features (e.g., critical movements of the hand, arm, or leg) are responsible for the action label. Therefore, highlighting the local features associated with important motion parts leads to a more discriminative action representation. Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework to mine discriminative groups of dense trajectories which form important motion parts. First, motion part candidates are generated by spatio-temporal grouping of densely extracted trajectories. Second, an objective function which encourages sparse selection of these trajectory groups is formulated together with an action-class discriminative term. We then propose an alternating optimization algorithm to solve this objective function efficiently by introducing a set of auxiliary variables which correspond to the discriminativeness weights of each motion part (trajectory group). These learned motion part weights are further utilized to form a discriminativeness-weighted Fisher vector representation of each action sample for final classification. The proposed motion part regularization framework achieves state-of-the-art performance on several action recognition benchmarks.
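As a rough illustration of the last step, the sketch below forms a discriminativeness-weighted Fisher vector by scaling each trajectory group's FV with its learned weight before the usual power and L2 normalization. The per-group FVs and weights are assumed given, and the aggregation rule is my reading of the abstract, not the paper's exact formulation.

```python
import numpy as np

def weighted_fisher_vector(group_fvs, group_weights, eps=1e-8):
    """Aggregate per-group Fisher vectors into one video-level FV,
    scaling each motion part by its discriminativeness weight."""
    fv = sum(w * g for w, g in zip(group_weights, group_fvs))
    fv = np.sign(fv) * np.sqrt(np.abs(fv))  # signed square-root
    return fv / (np.linalg.norm(fv) + eps)  # L2 normalization
```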
MoFAP: A Multi-level Representation for Action Recognition.
A Joint Evaluation of Dictionary Learning and Feature Encoding for Action Recognition.
Many mid-level representations have been developed to replace the traditional bag-of-words model (VQ with k-means) in the image domain, such as sparse coding, OMP-k with k-SVD, and the Fisher vector with a GMM. These approaches can be split into a dictionary learning phase and a feature encoding phase, which are often closely related. In this paper, we jointly evaluate the effect of these two phases for video-based action recognition. Specifically, we compare several dictionary learning methods and feature encoding schemes through extensive experiments on the KTH and HMDB51 datasets. Experimental results indicate that the Fisher vector performs consistently better than the other encoding methods, and that sparse coding is robust to different dictionaries, even ones with random weights. In addition, we observe that the advantages of sophisticated mid-level representations come not from their specific dictionaries but from their encoding mechanisms, and randomly selected exemplars can serve as dictionaries for most encoding methods. Finally, we achieve state-of-the-art results on HMDB51 and UCF101 by combining our configurations with improved dense trajectory features.
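The random-exemplar observation is simple to act on: skip dictionary learning entirely and sample the codebook from the training descriptors. A minimal sketch follows, paired here with hard-assignment VQ for brevity; per the abstract, the robustness finding applies mainly to encodings such as sparse coding. Names are mine, not from the paper.

```python
import numpy as np

def random_exemplar_dictionary(descriptors, k, seed=0):
    """Sample k descriptors at random as the codebook, instead of
    running k-means, k-SVD, or GMM fitting."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(descriptors), size=k, replace=False)
    return descriptors[idx]  # (k, d)

def vq_encode(descriptors, codebook):
    """Hard-assignment (VQ) histogram over the codebook."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(codebook))
    return hist / hist.sum()
```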