Results 1 - 6 of 6
Action recognition with trajectory-pooled deep-convolutional descriptors
In CVPR, 2015
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Abstract
-
Cited by 8 (5 self)
- Add to MetaCart
(Show Context)
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called the trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features are twofold: (i) TDDs are automatically learned and have higher discriminative capacity than hand-crafted features; (ii) TDDs take into account the intrinsic characteristics of the temporal dimension and introduce trajectory-constrained sampling and pooling strategies for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves performance superior to the state of the art on these datasets.
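As a concrete illustration, here is a minimal NumPy sketch of the two normalizations and of trajectory-constrained pooling as the abstract describes them. The (T, H, W, C) feature-map layout, the max-based normalization, and the sum-pooling along trajectory points are assumptions drawn from the abstract's wording, not the authors' released code.

```python
import numpy as np

def spatiotemporal_norm(fmap, eps=1e-8):
    """fmap: (T, H, W, C) convolutional feature maps for one video.
    Divide each channel by its maximum response over all spatiotemporal
    positions, so every channel's activations share a common scale."""
    m = fmap.max(axis=(0, 1, 2), keepdims=True)   # shape (1, 1, 1, C)
    return fmap / (m + eps)

def channel_norm(fmap, eps=1e-8):
    """Divide each position's C-dimensional activation vector by its
    maximum over channels."""
    m = fmap.max(axis=3, keepdims=True)           # shape (T, H, W, 1)
    return fmap / (m + eps)

def trajectory_pool(fmap, traj):
    """Trajectory-constrained pooling: sum-pool the normalized feature
    vectors at a trajectory's (t, y, x) points (assumed already mapped
    into feature-map coordinates). Returns a C-dimensional descriptor."""
    return np.sum([fmap[t, y, x] for (t, y, x) in traj], axis=0)
```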
Submodular Attribute Selection for Action Recognition in Video
In NIPS, 2014
"... In real-world action recognition problems, low-level features cannot adequately characterize the rich spatial-temporal structures in action videos. In this work, we encode actions based on attributes that describes actions as high-level con-cepts e.g., jump forward or motion in the air. We base our ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
In real-world action recognition problems, low-level features cannot adequately characterize the rich spatial-temporal structures in action videos. In this work, we encode actions based on attributes that describe actions as high-level concepts, e.g., jump forward or motion in the air. We base our analysis on two types of action attributes. The first type is generated by humans. The second type is data-driven attributes, which are learned from data using dictionary learning methods. Attribute-based representations may exhibit high variance due to noisy and redundant attributes. We propose a discriminative and compact attribute-based representation by selecting a subset of discriminative attributes from a large attribute set. Three attribute selection criteria are proposed and formulated as a submodular optimization problem. A greedy optimization algorithm is presented and is guaranteed to be at least a (1 - 1/e)-approximation to the optimum. Experimental results on the Olympic Sports and UCF101 datasets demonstrate that the proposed attribute-based representation can significantly boost the performance of action recognition algorithms and outperform most recently proposed recognition approaches.
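The greedy algorithm with the (1 - 1/e) guarantee is the classic one for monotone submodular maximization under a cardinality constraint (Nemhauser et al.). Below is a generic Python sketch; the coverage-style objective is a hypothetical stand-in, since the paper's three actual selection criteria are not spelled out in the abstract.

```python
import numpy as np

def greedy_select(ground_set, f, k):
    """Greedily pick up to k elements by marginal gain f(S + {e}) - f(S).
    For monotone submodular f this attains at least (1 - 1/e) of the
    optimal value for the cardinality-constrained problem."""
    selected = []
    for _ in range(k):
        gains = [(f(selected + [e]) - f(selected), e)
                 for e in ground_set if e not in selected]
        gain, best = max(gains)
        if gain <= 0:        # no element still improves the objective
            break
        selected.append(best)
    return selected

# Hypothetical example objective: how well a subset of attribute detectors
# "covers" the videos, measured by the best per-video detector response.
rng = np.random.default_rng(0)
resp = rng.random((50, 20))   # resp[v, a]: response of attribute a on video v

def coverage(S):
    return resp[:, S].max(axis=1).sum() if S else 0.0

print(greedy_select(list(range(20)), coverage, k=5))
```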
Can Humans Fly? Action Understanding with Multiple Classes of Actors
"... Can humans fly? Emphatically no. Can cars eat? Again, absolutely not. Yet, these absurd inferences result from the current disregard for particular types of actors in action understanding. There is no work we know of on simulta-neously inferring actors and actions in the video, not to mention a data ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Can humans fly? Emphatically no. Can cars eat? Again, absolutely not. Yet these absurd inferences result from the current disregard for particular types of actors in action understanding. We know of no work on simultaneously inferring actors and actions in video, let alone a dataset to experiment with. Our paper hence marks the first effort in the computer vision community to jointly consider various types of actors undergoing various actions. To start on the problem, we collect a dataset of 3782 videos from YouTube and label both pixel-level actors and actions in each video. We formulate the general actor-action understanding problem and instantiate it at various granularities: video-level single- and multiple-label actor-action recognition, and pixel-level actor-action semantic segmentation. Our experiments demonstrate that joint inference over actors and actions outperforms independent inference over each, supporting our argument for the value of explicitly considering various actors in comprehensive action understanding.
Motion part regularization: improving action recognition via trajectory group selection
In CVPR, 2015
"... Dense local trajectories have been successfully used in action recognition. However, for most actions only a few local motion features (e.g., critical movement of hand, arm, leg etc.) are responsible for the action label. Therefore, highlighting the local features which are associated with important ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Dense local trajectories have been used successfully in action recognition. However, for most actions only a few local motion features (e.g., critical movements of the hand, arm, or leg) are responsible for the action label. Therefore, highlighting the local features associated with important motion parts leads to a more discriminative action representation. Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework that mines discriminative groups of dense trajectories which form important motion parts. First, motion part candidates are generated by spatio-temporal grouping of densely extracted trajectories. Second, an objective function that encourages sparse selection of these trajectory groups is formulated together with an action-class discriminative term. We then propose an alternating optimization algorithm to efficiently solve this objective by introducing a set of auxiliary variables that correspond to the discriminativeness weights of each motion part (trajectory group). These learned motion part weights are further used to form a discriminativeness-weighted Fisher vector representation of each action sample for final classification. The proposed motion part regularization framework achieves state-of-the-art performance on several action recognition benchmarks.
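The final aggregation step the abstract describes can be sketched as follows. The per-part Fisher vectors and the learned discriminativeness weights are taken as given, and the power- and L2-normalization at the end is standard Fisher vector practice assumed here rather than stated in the abstract.

```python
import numpy as np

def weighted_fisher_vector(part_fvs, part_weights, eps=1e-8):
    """part_fvs: (P, D) array, one Fisher vector per motion part
    (trajectory group). part_weights: (P,) learned discriminativeness
    weights. Returns a single D-dimensional video descriptor."""
    w = np.asarray(part_weights, dtype=float)
    w = w / (w.sum() + eps)                       # normalize part weights
    fv = (w[:, None] * np.asarray(part_fvs)).sum(axis=0)
    # Standard Fisher-vector post-processing: signed square-root
    # (power normalization) followed by L2 normalization.
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + eps)
```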
A Two-Layer Representation For Large-Scale Action Recognition
"... With the improved accessibility to an exploding amoun-t of realistic video data and growing demands in many video analysis applications, video-based large-scale action recognition is becoming an increasingly important task in computer vision. In this notebook paper, we give a brief description on ou ..."
Abstract
- Add to MetaCart
(Show Context)
With the improved accessibility of an exploding amount of realistic video data and growing demands in many video analysis applications, video-based large-scale action recognition is becoming an increasingly important task in computer vision. In this notebook paper, we give a brief description of our method for large-scale human action recognition in realistic videos. This method was originally proposed in our ICCV 2013 paper ("Action Recognition with Actons"), which presented a two-layer structure for action recognition that automatically exploits a mid-level "acton" representation. The weakly supervised actons are learned via a max-margin multi-channel multiple instance learning algorithm (called M4IL), which can capture multiple mid-level action concepts simultaneously for producing …
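For intuition about the max-margin multiple instance learning ingredient, here is a basic mi-SVM-style alternating heuristic in Python. This is not the paper's M4IL (which is multi-channel and learns several actons jointly); it only shows the generic pattern of alternating between instance-label assignment in positive bags and max-margin training.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(bags, bag_labels, n_iters=10, C=1.0):
    """bags: list of (n_i, d) instance arrays; bag_labels: +1/-1 per bag.
    Classic mi-SVM heuristic: instances in negative bags stay negative;
    instance labels in positive bags are re-inferred from the margin."""
    X = np.vstack(bags)
    # Start by giving every instance its bag's label.
    y = np.concatenate([np.full(len(b), lab)
                        for b, lab in zip(bags, bag_labels)])
    clf = LinearSVC(C=C)
    for _ in range(n_iters):   # a change-based stopping test would also work
        clf.fit(X, y)
        scores = clf.decision_function(X)
        start = 0
        for b, lab in zip(bags, bag_labels):
            s = scores[start:start + len(b)]
            if lab > 0:
                # Relabel by the current margin, keeping at least the
                # top-scoring instance positive so the bag stays positive.
                yb = np.where(s > 0, 1, -1)
                yb[np.argmax(s)] = 1
                y[start:start + len(b)] = yb
            start += len(b)
    return clf
```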
Part Bricolage: Flow-Assisted Part-Based Graphs for Detecting Activities in Videos
"... Abstract. Space-time detection of human activities in videos can significantly enhance visual search. To handle such tasks, while solely using low-level fea-tures has been found somewhat insufficient for complex datasets; mid-level fea-tures (like body parts) that are normally considered, are not ro ..."
Abstract
- Add to MetaCart
(Show Context)
Space-time detection of human activities in videos can significantly enhance visual search. For such tasks, using low-level features alone has proven insufficient on complex datasets, while the mid-level features normally considered (such as body parts) are not robustly accounted for given their inaccuracy. Moreover, existing activity detection mechanisms do not constructively exploit the importance and trustworthiness of the features. This paper addresses these problems and introduces a unified formulation for robustly detecting activities in videos. Our first contribution is the formulation of the detection task as an undirected node- and edge-weighted graphical structure called Part Bricolage (PB), where node weights represent the type of features along with their importance, and edge weights incorporate the probability of the features belonging to a known activity class while also accounting for the trustworthiness of the features connected by the edge. The Prize-Collecting Steiner Tree (PCST) problem [19] is solved on this graph, yielding the best connected subgraph comprising the activity of interest. Our second contribution is a novel technique for robust body part estimation, which uses two types of state-of-the-art pose detectors and resolves plausible detection ambiguities with pre-trained classifiers that predict the trustworthiness of the pose detectors. Our third contribution is the fusion of low-level descriptors with mid-level ones while maintaining the spatial structure between the features. For a quantitative evaluation of the detection power of PB, we run PB on the Hollywood and MSR-Actions datasets and outperform the state of the art by a significant margin across various detection paradigms.
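To make the PCST objective concrete, the toy sketch below brute-forces the best connected subgraph of a tiny node- and edge-weighted graph, maximizing collected node prizes minus tree edge costs. The prizes and costs are made-up stand-ins for the paper's feature importances and class/trustworthiness weights; real instances require a dedicated PCST solver, as exhaustive search is only feasible on tiny graphs.

```python
import itertools
import networkx as nx

# Toy graph: nodes carry prizes (feature importance), edges carry costs.
G = nx.Graph()
G.add_weighted_edges_from([(0, 1, 0.2), (1, 2, 0.5), (2, 3, 0.3),
                           (0, 3, 0.9), (3, 4, 0.4)])
prize = {0: 0.1, 1: 0.8, 2: 0.7, 3: 0.6, 4: 0.1}

best_score, best_nodes = float("-inf"), None
for r in range(1, len(G) + 1):
    for S in itertools.combinations(G.nodes, r):
        H = G.subgraph(S)
        if not nx.is_connected(H):
            continue                      # PCST requires a connected solution
        # Cheapest tree spanning this node set: MST of the induced subgraph.
        tree_cost = sum(d["weight"] for _, _, d in
                        nx.minimum_spanning_tree(H).edges(data=True))
        score = sum(prize[v] for v in S) - tree_cost
        if score > best_score:
            best_score, best_nodes = score, S

print(best_nodes, best_score)
```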