Results 1 - 8 of 8
Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice, 2014
"... ar ..."
(Show Context)
Action recognition with trajectory-pooled deep-convolutional descriptors. In CVPR, 2015
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Abstract - Cited by 8 (5 self)
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features are twofold: (i) TDDs are automatically learned and have higher discriminative capacity than hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce trajectory-constrained sampling and pooling strategies for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets.
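The two normalization schemes named in the abstract are concrete enough to sketch. Below is a minimal NumPy illustration of what spatiotemporal and channel normalization of a convolutional feature map could look like; the (T, H, W, C) layout, the function names, and the use of per-channel and per-position maxima are assumptions for illustration, not the authors' code.

```python
import numpy as np

def spatiotemporal_normalize(fmap, eps=1e-8):
    """Divide each channel by its maximum over the whole video volume,
    so every channel's responses share a comparable scale.
    fmap: array of shape (T, H, W, C) -- time, height, width, channels."""
    max_per_channel = fmap.max(axis=(0, 1, 2), keepdims=True)
    return fmap / (max_per_channel + eps)

def channel_normalize(fmap, eps=1e-8):
    """Divide each spatio-temporal position by its maximum across channels,
    balancing the relative channel responses at every location."""
    max_per_position = fmap.max(axis=3, keepdims=True)
    return fmap / (max_per_position + eps)

# toy feature map: 10 frames, 14x14 spatial grid, 64 channels
fmap = np.random.rand(10, 14, 14, 64).astype(np.float32)
st = spatiotemporal_normalize(fmap)
ch = channel_normalize(fmap)
```

Trajectory-constrained pooling would then sample these normalized maps along each tracked trajectory and aggregate the sampled responses into one descriptor per trajectory.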
The Influence of Temporal Information on Human Action Recognition with Large Number of Classes
"... Abstract—Human action recognition from video input has seen much interest over the last decade. In recent years, the trend is clearly towards action recognition in real-world, unconstrained conditions (i.e. not acted) with an ever growing number of action classes. Much of the work so far has used si ..."
Abstract - Cited by 2 (1 self)
Human action recognition from video input has seen much interest over the last decade. In recent years, the trend is clearly towards action recognition in real-world, unconstrained conditions (i.e. not acted) with an ever growing number of action classes. Much of the work so far has used single frames or sequences of frames where each frame was treated individually. This paper investigates the contribution that temporal information can make to human action recognition in the context of a large number of action classes. The key contributions are: (i) we propose a complementary information channel to the Bag-of-Words framework that models the temporal occurrence of the local information in videos; (ii) we investigate the influence of sensible local information, whose temporal occurrence is more vital than the local information itself. The experimental validation on action recognition datasets with the largest number of classes to date shows the effectiveness of the proposed approach.
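As a rough illustration of what a temporal-occurrence channel complementing Bag-of-Words could look like, the sketch below bins each codeword's occurrences over normalized video time and concatenates the per-codeword histograms. The binning scheme and normalization are assumptions for illustration, not the paper's exact construction.

```python
import numpy as np

def temporal_occurrence_histogram(word_ids, timestamps, vocab_size, n_bins=8):
    """Complementary BoW channel: for each visual word, a histogram of
    *when* (in normalized video time) that word occurs.
    word_ids:   (N,) codeword index of each local feature
    timestamps: (N,) frame index of each local feature
    Returns a (vocab_size * n_bins,) vector."""
    t = np.asarray(timestamps, dtype=np.float64)
    t = (t - t.min()) / max(t.max() - t.min(), 1e-8)   # normalize to [0, 1]
    bins = np.minimum((t * n_bins).astype(int), n_bins - 1)
    hist = np.zeros((vocab_size, n_bins))
    np.add.at(hist, (word_ids, bins), 1.0)             # count occurrences
    hist /= max(hist.sum(), 1.0)                       # L1-normalize
    return hist.ravel()

# toy example: 500 local features, 100-word vocabulary, 300-frame video
rng = np.random.default_rng(0)
vec = temporal_occurrence_histogram(rng.integers(0, 100, 500),
                                    rng.integers(0, 300, 500), vocab_size=100)
```

Unlike a plain BoW histogram, this vector distinguishes two videos that contain the same codewords in a different temporal order.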
Group Membership Prediction
"... The group membership prediction (GMP) problem in-volves predicting whether or not a collection of instances share a certain semantic property. For instance, in kinship verification given a collection of images, the goal is to pre-dict whether or not they share a familial relationship. In this contex ..."
Abstract - Cited by 1 (1 self)
The group membership prediction (GMP) problem involves predicting whether or not a collection of instances share a certain semantic property. For instance, in kinship verification, given a collection of images, the goal is to predict whether or not they share a familial relationship. In this context we propose a novel probability model and introduce latent view-specific and view-shared random variables to jointly account for the view-specific appearance and cross-view similarities among data instances. Our model posits that data from each view is independent conditioned on the shared variables. This postulate leads to a parametric probability model that decomposes group membership likelihood into a tensor product of data-independent parameters and data-dependent factors. We propose learning the data-independent parameters in a discriminative way with bilinear classifiers, and test our prediction algorithm on challenging visual recognition tasks such as multi-camera person re-identification and kinship verification. On most benchmark datasets, our method can significantly outperform the current state of the art.
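The bilinear-classifier idea in this abstract can be sketched at its simplest: a data-independent parameter matrix scores pairwise compatibility between instances from two views, and the collection-level score is pooled over pairs. The snippet below is a heavily simplified illustration under assumed feature shapes; the paper's actual probability model, with latent view-specific and view-shared variables, is considerably richer.

```python
import numpy as np

def bilinear_group_score(X, Y, W):
    """Hypothetical bilinear compatibility score between instances from
    two views (e.g. two camera views, or two face images in kinship
    verification).
    X: (n, d1) features from view 1;  Y: (m, d2) features from view 2
    W: (d1, d2) data-independent parameter matrix, assumed to be learned
       discriminatively.
    Returns the mean pairwise bilinear score x^T W y over the collection."""
    scores = X @ W @ Y.T          # (n, m) pairwise bilinear scores
    return scores.mean()          # pool over all cross-view pairs

rng = np.random.default_rng(1)
X, Y = rng.normal(size=(3, 64)), rng.normal(size=(4, 64))
W = rng.normal(size=(64, 64)) * 0.01
same_group = bilinear_group_score(X, Y, W) > 0.0   # threshold is illustrative
```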
MoFAP: A Multi-level Representation for Action Recognition
"... All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately. ..."
Abstract
- Add to MetaCart
All in-text references underlined in blue are linked to publications on ResearchGate, letting you access and read them immediately.
Human Action Recognition in Unconstrained Videos by Explicit Motion Modeling
"... Abstract — Human action recognition in unconstrained videos is a challenging problem with many applications. Most state-of-the-art approaches adopted the well-known bag-of-features representations, generated based on isolated local patches or patch trajectories, where motion patterns, such as object ..."
Abstract
Human action recognition in unconstrained videos is a challenging problem with many applications. Most state-of-the-art approaches adopt the well-known bag-of-features representations, generated from isolated local patches or patch trajectories, where motion patterns such as object–object and object–background relationships are mostly discarded. In this paper, we propose a simple representation aimed at modeling these motion relationships. We adopt global and local reference points to explicitly characterize motion information, so that the final representation is more robust to camera movements, which widely exist in unconstrained videos. Our approach operates on top of visual codewords generated on dense local patch trajectories, and therefore does not require foreground–background separation, which is normally a critical and difficult step in modeling object relationships. Through an extensive set of experimental evaluations, we show that the proposed representation produces very competitive performance on several challenging benchmark datasets. Further combining it with the standard bag-of-features or Fisher vector representations can lead to substantial improvements.
Index Terms: Human action recognition, trajectory, motion representation, reference points, camera motion.
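The core idea of describing motion relative to a reference point can be illustrated in a few lines. The sketch below subtracts a global reference motion, taken here as the mean displacement over all trajectories, as a crude camera-motion proxy, from each trajectory's displacements; the paper's actual representation is built on visual codewords and also uses local reference points, so the input shapes and the choice of reference are assumed simplifications.

```python
import numpy as np

def relative_motion(trajectories):
    """Illustrative global-reference-point idea: subtract the mean
    displacement over all trajectories (a crude camera-motion proxy)
    from each trajectory's displacements, yielding motion descriptors
    that are more robust to camera movement.
    trajectories: (N, L, 2) array -- N trajectories of L (x, y) points."""
    disp = np.diff(trajectories, axis=1)           # (N, L-1, 2) displacements
    global_ref = disp.mean(axis=0, keepdims=True)  # shared/global motion
    return disp - global_ref                       # motion relative to reference

rng = np.random.default_rng(2)
trajs = rng.normal(size=(100, 16, 2)).cumsum(axis=1)  # toy random-walk tracks
rel = relative_motion(trajs)                          # (100, 15, 2)
```

Because the shared component is removed, a uniform camera pan contributes nothing to the relative descriptors, which is the robustness property the abstract emphasizes.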
Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics
"... Abstract. Recent studies show that aggregating local descriptors into super vector yields effective representation for retrieval and classifica-tion tasks. A popular method along this line is vector of locally aggre-gated descriptors (VLAD), which aggregates the residuals between de-scriptors and vi ..."
Abstract
Recent studies show that aggregating local descriptors into a super vector yields an effective representation for retrieval and classification tasks. A popular method along this line is the vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and visual words. However, original VLAD ignores high-order statistics of local descriptors, and its dictionary may not be optimal for classification tasks. In this paper, we address these problems by utilizing high-order statistics of local descriptors and performing supervised dictionary learning. The main contributions are twofold. Firstly, we propose a high-order VLAD (H-VLAD) for visual recognition, which leverages two kinds of high-order statistics in the VLAD-like framework, namely diagonal covariance and skewness. These high-order statistics provide complementary information for VLAD and allow for efficient computation. Secondly, to further boost the performance of H-VLAD, we design a supervised dictionary learning algorithm to discriminatively refine the dictionary, which can also be extended to other super-vector-based encoding methods. We examine the effectiveness of our methods on image-based object categorization and video-based action recognition. Extensive experiments on the PASCAL VOC 2007, HMDB51, and UCF101 datasets show that our method achieves state-of-the-art performance on both tasks.
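The H-VLAD construction is concrete enough for a sketch: alongside standard VLAD residuals, it accumulates per-word second-order (diagonal covariance) and third-order (skewness) statistics. The snippet below illustrates that idea under assumed shapes and normalizations; the paper's exact formulation and its supervised dictionary learning step are not reproduced here.

```python
import numpy as np

def h_vlad(X, centers):
    """Sketch of VLAD extended with per-word high-order statistics.
    For each descriptor assigned to its nearest visual word, accumulate
    (1) residuals (standard VLAD), (2) squared residuals (diagonal
    covariance) and (3) cubed residuals (skewness), then concatenate.
    X: (N, d) local descriptors;  centers: (K, d) visual words."""
    K, d = centers.shape
    # hard-assign each descriptor to its nearest visual word
    assign = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
    v1 = np.zeros((K, d)); v2 = np.zeros((K, d)); v3 = np.zeros((K, d))
    for k in range(K):
        r = X[assign == k] - centers[k]          # residuals for word k
        if len(r):
            v1[k], v2[k], v3[k] = r.sum(0), (r ** 2).sum(0), (r ** 3).sum(0)
    v = np.concatenate([v1, v2, v3], axis=1).ravel()
    v = np.sign(v) * np.sqrt(np.abs(v))          # signed power normalization
    return v / max(np.linalg.norm(v), 1e-12)     # L2 normalization

rng = np.random.default_rng(3)
desc = h_vlad(rng.normal(size=(1000, 32)), rng.normal(size=(16, 32)))
# resulting vector: K * 3d = 16 * 96 = 1536 dimensions
```

The first block of the output reduces to ordinary VLAD; the second and third blocks are the complementary high-order statistics the abstract refers to, and they cost only one extra pass over the residuals.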