Results 1 - 10 of 234
Action Recognition by Dense Trajectories
, 2011
"... Feature trajectories have shown to be efficient for rep-resenting videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as quantity of these trajecto-ries is often not sufficient. Inspired by the recent success of dense ..."
Abstract
-
Cited by 293 (16 self)
- Add to MetaCart
Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, the quality as well as the quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sample dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well. We also investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular on uncontrolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improvement over the state of the art on four datasets of varying difficulty, i.e. KTH, YouTube, Hollywood2 and UCF Sports.
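As a rough illustration of the tracking step the abstract describes, the sketch below samples a dense grid of points and advects each point through a dense optical flow field; OpenCV's Farneback flow stands in for the state-of-the-art flow algorithm the authors use. The grid stride, track length, and the omission of the paper's median filtering of the flow, trajectory pruning, resampling, and descriptor computation are all simplifications, not the authors' implementation.

```python
# Minimal sketch of dense-trajectory tracking: sample a dense grid of
# points and advect them through a dense optical flow field.
import cv2
import numpy as np

def track_dense_trajectories(frames, stride=5, track_len=15):
    """frames: list of grayscale uint8 images. Returns finished tracks."""
    h, w = frames[0].shape
    # Dense grid of starting points, every `stride` pixels.
    ys, xs = np.mgrid[stride // 2:h:stride, stride // 2:w:stride]
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]
    finished = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(
            prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)
        alive = []
        for tr in tracks:
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if not (0 <= xi < w and 0 <= yi < h):
                continue  # point drifted out of the frame
            dx, dy = flow[yi, xi]  # displacement at the point's location
            tr.append((x + dx, y + dy))
            # tracks that reach the target length stop being extended
            (finished if len(tr) >= track_len else alive).append(tr)
        tracks = alive
    return finished
```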
Evaluation of local spatio-temporal features for action recognition
- University of Central Florida, U.S.A
, 2009
"... Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of ex ..."
Abstract
-
Cited by 274 (25 self)
- Add to MetaCart
(Show Context)
Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this paper is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
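The standard bag-of-features SVM pipeline the evaluation relies on can be sketched as follows. This is a generic scikit-learn rendition, not the authors' code: the vocabulary size, the chi-squared kernel scaling, and the assumption of precomputed local descriptors (since detector/descriptor choices vary across the compared methods) are all illustrative.

```python
# Bag-of-features SVM sketch: quantize local descriptors against a
# k-means vocabulary, build per-video histograms, classify with an SVM.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def build_vocabulary(train_descriptors, k=4000):
    """train_descriptors: (N, D) array pooled over training videos."""
    return KMeans(n_clusters=k, n_init=1).fit(train_descriptors)

def bof_histogram(vocab, descriptors):
    words = vocab.predict(descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)  # L1-normalize

def chi2_kernel(A, B, eps=1e-10):
    """Chi-squared kernel, a common choice in this literature."""
    d = A[:, None, :] - B[None, :, :]
    s = A[:, None, :] + B[None, :, :] + eps
    return np.exp(-0.5 * (d * d / s).sum(-1))

def train(histograms, labels):
    # precomputed Gram matrix; at test time, pass
    # chi2_kernel(test_hists, train_hists) to clf.predict
    K = chi2_kernel(histograms, histograms)
    return SVC(kernel="precomputed").fit(K, labels)
```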
Action Bank: A High-Level Representation of Activity in Video
"... Activity recognition in video is dominated by low- and mid-level features, and while demonstrably capable, by nature, these features carry little semantic meaning. Inspired by the recent object bank approach to image representation, we present Action Bank, a new high-level representation of video. A ..."
Abstract
-
Cited by 170 (8 self)
- Add to MetaCart
(Show Context)
Activity recognition in video is dominated by low- and mid-level features, and while demonstrably capable, by nature these features carry little semantic meaning. Inspired by the recent object bank approach to image representation, we present Action Bank, a new high-level representation of video. Action Bank is composed of many individual action detectors sampled broadly in semantic space as well as viewpoint space. Our representation is constructed to be semantically rich and, even when paired with simple linear SVM classifiers, is capable of highly discriminative performance. We have tested Action Bank on four major activity recognition benchmarks. In all cases, our performance is significantly better than the state of the art, namely 98.2% on KTH (better by 3.3%), 95.0% on UCF Sports (better by …)
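A minimal sketch of the representation, assuming `detectors` is a list of callables that each return a 3D response volume per video; the octree-style max pooling below is a simplified stand-in for the paper's volumetric pooling, and all parameters are illustrative rather than the authors' settings.

```python
# Action Bank sketch: run a bank of action detectors over a video, pool
# each detector's response volume, and concatenate the pooled values
# into one high-level feature vector for a linear SVM.
import numpy as np
from sklearn.svm import LinearSVC

def pool_volume(resp, levels=3):
    """Max-pool a (T, H, W) response volume over an octree-style grid."""
    feats = []
    for lv in range(levels):
        n = 2 ** lv  # split each axis into n cells at this level
        t_idx = np.array_split(np.arange(resp.shape[0]), n)
        y_idx = np.array_split(np.arange(resp.shape[1]), n)
        x_idx = np.array_split(np.arange(resp.shape[2]), n)
        for ts in t_idx:
            for ys in y_idx:
                for xs in x_idx:
                    feats.append(resp[np.ix_(ts, ys, xs)].max())
    return np.array(feats)

def action_bank_vector(video, detectors):
    return np.concatenate([pool_volume(d(video)) for d in detectors])

# The features are semantically rich, so a simple linear SVM suffices.
clf = LinearSVC(C=1.0)
```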
Appendix: Learning Hierarchical Invariant Spatio-Temporal Features for Action Recognition with Independent Subspace Analysis
, 2011
"... In the experiment section of the paper, we present the performance of the convolutional ISA model on 4 human action recognition datasets. Due to the space limitations in our paper, we clarify the setup, parameters, and evaluation metric of experiments in this Appendix. For all the datasets, we use t ..."
Abstract
-
Cited by 149 (5 self)
- Add to MetaCart
(Show Context)
In the experiment section of the paper, we present the performance of the convolutional ISA model on four human action recognition datasets. Due to space limitations in the paper, we clarify the setup, parameters, and evaluation metrics of the experiments in this Appendix. For all the datasets, we use the same pipeline as Wang et al. [5]. This pipeline first extracts features …
Learning a hierarchy of discriminative space-time neighborhood features for human action recognition
- In IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
, 2010
"... Recent work shows how to use local spatio-temporal features to learn models of realistic human actions from video. However, existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure “bag-of-words ” model, and thus may fail to ..."
Abstract
-
Cited by 144 (0 self)
- Add to MetaCart
(Show Context)
Recent work shows how to use local spatio-temporal features to learn models of realistic human actions from video. However, existing methods typically rely on a predefined spatial binning of the local descriptors to impose spatial information beyond a pure “bag-of-words” model, and thus may fail to capture the most informative space-time relationships. We propose to learn the shapes of space-time feature neighborhoods that are most discriminative for a given action category. Given a set of training videos, our method first extracts local motion and appearance features, quantizes them to a visual vocabulary, and then forms candidate neighborhoods consisting of the words associated with nearby points and their orientation with respect to the central interest point. Rather than dictate a particular scaling of the spatial and temporal dimensions to determine which points are near, we show how to learn the class-specific distance functions that form the most informative configurations. Descriptors for these variable-sized neighborhoods are then recursively mapped to higher-level vocabularies, producing a hierarchy of space-time configurations at successively broader scales. Our approach yields state-of-the-art performance on the UCF Sports and KTH datasets.
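To make the neighborhood-formation step concrete, the sketch below collects, for each interest point, the visual words of its nearest neighbors under a weighted space-time distance. In the paper these axis weightings are learned per class; here `spatial_w` and `temporal_w` are fixed placeholders, and the recursive re-quantization of neighborhood descriptors into higher-level vocabularies is omitted.

```python
# Candidate space-time neighborhoods: quantize local features to visual
# words, then gather each point's k nearest neighbors' words under a
# scaled space-time distance.
import numpy as np

def neighborhoods(points, words, k=8, spatial_w=1.0, temporal_w=1.0):
    """points: (N, 3) array of (x, y, t); words: (N,) visual-word ids.
    Returns, per point, the multiset of words in its neighborhood."""
    scale = np.array([spatial_w, spatial_w, temporal_w])
    scaled = points * scale
    out = []
    for i in range(len(points)):
        d = np.linalg.norm(scaled - scaled[i], axis=1)
        nbrs = np.argsort(d)[1:k + 1]  # skip the point itself
        out.append(words[nbrs])
    return out
```

In the full method, descriptors built from these neighborhoods would themselves be quantized and the process repeated to build the hierarchy of configurations.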
Action Recognition with Improved Trajectories
- In IEEE International Conference on Computer Vision (ICCV)
, 2013
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Abstract
-
Cited by 109 (11 self)
- Add to MetaCart
(Show Context)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte pluridisciplinaire HAL, est destinée au dépôt et a ̀ la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d’enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.
A Survey of Vision-Based Methods for Action Representation, Segmentation and Recognition
, 2011
"... ..."
Surface Feature Detection and Description with Applications to Mesh Matching
, 2009
"... In this paper we revisit local feature detectors/descriptors developed for 2D images and extend them to the more general framework of scalar fields defined on 2D manifolds. We provide methods and tools to detect and describe features on surfaces equiped with scalar functions, such as photometric inf ..."
Abstract
-
Cited by 89 (0 self)
- Add to MetaCart
In this paper we revisit local feature detectors/descriptors developed for 2D images and extend them to the more general framework of scalar fields defined on 2D manifolds. We provide methods and tools to detect and describe features on surfaces equipped with scalar functions, such as photometric information. This is motivated by the growing need for matching and tracking photometric surfaces over temporal sequences, due to recent advancements in multiple-camera 3D reconstruction. We propose a 3D feature detector (MeshDOG) and a 3D feature descriptor (MeshHOG) for uniformly triangulated meshes, invariant to changes in rotation, translation, and scale. The descriptor is able to capture the local geometric and/or photometric properties in a succinct fashion. Moreover, the method is defined generically for any scalar function, e.g., local curvature. Results with matching rigid and non-rigid meshes demonstrate the usefulness of the proposed framework.
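A rough, heavily simplified sketch of a MeshDOG-style detector on a triangulated mesh: smooth a per-vertex scalar function by one-ring averaging, difference successive smoothing levels, and keep vertices that are extrema over their one-ring. Scale selection, thresholding, and the corner-ness test of the actual detector are omitted, and the adjacency-list mesh representation is an assumption.

```python
# MeshDOG-style keypoint sketch: difference-of-Gaussians of a scalar
# field over a mesh, approximated by repeated one-ring averaging.
import numpy as np

def one_ring_smooth(f, adj):
    """f: (V,) scalar per vertex; adj: list of neighbor-index arrays."""
    return np.array([(f[i] + f[adj[i]].sum()) / (1 + len(adj[i]))
                     if len(adj[i]) else f[i] for i in range(len(f))])

def mesh_dog_keypoints(f, adjacency, octaves=3):
    adj = [np.array(a, dtype=int) for a in adjacency]
    levels = [np.asarray(f, dtype=float)]
    for _ in range(octaves):
        levels.append(one_ring_smooth(levels[-1], adj))
    keypoints = set()
    for lo, hi in zip(levels[:-1], levels[1:]):
        dog = hi - lo  # difference of smoothing levels
        for v, nb in enumerate(adj):
            # keep strict extrema with respect to the one-ring
            if len(nb) and (dog[v] > dog[nb].max() or dog[v] < dog[nb].min()):
                keypoints.add(v)
    return sorted(keypoints)
```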
Learning latent temporal structure for complex event detection
- In CVPR
, 2012
"... In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interestin ..."
Abstract
-
Cited by 75 (2 self)
- Add to MetaCart
(Show Context)
In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
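The fast, exact inference the authors emphasize is a Viterbi-style recursion for a variable-duration (semi-Markov) HMM, where a state emits a whole segment and its duration is scored explicitly. A generic sketch is below; the `emit` and `dur` log-score callables and the `trans` array are assumptions standing in for the model's learned max-margin scores.

```python
# Semi-Markov Viterbi: O(T * S^2 * max_dur) exact dynamic program.
import numpy as np

def semi_markov_viterbi(T, S, emit, dur, trans, max_dur):
    """T frames, S states; emit(s, i, j) scores segment [i, j) under
    state s, dur(s, d) scores its duration, trans is an (S, S) array of
    log transition scores. Returns (start, end, state) triples."""
    best = np.full((T + 1, S), -np.inf)
    back = {}
    for t in range(1, T + 1):
        for s in range(S):
            for d in range(1, min(max_dur, t) + 1):
                seg = emit(s, t - d, t) + dur(s, d)
                if t == d:  # first segment: no incoming transition
                    cand, prev = seg, None
                else:
                    p = int(np.argmax(best[t - d] + trans[:, s]))
                    cand, prev = best[t - d, p] + trans[p, s] + seg, p
                if cand > best[t, s]:
                    best[t, s], back[(t, s)] = cand, (t - d, prev)
    # trace back from the best final state
    s = int(np.argmax(best[T]))
    t, segs = T, []
    while t > 0:
        t0, prev = back[(t, s)]
        segs.append((t0, t, s))
        t = t0
        if prev is not None:
            s = prev
    return segs[::-1]
```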
Local Trinary Patterns for Human Action Recognition
"... We present a novel action recognition method which is based on combining the effective description properties of Local Binary Patterns with the appearance invariance and adaptability of patch matching based methods. The resulting method is extremely efficient, and thus is suitable for real-time uses ..."
Abstract
-
Cited by 70 (2 self)
- Add to MetaCart
(Show Context)
We present a novel action recognition method which is based on combining the effective description properties of Local Binary Patterns with the appearance invariance and adaptability of patch-matching-based methods. The resulting method is extremely efficient, and thus is suitable for real-time simultaneous recovery of human actions of several lengths and starting points. Tested on all publicly available datasets in the literature known to us, our system repeatedly achieves state-of-the-art performance. Lastly, we present a new benchmark that focuses on uncut motion recognition in broadcast sports video.
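To illustrate the trinary encoding idea, the sketch below compares patch-SSD scores between the previous and next frames at eight spatial offsets around a pixel, emitting a digit in {-1, 0, +1} per offset. The patch radius, the offsets, and the threshold `tau` are illustrative choices; the paper's exact construction and its aggregation into histograms are not reproduced here.

```python
# Local Trinary Pattern sketch: for a pixel at time t, is the local
# patch better matched in the previous or the next frame at each of
# eight offsets? Each comparison yields one trinary digit.
import numpy as np

OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
           (0, 1), (1, -1), (1, 0), (1, 1)]

def ssd(a, b):
    d = a.astype(float) - b.astype(float)
    return (d * d).sum()

def ltp_code(prev, curr, nxt, y, x, r=1, tau=100.0):
    """Trinary code (8 digits in {-1, 0, +1}) for pixel (y, x).
    Assumes (y, x) lies at least r + 1 pixels from the image border."""
    center = curr[y - r:y + r + 1, x - r:x + r + 1]
    code = []
    for dy, dx in OFFSETS:
        yy, xx = y + dy, x + dx
        s_prev = ssd(center, prev[yy - r:yy + r + 1, xx - r:xx + r + 1])
        s_next = ssd(center, nxt[yy - r:yy + r + 1, xx - r:xx + r + 1])
        if s_prev - s_next > tau:
            code.append(1)   # clearly better match in the next frame
        elif s_next - s_prev > tau:
            code.append(-1)  # clearly better match in the previous frame
        else:
            code.append(0)   # ambiguous / static
    return code
```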