Action Recognition by Dense Trajectories
2011
"... Feature trajectories have shown to be efficient for rep-resenting videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as quantity of these trajecto-ries is often not sufficient. Inspired by the recent success of dense ..."
Abstract
-
Cited by 293 (16 self)
- Add to MetaCart
Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or by matching SIFT descriptors between frames. However, the quality as well as the quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sample dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well. We also investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular on uncontrolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improvement over the state of the art on four datasets of varying difficulty, i.e. KTH, YouTube, Hollywood2 and UCF Sports.
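
To make the mechanism concrete, here is a minimal sketch of dense-trajectory extraction: points are sampled on a regular grid and advected through a dense optical flow field, with trajectory length capped to limit drift. OpenCV's Farneback flow stands in for the state-of-the-art flow algorithm the abstract mentions; the grid spacing, track length, and function name are illustrative assumptions, and the descriptor computation (HOG/HOF/MBH) is omitted.

```python
import cv2
import numpy as np

def dense_trajectories(frames, step=5, track_len=15):
    """Sample points on a dense grid and propagate them with dense optical
    flow -- a simplified sketch of the dense-trajectory idea. The full method
    also re-samples points where coverage thins and prunes static or erratic
    tracks."""
    h, w = frames[0].shape[:2]
    # Dense grid of starting points, one every `step` pixels.
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    tracks = [[(float(x), float(y))] for x, y in zip(xs.ravel(), ys.ravel())]

    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow gives a displacement (dx, dy) for every pixel.
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        for tr in tracks:
            if len(tr) >= track_len:
                continue  # trajectories are capped in length to limit drift
            x, y = tr[-1]
            xi, yi = int(round(x)), int(round(y))
            if 0 <= xi < w and 0 <= yi < h:
                dx, dy = flow[yi, xi]
                tr.append((x + dx, y + dy))
        prev = gray
    return tracks
```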
Dense trajectories and motion boundary descriptors for action recognition
Int. J. Comput. Vision, 2013
"... ..."
Discovering discriminative action parts from mid-level video representations
In IEEE CVPR
"... We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model th ..."
Abstract
-
Cited by 61 (7 self)
- Add to MetaCart
(Show Context)
We describe a mid-level approach for action recognition. From an input video, we extract salient spatio-temporal structures by forming clusters of trajectories that serve as candidates for the parts of an action. The assembly of these clusters into an action class is governed by a graphical model that incorporates appearance and motion constraints for the individual parts and pairwise constraints for the spatio-temporal dependencies among them. During training, we estimate the model parameters discriminatively. During classification, we efficiently match the model to a video using discrete optimization. We validate the model’s classification ability on standard benchmark datasets and illustrate its potential to support a fine-grained analysis that not only gives a label to a video, but also identifies and localizes its constituent parts.
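
The full graphical model is beyond a short snippet, but the first stage, grouping trajectories into candidate parts, can be sketched as a simple clustering over trajectory position and velocity. This is a hypothetical simplification: the feature choice, the use of k-means, and the number of parts are illustrative assumptions, not the authors' formulation.

```python
import numpy as np
from sklearn.cluster import KMeans

def candidate_parts(tracks, n_parts=10):
    """Cluster point trajectories into spatio-temporal groups that can serve
    as candidate action parts. Each track is a sequence of (x, y) points;
    tracks shorter than two frames carry no motion and are skipped."""
    feats, kept = [], []
    for idx, tr in enumerate(tracks):
        pts = np.asarray(tr, dtype=float)
        if len(pts) < 2:
            continue
        vel = np.diff(pts, axis=0).mean(axis=0)  # mean displacement per frame
        feats.append(np.concatenate([pts.mean(axis=0), vel]))
        kept.append(idx)
    labels = KMeans(n_clusters=n_parts, n_init=10).fit_predict(np.asarray(feats))
    return dict(zip(kept, labels))  # track index -> candidate part id
```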
Key-segments for video object segmentation
In ICCV, 2011
"... We present an approach to discover and segment foreground object(s) in video. Given an unannotated video sequence, the method first identifies object-like regions in any frame according to both static and dynamic cues. We then compute a series of binary partitions among those candidate “key-segments ..."
Abstract
-
Cited by 60 (3 self)
- Add to MetaCart
(Show Context)
We present an approach to discover and segment foreground object(s) in video. Given an unannotated video sequence, the method first identifies object-like regions in any frame according to both static and dynamic cues. We then compute a series of binary partitions among those candidate “key-segments” to discover hypothesis groups with persistent appearance and motion. Finally, using each ranked hypothesis in turn, we estimate a pixel-level object labeling across all frames, where (a) the foreground likelihood depends on both the hypothesis’s appearance as well as a novel localization prior based on partial shape matching, and (b) the background likelihood depends on cues pulled from the key-segments’ (possibly diverse) surroundings observed across the sequence. Compared to existing methods, our approach automatically focuses on the persistent foreground regions of interest while resisting oversegmentation. We apply our method to challenging benchmark videos, and show results competitive with or better than the state of the art.
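
As a rough illustration of one plausible dynamic cue for object-like regions, the sketch below scores a candidate region by how much its average flow differs from a band of surrounding pixels. The ring width and the scoring form are assumptions for illustration, not the paper's exact cue.

```python
import cv2
import numpy as np

def motion_distinctness(region_mask, flow):
    """Score how much a candidate region moves differently from its
    immediate surroundings -- a sketch of a 'dynamic cue' for object-like
    regions, not the paper's exact formulation.
    region_mask: boolean (H, W); flow: float (H, W, 2)."""
    mask = region_mask.astype(np.uint8)
    # A band of pixels just outside the region, obtained by dilation.
    ring = cv2.dilate(mask, np.ones((15, 15), np.uint8)) - mask
    inside = flow[mask.astype(bool)].mean(axis=0)
    around = flow[ring.astype(bool)].mean(axis=0)
    return float(np.linalg.norm(inside - around))
```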
Track to the Future: Spatio-temporal Video Segmentation with Long-range Motion Cues
"... Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. We aim to capture such interactions and to construct a powerful intermediatelevel video representation for subsequent recognition. Motivated by this goa ..."
Abstract
-
Cited by 52 (2 self)
- Add to MetaCart
(Show Context)
Video provides not only rich visual cues such as motion and appearance, but also much less explored long-range temporal interactions among objects. We aim to capture such interactions and to construct a powerful intermediate-level video representation for subsequent recognition. Motivated by this goal, we seek to obtain a spatio-temporal oversegmentation of a video into regions that respect object boundaries and, at the same time, associate object pixels over many video frames. The contributions of this paper are two-fold. First, we develop an efficient spatio-temporal video segmentation algorithm, which naturally incorporates long-range motion cues from the past and future frames in the form of clusters of point tracks with coherent motion. Second, we devise a new track clustering cost function that includes occlusion reasoning, in the form of depth ordering constraints, as well as motion similarity along the tracks. We evaluate the proposed approach on a challenging set of video sequences of office scenes from feature-length movies.
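
The motion-similarity half of such a clustering cost can be sketched by comparing the per-frame displacements of two tracks over the frames they share. A minimal sketch under an assumed track representation (dict of frame index to point); the occlusion/depth-ordering term is omitted, and `sigma` is an illustrative bandwidth.

```python
import numpy as np

def motion_affinity(track_a, track_b, sigma=1.0):
    """Motion similarity between two point tracks over their common frames.
    Tracks are dicts mapping frame index -> (x, y). A simplified stand-in
    for a track-clustering cost: it compares per-frame displacements and
    omits occlusion reasoning."""
    common = sorted(set(track_a) & set(track_b))
    if len(common) < 2:
        return 0.0  # no shared motion to compare
    a = np.array([track_a[t] for t in common], dtype=float)
    b = np.array([track_b[t] for t in common], dtype=float)
    da, db = np.diff(a, axis=0), np.diff(b, axis=0)  # per-frame displacements
    d = np.max(np.linalg.norm(da - db, axis=1))      # worst-case motion difference
    return float(np.exp(-d ** 2 / (2 * sigma ** 2)))
```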
Learning object class detectors from weakly annotated video
In International Conference on Computer Vision and Pattern Recognition (CVPR), 2012
"... Object detectors are typically trained on a large set of still images annotated by bounding-boxes. This paper introduces an approach for learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects ..."
Abstract
-
Cited by 52 (10 self)
- Add to MetaCart
(Show Context)
Object detectors are typically trained on a large set of still images annotated with bounding-boxes. This paper introduces an approach for learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos. To compare to the state of the art, we test our detector on still images, i.e., Pascal VOC 2007. We observe that frames extracted from web videos can differ significantly in quality from still images taken with a good camera. Thus, we formulate learning from videos as a domain adaptation task. We show that training from a combination of weakly annotated videos and fully annotated still images using domain adaptation improves the performance of a detector trained from still images alone.
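
The paper's domain adaptation is more involved than a snippet allows, but the underlying idea of mixing the two domains can be illustrated with simple instance weighting: down-weight the noisier video-frame examples when training a linear classifier. The weighting scheme and the 0.5 default are illustrative assumptions, not the authors' method.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_mixed_detector(X_img, y_img, X_vid, y_vid, video_weight=0.5):
    """Train a linear classifier on fully annotated still images plus weakly
    annotated video frames, down-weighting the noisier video domain.
    A simple instance-weighting baseline for mixing domains."""
    X = np.vstack([X_img, X_vid])
    y = np.concatenate([y_img, y_vid])
    # Still images get full weight; video frames a reduced one.
    w = np.concatenate([np.ones(len(y_img)),
                        np.full(len(y_vid), video_weight)])
    return LinearSVC(C=1.0).fit(X, y, sample_weight=w)
```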
Better Exploiting Motion for Better Action Recognition
In Computer Vision and Pattern Recognition (CVPR), 2013
"... Several recent works on action recognition have attested the importance of explicitly integrating motion character-istics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectori ..."
Abstract
-
Cited by 43 (3 self)
- Add to MetaCart
(Show Context)
Several recent works on action recognition have attested the importance of explicitly integrating motion characteristics in the video description. This paper establishes that adequately decomposing visual motion into dominant and residual motions, both in the extraction of the space-time trajectories and in the computation of descriptors, significantly improves action recognition algorithms. Then, we design a new motion descriptor, the DCS descriptor, based on differential motion scalar quantities: divergence, curl and shear features. It captures additional information on the local motion patterns, enhancing results. Finally, applying the recent VLAD coding technique proposed in image retrieval provides a substantial improvement for action recognition. Our three contributions are complementary and together outperform all reported results by a significant margin on three challenging datasets, namely Hollywood 2, HMDB51 and Olympic Sports.
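
The three scalar quantities behind a DCS-style descriptor can be computed from a dense flow field with finite differences, as sketched below. The paper's actual descriptor adds binning and aggregation along trajectories; this sketch stops at the raw divergence, curl, and shear maps.

```python
import numpy as np

def kinematic_features(flow):
    """Divergence, curl, and shear of a dense flow field (H x W x 2),
    computed with finite differences -- the scalar quantities underlying a
    DCS-style descriptor (binning and aggregation omitted)."""
    u, v = flow[..., 0], flow[..., 1]
    du_dy, du_dx = np.gradient(u)   # np.gradient returns d/dy first, then d/dx
    dv_dy, dv_dx = np.gradient(v)
    div = du_dx + dv_dy             # local expansion/contraction
    curl = dv_dx - du_dy            # local rotation
    shear = np.hypot(du_dx - dv_dy, du_dy + dv_dx)  # shear magnitude
    return div, curl, shear
```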
Object Segmentation in Video: A Hierarchical Variational Approach for Turning Point Trajectories into Dense Regions
"... Point trajectories have emerged as a powerful means to obtain high quality and fully unsupervised segmentation of objects in video shots. They can exploit the long term motion difference between objects, but they tend to be sparse due to computational reasons and the difficulty in estimating motion ..."
Abstract
-
Cited by 34 (5 self)
- Add to MetaCart
(Show Context)
Point trajectories have emerged as a powerful means to obtain high-quality and fully unsupervised segmentation of objects in video shots. They can exploit the long-term motion difference between objects, but they tend to be sparse due to computational reasons and the difficulty of estimating motion in homogeneous areas. In this paper we introduce a variational method to obtain dense segmentations from such sparse trajectory clusters. Information is propagated with a hierarchical, nonlinear diffusion process that runs in the continuous domain but takes superpixels into account. We show that this process raises the density from 3% to 100% and even increases the average precision of labels.
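
A bare-bones version of propagating sparse trajectory labels into a dense map is plain isotropic diffusion with the labeled pixels held fixed, sketched below. The paper's method is hierarchical, nonlinear, and superpixel-aware; the iteration count and array layout here are illustrative assumptions.

```python
import numpy as np

def diffuse_labels(scores, known_mask, iters=200):
    """Propagate sparse trajectory labels to a dense label map with isotropic
    diffusion on the pixel grid -- a bare-bones stand-in for a hierarchical,
    superpixel-aware nonlinear diffusion.
    scores: per-class evidence (H, W, C), nonzero only where known_mask is True."""
    f = scores.astype(float).copy()
    for _ in range(iters):
        # Average of the four neighbours: one discrete heat-equation step
        # (np.roll wraps at the borders, which is good enough for a sketch).
        avg = 0.25 * (np.roll(f, 1, 0) + np.roll(f, -1, 0) +
                      np.roll(f, 1, 1) + np.roll(f, -1, 1))
        f = np.where(known_mask[..., None], scores, avg)  # keep seeds fixed
    return f.argmax(axis=2)  # dense label per pixel
```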
Maximum weight cliques with mutex constraints for video object segmentation
2012
"... In this paper, we address the problem of video object segmentation, which is to automatically identify the primary object and segment the object out in every frame. We propose a novel formulation of selecting object region candidates simultaneously in all frames as finding a maximum weight clique in ..."
Abstract
-
Cited by 25 (3 self)
- Add to MetaCart
In this paper, we address the problem of video object segmentation, which is to automatically identify the primary object and segment the object out in every frame. We propose a novel formulation of selecting object region candidates simultaneously in all frames as finding a maximum weight clique in a weighted region graph. The selected regions are expected to have high objectness score (unary potential) as well as share similar appearance (binary potential). Since both unary and binary potentials are unreliable, we introduce two types of mutex (mutual exclusion) constraints on regions in the same clique: intra-frame and inter-frame constraints. Both types of constraints are expressed in a single quadratic form. We propose a novel algorithm to compute the maximal weight cliques that satisfy the constraints. We apply our method to challenging benchmark videos and obtain very competitive results that outperform state-of-the-art methods.
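
Computing the constrained maximum weight clique exactly is the paper's contribution; as a rough illustration of the objective, the greedy heuristic below grows a clique from the highest-scoring candidate while respecting the mutex matrix. The greedy scheme and the gain formula are assumptions for illustration, not the authors' algorithm.

```python
import numpy as np

def greedy_mutex_clique(weights, affinity, mutex):
    """Greedily grow a high-weight clique of region candidates under mutex
    constraints -- a heuristic sketch of the objective, not the paper's
    exact algorithm.
    weights:  unary objectness score per candidate, shape (n,)
    affinity: pairwise appearance similarity, shape (n, n)
    mutex:    boolean (n, n), True where two candidates may not co-occur."""
    n = len(weights)
    clique = [int(np.argmax(weights))]          # seed with the best candidate
    candidates = set(range(n)) - set(clique)
    while True:
        # Keep only candidates compatible with everything already selected.
        ok = [j for j in candidates if not any(mutex[j, i] for i in clique)]
        if not ok:
            break
        # Gain = unary weight + appearance agreement with the current clique.
        gains = [weights[j] + affinity[j, clique].sum() for j in ok]
        best_idx = int(np.argmax(gains))
        if gains[best_idx] <= 0:
            break
        clique.append(ok[best_idx])
        candidates.discard(ok[best_idx])
    return clique
```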
Higher order motion models and spectral clustering
In CVPR, 2012
"... Motion segmentation based on point trajectories can integrate information of a whole video shot to detect and separate moving objects. Commonly, similarities are defined between pairs of trajectories. However, pairwise similarities restrict the motion model to translations. Nontranslational motion, ..."
Abstract
-
Cited by 25 (2 self)
- Add to MetaCart
(Show Context)
Motion segmentation based on point trajectories can integrate information of a whole video shot to detect and separate moving objects. Commonly, similarities are defined between pairs of trajectories. However, pairwise similarities restrict the motion model to translations. Non-translational motion, such as rotation or scaling, is penalized in such an approach. We propose to define similarities on higher-order tuples rather than pairs, which leads to hypergraphs. To apply spectral clustering, the hypergraph is transferred to an ordinary graph, an operation that can be interpreted as a projection. We propose a specific nonlinear projection via a regularized maximum operator, and show that it yields significant improvements both compared to pairwise similarities and to alternative hypergraph projections.
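
The projection step can be sketched as follows: score each trajectory triplet against a motion model, fold those triplet scores into a pairwise affinity matrix with a p-norm whose large exponent approximates a maximum operator, then run ordinary spectral clustering. `affinity_fn`, the exponent `p`, and the use of scikit-learn are illustrative assumptions, not the paper's regularized maximum.

```python
import numpy as np
from itertools import combinations
from sklearn.cluster import SpectralClustering

def cluster_tracks_higher_order(tracks, affinity_fn, n_clusters=2, p=8):
    """Project higher-order (triplet) similarities onto an ordinary graph and
    apply spectral clustering. `affinity_fn` is an assumed user-supplied
    function returning a nonnegative score for how well three tracks fit one
    non-translational motion model. Enumerating all triplets is O(n^3), so
    this sketch only suits small track sets."""
    n = len(tracks)
    W = np.zeros((n, n))
    for i, j, k in combinations(range(n), 3):
        a = affinity_fn(tracks[i], tracks[j], tracks[k])
        for u, v in ((i, j), (i, k), (j, k)):
            W[u, v] += a ** p   # accumulate a p-norm over tuples
    # (sum a^p)^(1/p): for large p this approximates the max over tuples
    # containing the pair, a crude stand-in for the regularized maximum.
    W = (W + W.T) ** (1.0 / p)
    return SpectralClustering(n_clusters=n_clusters,
                              affinity='precomputed').fit_predict(W)
```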