Behavior recognition via sparse spatio-temporal features. VS-PETS (2005)

by P. Dollár, V. Rabaud, G. Cottrell, S. Belongie
Results 1 - 10 of 717

Learning realistic human actions from movies

by Ivan Laptev, Marcin Marszałek, Cordelia Schmid, Benjamin Rozenfeld - IN: CVPR. , 2008
"... The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems one of which is the lack of realistic and annotated video datasets. Our first contribut ..."
Abstract - Cited by 738 (48 self) - Add to MetaCart
The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multichannel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
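For orientation, here is a minimal sketch of the kind of bag-of-features classification with a non-linear (chi-square kernel) SVM that this abstract describes. All data below is placeholder and the kernel and SVM parameters are illustrative assumptions, not the paper's settings:

```python
# Sketch of a bag-of-features action classifier with a chi-square kernel SVM,
# in the spirit of the abstract above; feature extraction is assumed done elsewhere.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

# Placeholder data: one L1-normalized visual-word histogram per video clip.
rng = np.random.default_rng(0)
train_hists = rng.random((40, 500)); train_hists /= train_hists.sum(1, keepdims=True)
test_hists = rng.random((10, 500));  test_hists /= test_hists.sum(1, keepdims=True)
train_labels = rng.integers(0, 6, size=40)   # e.g. 6 KTH action classes

# Precompute the exponential chi-square kernel and train a kernel SVM.
K_train = chi2_kernel(train_hists, gamma=0.5)
clf = SVC(kernel="precomputed", C=10.0).fit(K_train, train_labels)

# At test time the kernel is evaluated between test and training histograms.
K_test = chi2_kernel(test_hists, train_hists, gamma=0.5)
print(clf.predict(K_test))
```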

Citation Context

...work we use more sophisticated text classification tools to overcome action variability in text. Similar to ours, several recent methods explore bag-of-features representations for action recognition [3, 6, 13, 15, 19], but only address human actions in controlled and simplified settings. Recognition and localization of actions in movies has been recently addressed in [8] for a limited dataset, i.e., manual annotat...

Unsupervised learning of human action categories using spatial-temporal words

by Juan Carlos Niebles, Hongcheng Wang, Li Fei-fei - In Proc. BMVC , 2006
"... Imagine a video taken on a sunny beach, can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? To automatically classify or localize different actions in video sequences ..."
Abstract - Cited by 494 (8 self) - Add to MetaCart
Imagine a video taken on a sunny beach: can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? Automatically classifying or localizing different actions in video sequences is very useful for a variety of tasks, such as video surveillance, object-level video summarization, video indexing, digital library organization, etc. However, it remains a challenging task for computers to achieve robust action recognition due to cluttered background, camera motion, occlusion, and geometric and photometric variances of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink, and the camera also moves to follow the skater. With moving camera, non-stationary background, and moving target, few vision algorithms could identify, categorize and ...
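A minimal sketch of the unsupervised step this abstract describes, discovering latent action categories from "spatial-temporal word" counts, assuming features were already quantized against a codebook. The paper uses pLSA; scikit-learn's LDA is used here as a close, readily available stand-in, and all data and sizes are placeholders:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Placeholder: rows are videos, columns are visual-word counts (codebook size 300).
word_counts = rng.poisson(2.0, size=(60, 300))

lda = LatentDirichletAllocation(n_components=6, random_state=0)  # 6 latent "actions"
topic_mix = lda.fit_transform(word_counts)    # per-video topic proportions

# Each video is assigned the latent topic that best explains its words.
print(topic_mix.argmax(axis=1))
```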

Citation Context

...the image values have significant local variations in both dimensions. The representation has been successfully applied to human action recognition combined with an SVM classifier [17]. Dollár et al. [7] propose an alternative approach to detect sparse space-time interest points based on separable linear filters for behavior recognition. Local space-time patches, therefore, have been proven useful to...
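The citation above refers to the separable linear-filter detector of the indexed paper itself, whose response function is R = (I * g * h_ev)^2 + (I * g * h_od)^2, with a 2D spatial Gaussian g and a temporal quadrature pair of 1D Gabor filters h_ev, h_od. A sketch follows; parameter values are placeholders apart from the paper's omega = 4/tau convention:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, convolve1d

def cuboid_response(video, sigma=2.0, tau=2.5):
    """video: (T, H, W) float array; returns the interest-point response R."""
    omega = 4.0 / tau
    t = np.arange(-int(2 * tau), int(2 * tau) + 1)
    h_ev = -np.cos(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    h_od = -np.sin(2 * np.pi * t * omega) * np.exp(-t**2 / tau**2)
    # Spatial smoothing only (axes 1, 2), then temporal Gabor filtering (axis 0).
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))
    even = convolve1d(smoothed, h_ev, axis=0)
    odd = convolve1d(smoothed, h_od, axis=0)
    return even**2 + odd**2   # local maxima of R give the "cuboid" centers

video = np.random.default_rng(0).random((30, 64, 64))
print(cuboid_response(video).shape)
```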

Action Recognition by Dense Trajectories

by Heng Wang , Alexander Kläser , Cordelia Schmid , Liu Cheng-lin , 2011
"... Feature trajectories have shown to be efficient for rep-resenting videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as quantity of these trajecto-ries is often not sufficient. Inspired by the recent success of dense ..."
Abstract - Cited by 293 (16 self) - Add to MetaCart
Feature trajectories have been shown to be efficient for representing videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as the quantity of these trajectories is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sample dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajectories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the motion information in videos well. We also investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular in uncontrolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improvement over the state of the art on four datasets of varying difficulty, i.e. KTH, YouTube, Hollywood2 and UCF sports.
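A minimal sketch of the dense-trajectory idea: sample points on a regular grid and propagate them with a dense optical flow field, here OpenCV's Farneback algorithm standing in for the paper's flow method. The grid step is a placeholder, and the paper's median filtering of the flow and trajectory-length limit are omitted:

```python
import numpy as np
import cv2

def track_dense_points(frames, step=5):
    """frames: list of grayscale uint8 images; returns per-point trajectories."""
    h, w = frames[0].shape
    ys, xs = np.mgrid[step // 2:h:step, step // 2:w:step]
    pts = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(np.float32)
    trajs = [pts.copy()]
    for prev, nxt in zip(frames[:-1], frames[1:]):
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        xi = np.clip(pts[:, 0].round().astype(int), 0, w - 1)
        yi = np.clip(pts[:, 1].round().astype(int), 0, h - 1)
        pts = pts + flow[yi, xi]   # move each point along its local flow vector
        trajs.append(pts.copy())
    return np.stack(trajs)         # shape: (T, num_points, 2)

frames = [np.random.randint(0, 255, (48, 64), np.uint8) for _ in range(10)]
print(track_dense_points(frames).shape)
```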

Evaluation of local spatio-temporal features for action recognition

by Heng Wang, Muhammad Muneeb Ullah, Alexander Kläser, Ivan Laptev, Cordelia Schmid - University of Central Florida, U.S.A , 2009
"... Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of ex ..."
Abstract - Cited by 274 (25 self) - Add to MetaCart
Local space-time features have recently become a popular video representation for action recognition. Several methods for feature localization and description have been proposed in the literature and promising recognition results were demonstrated for a number of action classes. The comparison of existing methods, however, is often limited given the different experimental settings used. The purpose of this paper is to evaluate and compare previously proposed space-time features in a common experimental setup. In particular, we consider four different feature detectors and six local feature descriptors and use a standard bag-of-features SVM approach for action recognition. We investigate the performance of these methods on a total of 25 action classes distributed over three datasets with varying difficulty. Among interesting conclusions, we demonstrate that regular sampling of space-time features consistently outperforms all tested space-time interest point detectors for human actions in realistic settings. We also demonstrate a consistent ranking for the majority of methods over different datasets and discuss their advantages and limitations.
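The regular sampling this evaluation favors can be sketched as a fixed space-time grid of cuboid patches; patch size and stride below are placeholder values, not the paper's settings:

```python
import numpy as np

def dense_cuboids(video, size=(10, 16, 16), stride=(5, 8, 8)):
    """video: (T, H, W) array; yields every full-size space-time patch."""
    (st, sh, sw), (dt, dh, dw) = size, stride
    T, H, W = video.shape
    for t in range(0, T - st + 1, dt):
        for y in range(0, H - sh + 1, dh):
            for x in range(0, W - sw + 1, dw):
                yield video[t:t + st, y:y + sh, x:x + sw]

video = np.zeros((30, 64, 64))
print(sum(1 for _ in dense_cuboids(video)))  # number of sampled cuboids
```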

Citation Context

...s are usually extracted directly from video and therefore avoid possible failures of other pre-processing methods such as motion segmentation and tracking. Many different space-time feature detectors [6, 10, 14, 22, 26, 27] and descriptors [12, 15, 16, 25, 26] have been proposed in the past few years. Feature detectors usually select...

A biologically inspired system for action recognition

by H. Jhuang, T. Serre, L. Wolf, T. Poggio - In ICCV , 2007
"... We present a biologically-motivated system for the recognition of actions from video sequences. The approach builds on recent work on object recognition based on hierarchical feedforward architectures [25, 16, 20] and extends a neurobiological model of motion processing in the visual cortex [10]. Th ..."
Abstract - Cited by 238 (15 self) - Add to MetaCart
We present a biologically-motivated system for the recognition of actions from video sequences. The approach builds on recent work on object recognition based on hierarchical feedforward architectures [25, 16, 20] and extends a neurobiological model of motion processing in the visual cortex [10]. The system consists of a hierarchy of spatio-temporal feature detectors of increasing complexity: an input sequence is first analyzed by an array of motion-direction sensitive units which, through a hierarchy of processing stages, lead to position-invariant spatio-temporal feature detectors. We experiment with different types of motion-direction sensitive units as well as different system architectures. As in [16], we find that sparse features in intermediate stages outperform dense ones and that using a simple feature selection approach leads to an efficient system that performs better with far fewer features. We test the approach on different publicly available action datasets, in all cases achieving the highest results reported to date.
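A deliberately simplified caricature of the first stages of such a hierarchy: motion-direction sensitive units realized as space-time Gabor filters, followed by local max pooling for position invariance. All filter parameters are placeholders; the model in the paper is considerably richer:

```python
import numpy as np
from scipy.ndimage import convolve, maximum_filter

def direction_units(video, n_dirs=4, size=7, wavelength=4.0, speed=1.0):
    """video: (T, H, W); returns one pooled response map per motion direction."""
    r = size // 2
    t, y, x = np.mgrid[-r:r + 1, -r:r + 1, -r:r + 1]
    env = np.exp(-(x**2 + y**2 + t**2) / (2 * (size / 4.0)**2))
    maps = []
    for theta in np.linspace(0, np.pi, n_dirs, endpoint=False):
        # Carrier drifting along direction theta with the given speed.
        phase = x * np.cos(theta) + y * np.sin(theta) - speed * t
        kern = env * np.cos(2 * np.pi * phase / wavelength)
        resp = np.abs(convolve(video, kern, mode="nearest"))  # "S1"-like units
        maps.append(maximum_filter(resp, size=(3, 5, 5)))     # "C1"-like pooling
    return np.stack(maps)

video = np.random.default_rng(0).random((16, 32, 32))
print(direction_units(video).shape)  # (n_dirs, T, H, W)
```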

Citation Context

...ions. The idea of extending object descriptors to actions has already been shown to be a good one, as illustrated by the excellent performance in the non-biologically motivated system by Dollár et al. [5]. 1.1. Our approach Our approach is closely related to feedforward hierarchical template matching architectures that have been used for the recognition of objects in still images. These systems have b...

A spatio-temporal descriptor based on 3d-gradients

by Alexander Kläser, Marcin Marszałek, Cordelia Schmid - In BMVC’08
"... In this work, we present a novel local descriptor for video sequences. The proposed descriptor is based on histograms of oriented 3D spatio-temporal gradients. Our contribution is four-fold. (i) To compute 3D gradients for arbitrary scales, we develop a memory-efficient algorithm based on integral v ..."
Abstract - Cited by 234 (6 self) - Add to MetaCart
In this work, we present a novel local descriptor for video sequences. The proposed descriptor is based on histograms of oriented 3D spatio-temporal gradients. Our contribution is four-fold. (i) To compute 3D gradients for arbitrary scales, we develop a memory-efficient algorithm based on integral videos. (ii) We propose a generic 3D orientation quantization which is based on regular polyhedrons. (iii) We perform an in-depth evaluation of all descriptor parameters and optimize them for action recognition. (iv) We apply our descriptor to various action datasets (KTH, Weizmann, Hollywood) and show that we outperform the state-of-the-art.
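A sketch of the data structure behind contribution (i): an integral video, from which the sum (hence mean gradient) over any axis-aligned cuboid comes out in constant time via inclusion-exclusion. The polyhedron-based orientation quantization of contribution (ii) is not shown:

```python
import numpy as np

def integral_video(vol):
    """Cumulative sums along t, y, x, zero-padded so box_sum indexing is clean."""
    iv = vol.cumsum(0).cumsum(1).cumsum(2)
    return np.pad(iv, ((1, 0), (1, 0), (1, 0)))

def box_sum(iv, t0, t1, y0, y1, x0, x1):
    """Sum of the original volume over [t0:t1, y0:y1, x0:x1); eight lookups."""
    return (iv[t1, y1, x1] - iv[t0, y1, x1] - iv[t1, y0, x1] - iv[t1, y1, x0]
            + iv[t0, y0, x1] + iv[t0, y1, x0] + iv[t1, y0, x0] - iv[t0, y0, x0])

vol = np.random.default_rng(0).random((20, 32, 32))
iv = integral_video(vol)
# The constant-time box sum matches a direct summation:
print(np.isclose(box_sum(iv, 2, 8, 4, 12, 5, 15), vol[2:8, 4:12, 5:15].sum()))
```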

Citation Context

...ude the paper in section 4. 1.1 Related work Local descriptors based on normalized pixel values, brightness gradients, and windowed optical flow were evaluated for action recognition by Dollár et al. [4]. Experiments on three datasets (KTH human actions, facial expressions and mouse behavior) show best results for gradient descriptors. Those descriptors, however, were computed by concatenating all grad...

Recognizing Realistic Actions from Videos “in the Wild”

by Jingen Liu, Jiebo Luo, Mubarak Shah
"... In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild. ” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous ..."
Abstract - Cited by 227 (13 self) - Add to MetaCart
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild.” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing actions from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization.
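A minimal sketch of the PageRank step: rank static features by power iteration on a feature-similarity graph. The similarity matrix here is a random placeholder; the paper builds it from feature matches across videos:

```python
import numpy as np

def pagerank(similarity, damping=0.85, iters=100):
    """Power iteration on a row-stochastic transition matrix."""
    n = similarity.shape[0]
    trans = similarity / similarity.sum(axis=1, keepdims=True)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * trans.T @ rank
    return rank

rng = np.random.default_rng(0)
sim = rng.random((50, 50)) + 1e-6   # placeholder feature-affinity graph
scores = pagerank(sim)
print(scores.argsort()[::-1][:10])  # indices of the 10 highest-ranked features
```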

Citation Context

...oes not require background subtraction and object tracking [3, 5], and can cope with certain camera motion and illumination changes, it is receiving increasing attention in generic action recognition [8, 9, 10, 11, 12, 29]. Typically, spatiotemporal interest points are first detected either by a 3D Harris corner detector [11] or Gabor filters [12], and the descriptor vectors around those interest points are then comput...

Machine recognition of human activities: A survey

by Pavan Turaga, Rama Chellappa, V. S. Subrahmanian, Octavian Udrea , 2008
"... The past decade has witnessed a rapid proliferation of video cameras in all walks of life and has resulted in a tremendous explosion of video content. Several applications such as content-based video annotation and retrieval, highlight extraction and video summarization require recognition of the a ..."
Abstract - Cited by 218 (0 self) - Add to MetaCart
The past decade has witnessed a rapid proliferation of video cameras in all walks of life and has resulted in a tremendous explosion of video content. Several applications such as content-based video annotation and retrieval, highlight extraction and video summarization require recognition of the activities occurring in the video. The analysis of human activities in videos is an area with increasingly important consequences from security and surveillance to entertainment and personal archiving. Several challenges at various levels of processing—robustness against errors in low-level processing, view and rate-invariant representations at midlevel processing and semantic representation of human activities at higher level processing—make this problem hard to solve. In this review paper, we present a comprehensive survey of efforts in the past couple of decades to address the problems of representation, recognition, and learning of human activities from video and related applications. We discuss the problem at two major levels of complexity: 1) “actions” and 2) “activities.” “Actions” are characterized by simple motion patterns typically executed by a single human. “Activities” are more complex and involve coordinated actions among a small number of humans. We will discuss several approaches and classify them according to their ability to handle varying degrees of complexity as interpreted above. We begin with a discussion of approaches to model the simplest of action classes known as atomic or primitive actions that do not require sophisticated dynamical modeling. Then, methods to model actions with more complex dynamics are discussed. The discussion then leads naturally to methods for higher level representation of complex activities.

Citation Context

... Dollár et al. [28] extract distinctive periodic motion-based landmarks in a given video using a Gaussian kernel in space and a Gabor function in time. Because these approaches are based on simple convolution operations...

Human Activity Analysis: A Review

by J. K. Aggarwal, M. S. Ryoo - To appear, ACM Computing Surveys
"... Human activity recognition is an important area of computer vision research. Its applications include surveillance systems, patient monitoring systems, and a variety of systems that involve interactions between persons and electronic devices such as human-computer interfaces. Most of these applicati ..."
Abstract - Cited by 214 (6 self) - Add to MetaCart
Human activity recognition is an important area of computer vision research. Its applications include surveillance systems, patient monitoring systems, and a variety of systems that involve interactions between persons and electronic devices such as human-computer interfaces. Most of these applications require an automated recognition of high-level activities, composed of multiple simple (or atomic) actions of persons. This paper provides a detailed overview of various state-of-the-art research papers on human activity recognition. We discuss both the methodologies developed for simple human actions and those for high-level activities. An approach-based taxonomy is chosen, comparing the advantages and limitations of each approach. Recognition methodologies for an analysis of simple actions of a single person are first presented in the paper. Space-time volume approaches and sequential approaches that represent and recognize activities directly from input images are discussed. Next, hierarchical recognition methodologies for high-level activities are presented and compared. Statistical approaches, syntactic approaches, and description-based approaches for hierarchical recognition are discussed in the paper. In addition, we further discuss the papers on the recognition of human-object interactions and group activities. Public datasets designed for the evaluation of the recognition methodologies are illustrated in our paper as well, comparing the methodologies’ performances. This review will provide the impetus for future research in more productive areas.

An efficient dense and scale-invariant spatio-temporal interest point detector

by Geert Willems, Tinne Tuytelaars, Luc Van Gool , 2008
"... Abstract. Over the years, several spatio-temporal interest point detectors have been proposed. While some detectors can only extract a sparse set of scale-invariant features, others allow for the detection of a larger amount of features at user-defined scales. This paper presents for the first time ..."
Abstract - Cited by 168 (3 self) - Add to MetaCart
Over the years, several spatio-temporal interest point detectors have been proposed. While some detectors can only extract a sparse set of scale-invariant features, others allow for the detection of a larger amount of features at user-defined scales. This paper presents for the first time spatio-temporal interest points that are at the same time scale-invariant (both spatially and temporally) and densely cover the video content. Moreover, as opposed to earlier work, the features can be computed efficiently. Applying scale-space theory, we show that this can be achieved by using the determinant of the Hessian as the saliency measure. Computations are sped up further through the use of approximative box-filter operations on an integral video structure. A quantitative evaluation and experimental results on action recognition show the strengths of the proposed detector in terms of repeatability, accuracy and speed, in comparison with previously proposed detectors.
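A sketch of the saliency measure: the determinant of the 3D space-time Hessian of a Gaussian-smoothed video. Exact Gaussian derivatives stand in here for the paper's faster box-filter approximations on an integral video, and the sigma values are placeholders:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hessian_det_saliency(video, sigma=(2.0, 3.0, 3.0)):
    """video: (T, H, W); returns |det H| of the scale-space Hessian per voxel."""
    axes = (0, 1, 2)
    # Second-order Gaussian derivatives: order 2 on one axis, or 1 on two axes.
    d = {}
    for i in axes:
        for j in axes:
            if i <= j:
                order = [0, 0, 0]
                order[i] += 1
                order[j] += 1
                d[(i, j)] = gaussian_filter(video, sigma, order=order)
    H = np.stack([np.stack([d[tuple(sorted((i, j)))] for j in axes], -1)
                  for i in axes], -2)   # (..., 3, 3) Hessian at every voxel
    return np.abs(np.linalg.det(H))     # local maxima -> interest points

video = np.random.default_rng(0).random((16, 32, 32))
print(hessian_det_saliency(video).shape)
```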

Citation Context

...mporal) scale change of 22%, a 2D spatial scaling of 35%, a 1D temporal scaling of 88%. 3.2 Discussion We compare our Hes-STIP detector with the HL-STIP detector of [5] at multiple scales and cuboids [6] extracted both at a single scale and at multiple scales. For all of these, we have used executables made available by the respective authors with default parameters. The multi-scale versions were run...
