Latent Hierarchical Model of Temporal Structure for Complex Activity Classification

by Limin Wang, Yu Qiao, Xiaoou Tang
Results 1 - 8 of 8

Bag of Visual Words and Fusion Methods for Action Recognition: Comprehensive Study and Good Practice

by Xiaojiang Peng, Limin Wang, Xingxing Wang, Yu Qiao, 2014
Abstract - Cited by 19 (10 self)
Abstract not found

Citation Context

...parts or track them at each frame. However, the detection and tracking of body parts is still an unsolved problem in realistic videos. Recently, recognition methods using local spatiotemporal features [24,25,46,53] have become the mainstream and obtained state-of-the-art performance on many datasets [47]. These methods do not require algorithms to detect the human body, which treat the action volume as a rigid ...

Mining motion atoms and phrases for complex action recognition

by Limin Wang, Yu Qiao, Xiaoou Tang - In ICCV, 2013
Abstract - Cited by 16 (11 self)
This paper proposes motion atom and phrase as a mid-level temporal "part" for representing and classifying complex action. Motion atom is defined as an atomic part of action, and captures the motion information of action video in a short temporal scale. Motion phrase is a temporal composite of multiple motion atoms with an AND/OR structure, which further enhances the discriminative ability of motion atoms by incorporating temporal constraints in a longer scale. Specifically, given a set of weakly labeled action videos, we first design a discriminative clustering method to automatically discover a set of representative motion atoms. Then, based on these motion atoms, we mine effective motion phrases with high discriminative and representative power. We introduce a bottom-up phrase construction algorithm and a greedy selection method for this mining task. We examine the classification performance of the motion atom and phrase based representation on two complex action datasets: Olympic Sports and UCF50. Experimental results show that our method achieves superior performance over recently published methods on both datasets.
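A minimal sketch of the atom-discovery step described in this abstract: plain k-means stands in for the paper's discriminative clustering, and the purity-based score, feature shapes, and cluster counts are illustrative assumptions rather than details taken from the paper.

import numpy as np
from sklearn.cluster import KMeans

def discover_atoms(segment_feats, segment_labels, n_atoms=50, n_keep=20):
    # segment_feats: (n_segments, dim) descriptors of short temporal segments
    # segment_labels: (n_segments,) action label of each segment's source video
    km = KMeans(n_clusters=n_atoms, n_init=10).fit(segment_feats)
    assign = km.labels_
    # Score each candidate atom by the class purity of its member segments,
    # a crude proxy for the paper's discriminative criterion.
    scores = np.zeros(n_atoms)
    for a in range(n_atoms):
        members = segment_labels[assign == a]
        if members.size:
            _, counts = np.unique(members, return_counts=True)
            scores[a] = counts.max() / counts.sum()
    keep = np.argsort(scores)[::-1][:n_keep]  # greedy top-k selection
    return km.cluster_centers_[keep]          # atom prototypes

In the paper, a bottom-up construction step then composes selected atoms into AND/OR motion phrases; the sketch stops at atom selection.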

Citation Context

...its richer temporal structures and is composed of a sequence of atomic actions. Recently, research shows that the temporal structures of complex action yield effective cues for action classification [8, 15, 23, 26]. As shown in Figure 1, from a long temporal scale, a complex action can be decomposed into a sequence of atomic motions. For instance, the sport action of high jump can be decomposed into running, jum...

Video action detection with relational dynamic-poselets

by Limin Wang, Yu Qiao, Xiaoou Tang - In ECCV, 2014
Abstract - Cited by 12 (3 self)
• Problem: We aim to not only recognize the on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation).
• Key insights:

Citation Context

...oselets. Dynamic-poselets capture both the pose configuration and motion pattern of local cuboids, which are suitable for action detection in video. Relational Model in Action. Several previous works [9, 3, 19, 28, 18] have considered the relations among parts for action recognition and detection. Lan et al. [9] detected 2D parts frame-by-frame with tracking constraints using CRF. Brendel et al. [3] proposed a spat...

Action recognition with trajectory-pooled deep-convolutional descriptors

by Limin Wang, Yu Qiao, Xiaoou Tang - In CVPR, 2015
Abstract - Cited by 8 (5 self)
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets.
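The two normalizations named here are concrete enough for a short sketch. The abstract gives no formulas, so max-normalization over the respective axes is an assumption (matching the names "spatiotemporal" and "channel"), and the (T, H, W, C) map layout is illustrative.

import numpy as np

def spatiotemporal_normalize(fmap, eps=1e-8):
    # Divide each channel by its max over the whole spatiotemporal extent,
    # so all channels lie in a comparable range. fmap: (T, H, W, C).
    return fmap / (fmap.max(axis=(0, 1, 2), keepdims=True) + eps)

def channel_normalize(fmap, eps=1e-8):
    # Divide each spatiotemporal position by its max across channels,
    # so no single channel dominates a position's descriptor.
    return fmap / (fmap.max(axis=3, keepdims=True) + eps)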

Citation Context

... previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets. 1. Introduction Human action recognition [1, 24, 31, 35, 36] in videos attracts increasing research interest in the computer vision community due to its potential applications in video surveillance, human computer interaction, and video content analysis. However,...

Multi-view super vector for action recognition

by Zhuowei Cai, Limin Wang, Xiaojiang Peng, Yu Qiao - in Proc. IEEE Conf. CVPR, 2014
Abstract - Cited by 8 (4 self)
Images and videos are often characterized by multiple types of local descriptors such as SIFT, HOG and HOF, each of which describes certain aspects of object feature. Recognition systems benefit from fusing multiple types of these descriptors. Two widely applied fusion pipelines are descriptor concatenation and kernel average. The first one is effective when different descriptors are strongly correlated, while the second one is probably better when descriptors are relatively independent. In practice, however, different descriptors are neither fully independent nor fully correlated, and previous fusion methods may not be satisfying. In this paper, we propose a new global representation, Multi-View Super Vector (MVSV), which is composed of relatively independent components derived from a pair of descriptors. Kernel average is then applied on these components to produce the recognition result. To obtain MVSV, we develop a generative mixture model of probabilistic canonical correlation analyzers (M-PCCA), and utilize the hidden factors and gradient vectors of M-PCCA to construct MVSV for video representation. Experiments on video based action recognition tasks show that MVSV achieves promising results, and outperforms FV and VLAD with descriptor concatenation or kernel average fusion strategy.
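The two baseline fusion pipelines that MVSV is positioned against are easy to make concrete; the linear kernels and equal 0.5 weighting below are illustrative choices, not from the paper.

import numpy as np

def concat_fusion(X1, X2):
    # Early fusion: one joint vector per sample; suits strongly
    # correlated descriptors. X1: (n, d1), X2: (n, d2).
    return np.hstack([X1, X2])

def kernel_average_fusion(X1, X2):
    # Late fusion: average per-descriptor kernels; suits descriptors
    # that are closer to independent. Feed the result to a kernel SVM.
    return 0.5 * (X1 @ X1.T + X2 @ X2.T)

MVSV itself sits between these two extremes: M-PCCA factors a descriptor pair into shared and private components before kernel averaging, which this sketch does not attempt.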

Citation Context

... results, and outperforms FV and VLAD with descriptor concatenation or kernel average fusion strategy. 1. Introduction Action recognition has been an active research area due to its wide applications [1, 33, 34, 36]. Early research focus had been on datasets with limited size and relatively controlled settings, such as the KTH dataset [26], but later shifted to large and more realistic datasets such as the HMDB5...

Motion part regularization: improving action recognition via trajectory group selection

by Bingbing Ni, Pierre Moulin, Xiaokang Yang, Shuicheng Yan - in Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
Abstract - Cited by 1 (0 self)
Dense local trajectories have been successfully used in action recognition. However, for most actions only a few local motion features (e.g., critical movement of hand, arm, leg, etc.) are responsible for the action label. Therefore, highlighting the local features which are associated with important motion parts will lead to a more discriminative action representation. Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework to mine for discriminative groups of dense trajectories which form important motion parts. First, motion part candidates are generated by spatio-temporal grouping of densely extracted trajectories. Second, an objective function which encourages sparse selection of these trajectory groups is formulated together with an action class discriminative term. Then, we propose an alternative optimization algorithm to efficiently solve this objective function by introducing a set of auxiliary variables which correspond to the discriminativeness weights of each motion part (trajectory group). These learned motion part weights are further utilized to form a discriminativeness-weighted Fisher vector representation for each action sample for final classification. The proposed motion part regularization framework achieves state-of-the-art performance on several action recognition benchmarks.
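The aggregation step at the end of this abstract, a discriminativeness-weighted Fisher vector, can be sketched as follows. The weight learning itself is omitted, and the power/L2 normalization is standard Fisher-vector practice rather than a detail stated in the abstract.

import numpy as np

def weighted_fisher_vector(part_fvs, part_weights, eps=1e-8):
    # part_fvs: (n_parts, fv_dim) Fisher vector of each trajectory group
    # part_weights: (n_parts,) learned discriminativeness weights
    w = part_weights / (part_weights.sum() + eps)
    fv = (w[:, None] * part_fvs).sum(axis=0)   # weighted pooling over parts
    fv = np.sign(fv) * np.sqrt(np.abs(fv))     # power normalization
    return fv / (np.linalg.norm(fv) + eps)     # L2 normalization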

Citation Context

(flattened results table from the excerpt; the dataset column headers are not recoverable from the snippet)

...IFV) [25]                             62.2% (64.3%)   52.1% (57.2%)   83.3% (91.1%)
Jiang et al. [6]                         59.5%           40.7%           80.6%
Jain et al. [5]                          62.5%           52.1%           83.2%
Motion Atoms/Phrases [27] (+low level)   –               –               79.5% (84.9%)
LHM + Dense Trajectory [28]              59.9%           –               83.2%
Motion Actons [35]                       61.4%           54.0%           –
Stacked Fisher Vector [15]               –               66.8%           –
Motion Part Regularization (ours)        66.7%           65.5%           92.3%

3) the method proposed by Jain et al. [5] which decomposes visu...

MoFAP: A Multi-level Representation for Action Recognition

by Limin Wang, Yu Qiao, Xiaoou Tang - Int J Comput Vis
Abstract not found

Citation Context

...corporating spatial and temporal relations among these low-level features, such as Temporal Structure Model (Niebles et al. 2010), Variable-duration HMM (Tang et al. 2012), Latent Hierarchical Model (Wang et al. 2014a), and Segmental Grammar Model (Pirsiavash and Ramanan 2014). Most of these statistical models resort to iterative algorithms to estimate model parameters and approximate inference techniques to spee...

A Joint Evaluation of Dictionary Learning and Feature Encoding for Action Recognition

by Xiaojiang Peng, Limin Wang, Yu Qiao, Qiang Peng
Abstract
Abstract—Many mid-level representations have been developed to replace the traditional bag-of-words model (VQ + k-means), such as sparse coding, OMP-k with k-SVD, and Fisher vector with GMM in the image domain. These approaches can be split into a dictionary learning phase and a feature encoding phase, which are often closely related. In this paper, we jointly evaluate the effect of these two phases for video-based action recognition. Specifically, we compare several dictionary learning methods and feature encoding schemes through extensive experiments on the KTH and HMDB51 datasets. Experimental results indicate that Fisher vector performs consistently better than the other encoding methods, and sparse coding is robust to different dictionaries, even random weights. In addition, we observe that the advantages of sophisticated mid-level representations do not come from their specific dictionaries but from the encoding mechanisms, and we can just use randomly selected exemplars as dictionaries for most encoding methods. Finally, we achieve state-of-the-art results on HMDB51 and UCF101 by combining our configurations with improved dense trajectory features.
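The paper's finding that randomly selected exemplars can serve as dictionaries is easy to illustrate. Below is a minimal sketch pairing such a dictionary with the traditional VQ (hard-assignment histogram) encoding; the dictionary size and all shapes are illustrative assumptions.

import numpy as np

def random_exemplar_dictionary(local_feats, dict_size=4000, seed=0):
    # Pick dict_size local features at random to act as the "dictionary".
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(local_feats), size=dict_size, replace=False)
    return local_feats[idx]

def vq_encode(video_feats, dictionary):
    # Hard-assign each local feature to its nearest atom and return the
    # normalized histogram of assignments (the traditional BoW encoding).
    d2 = ((video_feats[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    hist = np.bincount(d2.argmin(axis=1), minlength=len(dictionary))
    return hist / (hist.sum() + 1e-8)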