
Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice. arXiv preprint arXiv:1405.4506 (2014)

by X Peng, L Wang, X Wang, Y Qiao
Results 1 - 10 of 19

Two-stream convolutional networks for action recognition in videos

by Karen Simonyan, Andrew Zisserman - CoRR
"... We investigate architectures of discriminatively trained deep Convolutional Net-works (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design as ..."
Abstract - Cited by 43 (3 self) - Add to MetaCart
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.

Citation Context

...ng datasets (UCF-101 [20] and HMDB-51 [15]) show that the two video recognition streams are complementary, and our architecture significantly outperforms that of [13] and matches the state of the art [18, 22, 23], in spite of being trained on a relatively small (for a large-capacity ConvNet) dataset. The rest of the paper is organised as follows. In Sect. 1.1 we review the related work on action recognition u...
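To make the two-stream design described in the abstract above concrete, here is a minimal sketch (assuming PyTorch; the layer sizes, the 101-class output, and fusion by averaging softmax scores are illustrative placeholders, not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class StreamNet(nn.Module):
    """A small ConvNet used for either the spatial or the temporal stream."""
    def __init__(self, in_channels, num_classes=101):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 96, kernel_size=7, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(256, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# Spatial stream: a single RGB frame (3 channels).
# Temporal stream: a stack of L = 10 optical-flow fields (2 * L channels).
spatial_net = StreamNet(in_channels=3)
temporal_net = StreamNet(in_channels=20)

rgb = torch.randn(4, 3, 224, 224)    # batch of still frames
flow = torch.randn(4, 20, 224, 224)  # batch of stacked flow fields

# Late fusion: average the per-class softmax scores of the two streams.
scores = (spatial_net(rgb).softmax(1) + temporal_net(flow).softmax(1)) / 2
```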

Video action detection with relational dynamic-poselets

by Limin Wang, Yu Qiao, Xiaoou Tang - In ECCV, 2014
"... •Problem: We aim to not only recognize on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation). •Key insights: ..."
Abstract - Cited by 12 (3 self) - Add to MetaCart
• Problem: We aim to not only recognize the on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation).
• Key insights:

Citation Context

...mputer interaction, and content-based retrieval. Most of the research efforts have been devoted to the problem of action recognition using the Bag of Visual Words (BoVW) framework or variants thereof [29, 24, 11]. These specially designed methods for action recognition usually require a short video clip to be cropped from a continuous video stream. Apart from the class label, however, they cannot provide fur...

Action recognition with trajectory-pooled deep-convolutional descriptors

by Limin Wang, Yu Qiao, Xiaoou Tang - In CVPR, 2015
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Abstract - Cited by 8 (5 self) - Add to MetaCart
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with those hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets.

Citation Context

... (Table 4)
HMDB51: STIP+BoVW [15] 23.0%, Motionlets [35] 42.1%, DT+BoVW [30] 46.6%, DT+MVSV [3] 55.9%, iDT+FV [31] 57.2%, iDT+HSV [21] 61.1%, Two Stream [24] 59.4%, TDD+FV 63.2%, Our best result 65.9%
UCF101: STIP+BoVW [26] 43.9%, Deep Net [12] 63.3%, DT+VLAD [3] 79.9%, DT+MVSV [3] 83.5%, iDT+FV [32] 85.9%, iDT+HSV [21] 87.9%, Two Stream [24] 88.0%, TDD+FV 90.3%, Our best result 91.5%
Table 4. Comparison of TDD to the state of the art. We separ...
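The two feature-map normalizations named in the abstract (spatiotemporal and channel normalization) can be sketched roughly as follows; this is a NumPy illustration of one plausible reading (max-normalization over the stated axes), with tensor shapes chosen purely for illustration:

```python
import numpy as np

def spatiotemporal_normalize(maps, eps=1e-8):
    """Normalize each channel by its maximum over all spatiotemporal positions.

    maps: array of shape (T, H, W, C) holding convolutional feature maps.
    """
    channel_max = maps.max(axis=(0, 1, 2), keepdims=True)  # one max per channel
    return maps / (channel_max + eps)

def channel_normalize(maps, eps=1e-8):
    """Normalize each spatiotemporal position by its maximum over channels."""
    position_max = maps.max(axis=3, keepdims=True)  # one max per (t, h, w)
    return maps / (position_max + eps)

# Toy feature maps: 5 frames, 14x14 spatial grid, 64 channels.
maps = np.random.rand(5, 14, 14, 64)
st_maps = spatiotemporal_normalize(maps)
ch_maps = channel_normalize(maps)
```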

Action and gesture temporal spotting with super vector representation

by Xiaojiang Peng, Limin Wang, Zhuowei Cai, Yu Qiao - In ECCV, ChaLearn Looking at People Workshop, 2014
"... Abstract. This paper focuses on describing our method designed for both track 2 and track 3 at Looking at People (LAP) challenging [1]. We propose an action and gesture spotting system, which is mainly com-posed of three steps: (i) temporal segmentation, (ii) clip classification, and (iii) post proc ..."
Abstract - Cited by 5 (1 self) - Add to MetaCart
This paper describes our method designed for both track 2 and track 3 of the Looking at People (LAP) challenge [1]. We propose an action and gesture spotting system, which is mainly composed of three steps: (i) temporal segmentation, (ii) clip classification, and (iii) post-processing. For track 2, we resort to a simple sliding-window method to divide each video sequence into clips, while for track 3, we design a segmentation method based on the motion analysis of human hands. Then, for each clip, we choose a kind of super vector representation with dense features. Based on this representation, we train a linear SVM to conduct action and gesture recognition. Finally, we use some post-processing techniques to avoid false positive detections. We demonstrate the effectiveness of our proposed method by participating in the contests of both track 2 and track 3. We obtain the best performance on track 2 and rank 4th on track 3, which indicates that the designed system is effective for action and gesture recognition.

Citation Context

...f importance for action and gesture recognition. With these low-level descriptors, we adopt the Bag of Visual Words [10] model to obtain the global representation. According to recent studies [11, 12], super vector based encoding methods are very effective, aggregating statistics of different orders into a high-dimensional feature representation. Specifically, we choose the Fisher Vector [9] as the encodi...
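As a rough illustration of the sliding-window temporal segmentation and thresholding steps described in the abstract above, here is a small sketch (window length, stride, threshold, and the scoring function are placeholders, not the authors' settings):

```python
import numpy as np

def sliding_window_clips(num_frames, window=60, stride=30):
    """Yield (start, end) frame indices of overlapping clips covering the video."""
    for start in range(0, max(num_frames - window + 1, 1), stride):
        yield start, min(start + window, num_frames)

def spot_actions(clip_scores, threshold=0.5):
    """Keep clips whose best class score passes a threshold (a stand-in for the
    post-processing that removes false positives)."""
    detections = []
    for (start, end), scores in clip_scores:
        label = int(np.argmax(scores))
        if scores[label] >= threshold:
            detections.append((start, end, label, float(scores[label])))
    return detections

# Example: fake per-clip class scores for a 300-frame video with 5 classes.
rng = np.random.default_rng(0)
clips = list(sliding_window_clips(300))
clip_scores = [(clip, rng.random(5)) for clip in clips]
print(spot_actions(clip_scores, threshold=0.8))
```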

Beyond gaussian pyramid: Multi-skip feature stacking for action recognition

by Zhenzhong Lan, Ming Lin, Xuanchong Li, Alexander G. Hauptmann, Bhiksha Raj - In CVPR, 2015
"... Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This atten-uation introduces bias to the resulting features and gener-ates ill-conditioned feature matrices. The Gaussian Pyra-mid has ..."
Abstract - Cited by 4 (0 self) - Add to MetaCart
Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature enhancing technique that encodes scale-invariant characteristics into the feature space in an attempt to deal with this attenuation. However, at the core of the Gaussian Pyramid is a convolutional smoothing operation, which makes it incapable of generating new features at coarse scales. In order to address this problem, we propose a novel feature enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks features extracted using a family of differential filters parameterized with multiple time skips and encodes shift-invariance into the frequency space. MIFS compensates for information lost from using differential operators by recapturing information at coarse scales. This recaptured information allows us to match actions at different speeds and ranges of motion. We prove that MIFS enhances the learnability of differential-based features exponentially. The resulting feature matrices from MIFS have much smaller condition numbers and variances than those from conventional methods. Experimental results show significantly improved performance on challenging action recognition and event detection tasks. Specifically, our method exceeds the state-of-

Citation Context

...ry method proposed by Wang et al. [38, 40], together with the Fisher Vector encoding [26], yields the current state-of-the-art performance on several benchmark action recognition datasets. Peng et al. [25] further improved the performance of Dense Trajectory by increasing the codebook sizes and fusing multiple coding methods. Some success has been reported recently using deep convolutional neural netwo...
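A minimal sketch of the multi-skip stacking idea from the abstract above (the frame-difference "feature extractor" and the skip set are placeholders standing in for the differential-filter features used in the paper):

```python
import numpy as np

def extract_features(frames):
    """Placeholder for a differential-filter-based feature extractor
    (e.g. trajectory descriptors); here it just returns frame differences."""
    return np.abs(np.diff(frames, axis=0)).reshape(len(frames) - 1, -1)

def mifs(frames, skips=(1, 2, 4, 8)):
    """Multi-skIp Feature Stacking: extract features from the video sampled
    at several time skips and stack them into one feature set."""
    stacked = [extract_features(frames[::s]) for s in skips if len(frames[::s]) > 1]
    return np.vstack(stacked)

video = np.random.rand(120, 32, 32)   # 120 toy frames
features = mifs(video)                # features pooled over all skips
print(features.shape)
```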

A discriminative cnn video representation for event detection

by Zhongwen Xu, Yi Yang, Alexander G. Hauptmann - In CVPR, 2015
"... In this paper, we propose a discriminative video rep-resentation for event detection over a large scale video dataset when only limited hardware resources are avail-able. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where on ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkit. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.

Citation Context

...ained by concatenating uk over all the K centers. Another variant of VLAD, called VLAD-k, which extends the nearest center to the k-nearest centers, has shown good performance in action recognition [18, 33]. Unless otherwise specified, we use VLAD-k with k = 5 by default. In addition to the power and ℓ2 normalization, we apply intra-normalization [4] to VLAD. ...
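A rough NumPy sketch of the VLAD-k encoding and normalizations mentioned in this snippet (intra-normalization, power, and ℓ2 normalization); the random codebook and the exact ordering of the normalization steps are assumptions for illustration:

```python
import numpy as np

def vlad_k(descriptors, centers, k=5):
    """Encode local descriptors against a codebook, accumulating residuals
    to each descriptor's k nearest centers (VLAD-k)."""
    K, d = centers.shape
    enc = np.zeros((K, d))
    # Pairwise squared distances between descriptors and centers.
    dists = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    nearest = np.argsort(dists, axis=1)[:, :k]
    for x, idx in zip(descriptors, nearest):
        for c in idx:
            enc[c] += x - centers[c]
    # Intra-normalization: l2-normalize each center's block independently.
    enc /= np.linalg.norm(enc, axis=1, keepdims=True) + 1e-12
    v = enc.ravel()
    # Power normalization followed by global l2 normalization.
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + 1e-12)

descriptors = np.random.rand(1000, 64)   # e.g. pooled CNN or trajectory descriptors
centers = np.random.rand(256, 64)        # codebook (normally learned by k-means)
encoding = vlad_k(descriptors, centers)  # length 256 * 64
```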

Towards good practices for very deep two-stream ConvNets

by Limin Wang, Yuanjun Xiong, Zhe Wang, Yu Qiao - CoRR
"... Deep convolutional networks have achieved great suc-cess for object recognition in still images. However, for ac-tion recognition in videos, the improvement of deep convo-lutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the curre ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement from deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g. Two-stream ConvNets [12]) are relatively shallow compared with the very deep models in the image domain (e.g. VGGNet [13], GoogLeNet [15]), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, and thus it is easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures to the video domain. However, this extension is not easy, as the action recognition datasets are quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox into a multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of 91.4%.

Citation Context

...2. Performance comparison of different architectures on the THUMOS15 [2] validation dataset. (from [20], without using our proposed good practices)
Method | Year | Accuracy
iDT+FV [16] | 2013 | 85.9%
iDT+HSV [10] | 2014 | 87.9%
MIFS+FV [8] | 2015 | 89.1%
TDD+FV [19] | 2015 | 90.3%
DeepNet [6] | 2014 | 63.3%
Two-stream [12] | 2014 | 88.0%
Two-stream+LSTM [9] | 2015 | 88.6%
Very deep two-stream | 2015 | 91.4%
Table 3. Performance comparis...
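The four training practices listed in the abstract above can be summarized as a configuration sketch; every concrete value and augmentation name below is a placeholder for illustration, not a setting reported by the authors:

```python
# Hypothetical training configuration illustrating the four "good practices"
# for very deep two-stream ConvNets; all numbers here are placeholders.
config = {
    "spatial_stream": {
        "init": "ImageNet pre-trained weights",          # (i) pre-train both streams
        "learning_rate": 1e-3,                           # (ii) smaller learning rate
        "dropout": 0.8,                                  # (iv) high dropout ratio
    },
    "temporal_stream": {
        "init": "pre-trained weights adapted to optical-flow input",
        "learning_rate": 1e-3,
        "dropout": 0.8,
    },
    # (iii) more aggressive data augmentation
    "augmentation": ["random cropping", "scale jittering", "horizontal flip"],
}
```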

CUHK&SIAT submission for THUMOS15 action recognition challenge

by Limin Wang, Zhe Wang, Yuanjun Xiong, Yu Qiao - In THUMOS’15 Action Recognition Challenge
"... This paper presents the method of our submission for THUMOS15 action recognition challenge. We propose a new action recognition system by exploiting very deep two-stream ConvNets and Fisher vector representation of iDT features. Specifically, we utilize those successful very deep architectures in im ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
This paper presents the method of our submission for the THUMOS15 action recognition challenge. We propose a new action recognition system by exploiting very deep two-stream ConvNets and a Fisher vector representation of iDT features. Specifically, we utilize successful very deep image architectures such as GoogLeNet and VGGNet to design the two-stream ConvNets. From our experiments, we see that deeper architectures obtain higher performance for spatial nets. However, for the temporal net, deeper architectures could not yield better recognition accuracy. Our analysis is that the UCF101 dataset is relatively small and it is very hard to train such deep networks on current action datasets. Compared with traditional iDT features, our implemented two-stream ConvNets significantly outperform them. We further combine the recognition scores of both two-stream ConvNets and iDT features, and achieve a 68% mAP value on the validation dataset of THUMOS15.

Citation Context

...ction recognition challenge. In previous research, there have been mainly two styles of algorithms for action recognition. The first style is low-level features with a Bag of Visual Words representation [5], and the second is applying deep neural networks to perform action recognition in an end-to-end manner [6]. The most successful low-level feature is the Improved Trajectories [11] and the most co...
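The score-level fusion of two-stream ConvNets and iDT Fisher vectors described in the abstract above could look roughly like this (the fusion weights and score shapes are illustrative assumptions):

```python
import numpy as np

def fuse_scores(convnet_scores, idt_scores, w_convnet=2.0, w_idt=1.0):
    """Weighted average of per-class scores from the two recognition pipelines.
    Scores are assumed to be arrays of shape (num_videos, num_classes)."""
    fused = w_convnet * convnet_scores + w_idt * idt_scores
    return fused / (w_convnet + w_idt)

convnet_scores = np.random.rand(10, 101)  # e.g. softmax outputs of two-stream nets
idt_scores = np.random.rand(10, 101)      # e.g. SVM decision values on iDT+FV
predictions = fuse_scores(convnet_scores, idt_scores).argmax(axis=1)
```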

Action Recognition and Detection by Combining Motion and Appearance Features

by Limin Wang, Yu Qiao, Xiaoou Tang
"... Abstract. We present an action recognition and detection system from temporally untrimmed videos by combining motion and appearance fea-tures. Motion and appearance provides two complementary cues for hu-man action understanding from videos. For motion features, we adopt the Fisher vector representa ..."
Abstract - Cited by 1 (1 self) - Add to MetaCart
We present an action recognition and detection system for temporally untrimmed videos that combines motion and appearance features. Motion and appearance provide two complementary cues for human action understanding from videos. For motion features, we adopt the Fisher vector representation with improved dense trajectories due to its rich descriptive capacity. For appearance features, we choose deep convolutional neural network activations due to their recent success in image based tasks. With this fused feature of iDT and CNN, we train an SVM classifier for each action class in a one-vs-all scheme. We report both the recognition and detection results of our system on the THUMOS 14 Challenge.

Citation Context

... method for the tasks of both action recognition and detection from temporally untrimmed videos. Video representation plays an important role in action recognition and detection. Recent work [7] shows that the Fisher Vector representation [9] with improved Dense Trajectory (iDT) features [11] is very effective for capturing motion information, and it has obtained state-of-the-art perform...
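A small sketch of the fused-feature, one-vs-all SVM classification described in the abstract above (scikit-learn and the feature dimensions are assumptions; the arrays are random stand-ins for iDT Fisher vectors and CNN activations):

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

num_videos, num_classes = 200, 20
idt_fv = np.random.rand(num_videos, 1024)    # stand-in for iDT Fisher vectors
cnn_feat = np.random.rand(num_videos, 4096)  # stand-in for CNN activations
labels = np.random.randint(num_classes, size=num_videos)

# Fuse motion (iDT+FV) and appearance (CNN) features by concatenation,
# then train one linear SVM per class in a one-vs-all scheme.
features = np.hstack([idt_fv, cnn_feat])
clf = OneVsRestClassifier(LinearSVC(C=1.0))
clf.fit(features, labels)
print(clf.predict(features[:5]))
```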

Encoding Feature Maps of CNNs for Action Recognition

by Xiaojiang Peng, Cordelia Schmid - Inria
"... Abstract We describe our approach for action classification in the THUMOS Challenge 2015. Our approach is based on two types of features, improved dense trajectories and CNN features. For trajectory features, we extract HOG, HOF, MBHx, and MBHy descriptors and apply Fisher vector encoding. For CNN ..."
Abstract - Add to MetaCart
We describe our approach for action classification in the THUMOS Challenge 2015. Our approach is based on two types of features, improved dense trajectories and CNN features. For trajectory features, we extract HOG, HOF, MBHx, and MBHy descriptors and apply Fisher vector encoding. For CNN features, we utilize a recent deep CNN model, VGG19, to capture appearance features and use VLAD encoding to encode/pool convolutional feature maps, which shows better performance than average pooling of feature maps and fully-connected activation features. After concatenating them, we train a linear SVM classifier for each class in a one-vs-all scheme.

Citation Context

...inear SVMs in a one-vs-all scheme.
2.1. IDT based representation. Improved dense trajectories (IDT) based video representations have shown excellent performance on many action datasets [7]. IDT includes local appearance (HOG) and motion (HOF/MBH) descriptors. We rescale the videos to be at most 320 pixels wide, and skip every second frame to extract IDT features. We use a vocabulary of size 256 for the GMM, and apply Fisher vector encoding separately for HOG, HOF, MBHx, and MBHy descriptors as in [3, 4]. We then normalize the resulting supervectors by power and intra normalization as suggested in [4], i.e., performing ℓ2 normalization for each FV block independently after power normalization.
2.2. CNN feature maps based representation. CNN features have become increasingly popular in action recognition [2, 9]. In [2], a video representation is obtained by average pooling of fc6 activations extracted from static frames every 10 frames. In [9], VLAD and Fisher vector are applied to fc6 activations and pool5 feature maps for event detection on the TRECVID MED dataset. Following [9], we leverage VLA...
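The normalization described here, power normalization followed by ℓ2-normalizing each Fisher-vector block independently, can be sketched as follows (the block layout and sizes are assumptions for illustration):

```python
import numpy as np

def power_intra_normalize(fv, num_blocks):
    """Apply signed-square-root (power) normalization, then l2-normalize each
    FV block independently ("intra normalization").

    fv: 1-D Fisher vector whose length is a multiple of num_blocks.
    """
    fv = np.sign(fv) * np.sqrt(np.abs(fv))          # power normalization
    blocks = fv.reshape(num_blocks, -1)
    norms = np.linalg.norm(blocks, axis=1, keepdims=True) + 1e-12
    return (blocks / norms).ravel()                  # per-block l2 normalization

# Toy FV for one descriptor type: 256 GMM components, 2 * 64 dims per component.
fv = np.random.randn(256 * 2 * 64)
fv_normalized = power_intra_normalize(fv, num_blocks=256)
```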
