Results 1 - 10 of 19
Two-stream convolutional networks for action recognition in videos
- CoRR
"... We investigate architectures of discriminatively trained deep Convolutional Net-works (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design as ..."
Cited by 43 (3 self)
Abstract:
We investigate architectures of discriminatively trained deep Convolutional Networks (ConvNets) for action recognition in video. The challenge is to capture the complementary information on appearance from still frames and motion between frames. We also aim to incorporate into the network design aspects of the best performing hand-crafted features. Our contribution is three-fold. First, we propose a two-stream ConvNet architecture which incorporates spatial and temporal networks. Second, we demonstrate that a ConvNet trained on multi-frame dense optical flow is able to achieve very good performance in spite of limited training data. Finally, we show that multi-task learning, applied to two different action classification datasets, can be used to increase the amount of training data and improve the performance on both. Our architecture is trained and evaluated on the standard video action benchmarks of UCF-101 and HMDB-51, where it matches the state of the art. It also exceeds by a large margin previous attempts to use deep nets for video classification.
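For readers unfamiliar with the two-stream design, a minimal PyTorch sketch is given below. The small stand-in CNNs, the 10-frame flow stack, and the score-averaging fusion are illustrative assumptions, not the paper's exact architecture.

```python
# Minimal sketch of the two-stream idea: a spatial ConvNet on a single RGB
# frame and a temporal ConvNet on a stack of optical-flow fields, fused by
# averaging class scores. Not the paper's implementation.
import torch
import torch.nn as nn

def make_stream(in_channels: int, num_classes: int) -> nn.Sequential:
    """A small CNN standing in for the much larger streams used in the paper."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(3, stride=2),
        nn.Conv2d(64, 128, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_classes=101, flow_stack=10):
        super().__init__()
        self.spatial = make_stream(3, num_classes)                 # RGB frame
        self.temporal = make_stream(2 * flow_stack, num_classes)   # stacked x/y flow

    def forward(self, rgb, flow):
        # Late fusion: average the per-stream softmax scores.
        p_spatial = self.spatial(rgb).softmax(dim=1)
        p_temporal = self.temporal(flow).softmax(dim=1)
        return (p_spatial + p_temporal) / 2

model = TwoStream()
scores = model(torch.randn(4, 3, 224, 224), torch.randn(4, 20, 224, 224))
print(scores.shape)  # (4, 101)
```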
Video action detection with relational dynamic-poselets
- In ECCV, 2014
"... •Problem: We aim to not only recognize on-going action class (action recognition), but also localize its spatiotemporal extent (action detection), and even estimate the pose of the actor (pose estimation). •Key insights: ..."
Cited by 12 (3 self)
Abstract:
• Problem: We aim not only to recognize the on-going action class (action recognition), but also to localize its spatiotemporal extent (action detection) and even estimate the pose of the actor (pose estimation).
• Key insights:
Action recognition with trajectory-pooled deep-convolutional descriptors
- In CVPR, 2015
"... Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we ..."
Cited by 8 (5 self)
Abstract:
Visual features are of vital importance for human action understanding in videos. This paper presents a new video representation, called trajectory-pooled deep-convolutional descriptor (TDD), which shares the merits of both hand-crafted features [31] and deep-learned features [24]. Specifically, we utilize deep architectures to learn discriminative convolutional feature maps, and conduct trajectory-constrained pooling to aggregate these convolutional features into effective descriptors. To enhance the robustness of TDDs, we design two normalization methods to transform convolutional feature maps, namely spatiotemporal normalization and channel normalization. The advantages of our features come from (i) TDDs are automatically learned and contain high discriminative capacity compared with hand-crafted features; (ii) TDDs take account of the intrinsic characteristics of the temporal dimension and introduce the strategies of trajectory-constrained sampling and pooling for aggregating deep-learned features. We conduct experiments on two challenging datasets: HMDB51 and UCF101. Experimental results show that TDDs outperform previous hand-crafted features [31] and deep-learned features [24]. Our method also achieves superior performance to the state of the art on these datasets.
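A rough NumPy sketch of the two normalizations named in the abstract follows. The (T, H, W, C) layout and the max-based normalization are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of spatiotemporal and channel normalization of conv feature maps,
# assuming maps stored as a numpy array of shape (T, H, W, C).
import numpy as np

def spatiotemporal_normalize(feat, eps=1e-8):
    # Each channel is divided by its maximum over the whole video volume,
    # so every channel's response range becomes comparable.
    max_per_channel = feat.max(axis=(0, 1, 2), keepdims=True)
    return feat / (max_per_channel + eps)

def channel_normalize(feat, eps=1e-8):
    # Each spatiotemporal position is divided by its maximum across channels,
    # so no single channel dominates the descriptor at that position.
    max_per_position = feat.max(axis=3, keepdims=True)
    return feat / (max_per_position + eps)

feat = np.random.rand(30, 28, 28, 512).astype(np.float32)
tdd_input = channel_normalize(spatiotemporal_normalize(feat))
print(tdd_input.shape)
```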
Action and gesture temporal spotting with super vector representation
- In ECCV, ChaLearn Looking at People Workshop, 2014
"... Abstract. This paper focuses on describing our method designed for both track 2 and track 3 at Looking at People (LAP) challenging [1]. We propose an action and gesture spotting system, which is mainly com-posed of three steps: (i) temporal segmentation, (ii) clip classification, and (iii) post proc ..."
Cited by 5 (1 self)
Abstract:
This paper describes our method designed for both track 2 and track 3 of the Looking at People (LAP) challenge [1]. We propose an action and gesture spotting system, which is mainly composed of three steps: (i) temporal segmentation, (ii) clip classification, and (iii) post-processing. For track 2, we resort to a simple sliding-window method to divide each video sequence into clips, while for track 3, we design a segmentation method based on the motion analysis of human hands. Then, for each clip, we choose a super vector representation with dense features. Based on this representation, we train a linear SVM to conduct action and gesture recognition. Finally, we use some post-processing techniques to avoid false positive detections. We demonstrate the effectiveness of our proposed method by participating in both track 2 and track 3. We obtain the best performance on track 2 and rank 4th on track 3, which indicates that the designed system is effective for action and gesture recognition.
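A minimal sketch of the sliding-window segmentation step for track 2 is given below; the window length and stride are placeholder values, not the authors' settings.

```python
# Illustrative sliding-window temporal segmentation: divide a video into
# overlapping clips that each get classified by the linear SVM.
def sliding_window_clips(num_frames, window=30, stride=15):
    """Yield (start, end) frame indices of overlapping clips covering the video."""
    clips = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        clips.append((start, end))
        if end == num_frames:
            break
        start += stride
    return clips

# Each clip would then be described with a super-vector encoding of dense
# features, scored by the SVM, and cleaned up in post-processing.
print(sliding_window_clips(100))
```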
Beyond gaussian pyramid: Multi-skip feature stacking for action recognition
- In CVPR, 2015
"... Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This atten-uation introduces bias to the resulting features and gener-ates ill-conditioned feature matrices. The Gaussian Pyra-mid has ..."
Cited by 4 (0 self)
Abstract:
Most state-of-the-art action feature extractors involve differential operators, which act as highpass filters and tend to attenuate low frequency action information. This attenuation introduces bias to the resulting features and generates ill-conditioned feature matrices. The Gaussian Pyramid has been used as a feature enhancing technique that encodes scale-invariant characteristics into the feature space in an attempt to deal with this attenuation. However, at the core of the Gaussian Pyramid is a convolutional smoothing operation, which makes it incapable of generating new features at coarse scales. In order to address this problem, we propose a novel feature enhancing technique called Multi-skIp Feature Stacking (MIFS), which stacks features extracted using a family of differential filters parameterized with multiple time skips and encodes shift-invariance into the frequency space. MIFS compensates for information lost from using differential operators by recapturing information at coarse scales. This recaptured information allows us to match actions at different speeds and ranges of motion. We prove that MIFS enhances the learnability of differential-based features exponentially. The resulting feature matrices from MIFS have much smaller condition numbers and variances than those from conventional methods. Experimental results show significantly improved performance on challenging action recognition and event detection tasks.
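A toy sketch of the multi-skip idea follows, using a simple frame-difference operator as a stand-in for the dense trajectory features actually used in the paper; the skip values are illustrative.

```python
# Extract the same differential feature at several time skips and stack the
# results, so coarse-scale motion lost to frame-to-frame differencing is
# recaptured. The frame-difference "feature" is only a placeholder.
import numpy as np

def frame_difference_feature(frames, skip):
    # Differential operator applied across a time skip of `skip` frames.
    return np.abs(frames[skip:] - frames[:-skip]).mean(axis=(1, 2))

def mifs(frames, skips=(1, 2, 4, 8)):
    # Stack the features computed at every skip into one representation.
    return np.concatenate([frame_difference_feature(frames, s) for s in skips])

video = np.random.rand(64, 120, 160).astype(np.float32)  # T x H x W grayscale
representation = mifs(video)
print(representation.shape)
```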
A discriminative cnn video representation for event detection
- In CVPR, 2015
"... In this paper, we propose a discriminative video rep-resentation for event detection over a large scale video dataset when only limited hardware resources are avail-able. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where on ..."
Cited by 3 (1 self)
Abstract:
In this paper, we propose a discriminative video representation for event detection over a large scale video dataset when only limited hardware resources are available. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkit. This paper makes two contributions to the inference of CNN video representation. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be significantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally affordable. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8% for the TRECVID MEDTest 14 dataset and from 34.0% to 44.6% for the TRECVID MEDTest 13 dataset.
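As a hedged illustration of replacing average/max pooling with an encoding method, the sketch below aggregates frame-level descriptors with a VLAD-style encoding; the codebook size, descriptor dimension, and the use of scikit-learn k-means are assumptions, not the paper's setup.

```python
# VLAD-style aggregation of frame-level CNN descriptors into one video vector.
import numpy as np
from sklearn.cluster import KMeans

def vlad_encode(frame_descriptors, kmeans):
    """Aggregate (num_frames, dim) frame descriptors into one VLAD vector."""
    centers = kmeans.cluster_centers_
    assignments = kmeans.predict(frame_descriptors)
    vlad = np.zeros_like(centers)
    for k in range(centers.shape[0]):
        members = frame_descriptors[assignments == k]
        if len(members):
            vlad[k] = (members - centers[k]).sum(axis=0)
    vlad = vlad.flatten()
    vlad = np.sign(vlad) * np.sqrt(np.abs(vlad))      # power normalization
    return vlad / (np.linalg.norm(vlad) + 1e-12)       # L2 normalization

train_descs = np.random.rand(5000, 256)                # pooled training frames
kmeans = KMeans(n_clusters=64, n_init=4, random_state=0).fit(train_descs)
video_descs = np.random.rand(300, 256)                 # one video's frames
video_repr = vlad_encode(video_descs, kmeans)
print(video_repr.shape)  # (64 * 256,)
```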
Towards good practices for very deep two-stream ConvNets
- CoRR
"... Deep convolutional networks have achieved great suc-cess for object recognition in still images. However, for ac-tion recognition in videos, the improvement of deep convo-lutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First the curre ..."
Cited by 2 (2 self)
Abstract:
Deep convolutional networks have achieved great success for object recognition in still images. However, for action recognition in videos, the improvement from deep convolutional networks is not so evident. We argue that there are two reasons that could probably explain this result. First, the current network architectures (e.g. Two-stream ConvNets [12]) are relatively shallow compared with the very deep models in the image domain (e.g. VGGNet [13], GoogLeNet [15]), and therefore their modeling capacity is constrained by their depth. Second, and probably more importantly, the training datasets for action recognition are extremely small compared with the ImageNet dataset, and thus it is easy to over-fit on the training dataset. To address these issues, this report presents very deep two-stream ConvNets for action recognition, by adapting recent very deep architectures to the video domain. However, this extension is not easy, as the action recognition datasets are quite small. We design several good practices for the training of very deep two-stream ConvNets, namely (i) pre-training for both spatial and temporal nets, (ii) smaller learning rates, (iii) more data augmentation techniques, and (iv) a high dropout ratio. Meanwhile, we extend the Caffe toolbox into a multi-GPU implementation with high computational efficiency and low memory consumption. We verify the performance of very deep two-stream ConvNets on the UCF101 dataset, where they achieve a recognition accuracy of 91.4%.
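The sketch below illustrates the four listed practices in PyTorch/torchvision terms (the report itself extends Caffe); the VGG-16 backbone, the dropout ratio of 0.8, the learning rate, and the augmentation settings are assumed values for illustration only.

```python
# Illustrative training setup reflecting the four practices listed above.
import torch
import torchvision

# (i) Pre-training: initialize the spatial stream from an ImageNet model.
spatial_net = torchvision.models.vgg16(weights="IMAGENET1K_V1")

# (iv) High dropout ratio in the classifier layers.
for module in spatial_net.classifier:
    if isinstance(module, torch.nn.Dropout):
        module.p = 0.8

# Replace the final layer for the 101 action classes of UCF101.
spatial_net.classifier[-1] = torch.nn.Linear(4096, 101)

# (ii) Smaller learning rate than typical ImageNet training.
optimizer = torch.optim.SGD(spatial_net.parameters(), lr=0.001, momentum=0.9)

# (iii) More aggressive data augmentation (multi-scale crops, flips).
transform = torchvision.transforms.Compose([
    torchvision.transforms.RandomResizedCrop(224, scale=(0.5, 1.0)),
    torchvision.transforms.RandomHorizontalFlip(),
    torchvision.transforms.ToTensor(),
])
```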
CUHK&SIAT submission for THUMOS15 action recognition challenge
- In THUMOS’15 Action Recognition Challenge
"... This paper presents the method of our submission for THUMOS15 action recognition challenge. We propose a new action recognition system by exploiting very deep two-stream ConvNets and Fisher vector representation of iDT features. Specifically, we utilize those successful very deep architectures in im ..."
Cited by 1 (1 self)
Abstract:
This paper presents the method of our submission for the THUMOS15 action recognition challenge. We propose a new action recognition system by exploiting very deep two-stream ConvNets and a Fisher vector representation of iDT features. Specifically, we utilize successful very deep image architectures such as GoogLeNet and VGGNet to design the two-stream ConvNets. From our experiments, we see that deeper architectures obtain higher performance for the spatial net. However, for the temporal net, deeper architectures could not yield better recognition accuracy. We analyze that the UCF101 dataset is relatively small and it is very hard to train such deep networks on the current action datasets. Compared with traditional iDT features, our implemented two-stream ConvNets significantly outperform them. We further combine the recognition scores of both the two-stream ConvNets and iDT features, and achieve a 68% mAP value on the validation dataset of THUMOS15.
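A minimal sketch of the late score fusion between the two-stream ConvNets and the iDT Fisher vector classifier follows; the fusion weights are placeholders, not the submission's tuned values.

```python
# Late fusion: combine per-class scores from the two-stream ConvNets and the
# iDT Fisher-vector classifier with a weighted sum.
import numpy as np

def fuse_scores(convnet_scores, idt_scores, w_convnet=0.6, w_idt=0.4):
    """Both inputs are (num_videos, num_classes) score matrices."""
    return w_convnet * convnet_scores + w_idt * idt_scores

convnet_scores = np.random.rand(5, 101)
idt_scores = np.random.rand(5, 101)
predictions = fuse_scores(convnet_scores, idt_scores).argmax(axis=1)
print(predictions)
```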
Action Recognition and Detection by Combining Motion and Appearance Features
"... Abstract. We present an action recognition and detection system from temporally untrimmed videos by combining motion and appearance fea-tures. Motion and appearance provides two complementary cues for hu-man action understanding from videos. For motion features, we adopt the Fisher vector representa ..."
Cited by 1 (1 self)
Abstract:
We present an action recognition and detection system for temporally untrimmed videos that combines motion and appearance features. Motion and appearance provide two complementary cues for human action understanding from videos. For motion features, we adopt the Fisher vector representation with improved dense trajectories due to its rich descriptive capacity. For appearance features, we choose deep convolutional neural network activations due to their recent success in image-based tasks. With this fused feature of iDT and CNN, we train an SVM classifier for each action class in a one-vs-all scheme. We report both the recognition and detection results of our system on the THUMOS 14 Challenge.
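A short scikit-learn sketch of the one-vs-all SVM stage on concatenated iDT Fisher vector and CNN features follows; the feature dimensions and synthetic data are assumptions for illustration.

```python
# One-vs-all linear SVMs on a concatenation of motion and appearance features.
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.multiclass import OneVsRestClassifier

num_videos, num_classes = 200, 20
idt_fisher = np.random.rand(num_videos, 1024)        # motion: iDT + Fisher vector
cnn_activations = np.random.rand(num_videos, 512)    # appearance: CNN activations
labels = np.random.randint(0, num_classes, size=num_videos)

fused = np.hstack([idt_fisher, cnn_activations])     # simple feature concatenation
classifier = OneVsRestClassifier(LinearSVC(C=1.0))   # one linear SVM per class
classifier.fit(fused, labels)
print(classifier.decision_function(fused[:3]).shape)  # (3, num_classes)
```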
Encoding Feature Maps of CNNs for Action Recognition
"... Abstract We describe our approach for action classification in the THUMOS Challenge 2015. Our approach is based on two types of features, improved dense trajectories and CNN features. For trajectory features, we extract HOG, HOF, MBHx, and MBHy descriptors and apply Fisher vector encoding. For CNN ..."
Abstract:
We describe our approach for action classification in the THUMOS Challenge 2015. Our approach is based on two types of features: improved dense trajectories and CNN features. For trajectory features, we extract HOG, HOF, MBHx, and MBHy descriptors and apply Fisher vector encoding. For CNN features, we utilize a recent deep CNN model, VGG19, to capture appearance features and use VLAD encoding to encode/pool convolutional feature maps, which shows better performance than average pooling of feature maps and fully-connected activation features. After concatenating them, we train a linear SVM classifier for each class in a one-vs-all scheme.
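A brief sketch of treating each spatial location of a VGG19 convolutional feature map as a local descriptor, which a VLAD encoder (such as the one sketched earlier in this list) could then pool; the choice of the final conv layer is an assumption, not necessarily the layer used in the submission.

```python
# Turn a VGG19 conv feature map into a set of local descriptors for encoding.
import torch
import torchvision

vgg19 = torchvision.models.vgg19(weights="IMAGENET1K_V1").features.eval()

with torch.no_grad():
    frame = torch.randn(1, 3, 224, 224)      # one video frame
    feature_map = vgg19(frame)               # (1, 512, 7, 7)

# Flatten the 7x7 spatial grid into 49 local descriptors of dimension 512,
# ready for VLAD encoding and pooling across the video's frames.
local_descriptors = feature_map.squeeze(0).flatten(1).t()  # (49, 512)
print(local_descriptors.shape)
```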