Latent Hierarchical Model of Temporal Structure for Complex Activity Classification
Citations: 8 (7 self-citations)
Citations
6582 | Neural Networks for Pattern Recognition
- Bishop
Citation context: ...sian Networks (DBNs) [4], [34]. Learning and inference in graphical models were usually conducted with approximate methods such as Expectation Maximization, Variational Methods, and Sampling Methods [35]. The learning process is complex and usually needs a large amount of data to avoid overfitting. In addition to graphical models, some research works resorted to Max-Margin Methods [12], [14]. They for...
6467 | LIBSVM: a library for support vector machines
- Chang, Lin
- 2011
Citation context: ...ee public action datasets: the KTH [48], the Hollywood2 [49], and the Olympic Sports Dataset [12]. We then explore some important aspects of LHM. For all three datasets, we use the LIBSVM package [50] to solve the standard SVM problem in the learning framework of Section IV-A. For multi-class classification, we apply the one-vs-all training scheme. A. KTH Dataset. The KTH is a relatively simple dat...
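The one-vs-all scheme trains one binary SVM per class and predicts the class with the largest decision value. Below is a minimal sketch of the relabeling and the decision rule, assuming linear classifiers. The helper names and the hand-picked weights are hypothetical. Training itself (done with LIBSVM in the paper) is not shown.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

# Hypothetical per-class linear model: (weight vector w, bias b).
LinearModel = Tuple[List[float], float]


def one_vs_all_labels(labels: Sequence[Hashable], positive: Hashable) -> List[int]:
    """Binary targets for the classifier of class `positive`: +1 for it, -1 for every other class."""
    return [1 if y == positive else -1 for y in labels]


def decision_value(model: LinearModel, x: Sequence[float]) -> float:
    """Decision value w . x + b of one binary linear classifier."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict_one_vs_all(models: Dict[Hashable, LinearModel], x: Sequence[float]) -> Hashable:
    """Predict the class whose binary classifier returns the largest decision value."""
    return max(models, key=lambda c: decision_value(models[c], x))


print(one_vs_all_labels(["run", "jump", "run", "walk"], "run"))  # [1, -1, 1, -1]

# Hand-picked weights for illustration only; in practice each (w, b) comes from a trained binary SVM.
models = {"run": ([1.0, 0.0], 0.0), "jump": ([0.0, 1.0], 0.0), "walk": ([-1.0, -1.0], 0.5)}
print(predict_one_vs_all(models, [0.2, 0.9]))  # jump
```

Each class's classifier sees every other class as negative, so N classes need N binary training runs.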
3382 | A tutorial on support vector machines for pattern recognition
- Burges
- 1998
Citation context: ...$\omega_{i,j}, \dots)$, $\Upsilon(V_m, h) = \big(\phi_1(V_m, z_1), \dots, \phi_N(V_m, z_N), \dots, \psi_{i,j}(z_i, z_j), \dots\big)$ (8). During training, each training sample $V_m$ has only a class label $y_m$. Unlike a traditional SVM [46], the problem (Equation (6)) is not convex, since $f^*(V_m)$ contains a maximization over $h$; this formulation is called Latent SVM in [16]. It can be shown that the problem will become convex for the model par...
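The non-convexity comes from maximizing over the latent configuration $h$. The sketch below evaluates a latent score of the form $\max_h \omega \cdot \Upsilon(V, h)$ by brute force. It assumes a hypothetical one-level model in which $h$ is a single boundary splitting the video into two sub-activities, and $\Upsilon$ returns the mean per-frame feature of each segment. LHM's actual features, hierarchy and deformation terms are not modeled.

```python
from typing import Callable, Iterable, Optional, Sequence, Tuple

Features = Sequence[float]


def latent_score(
    w: Sequence[float],
    upsilon: Callable[[Sequence[float], int], Features],
    video: Sequence[float],
    latent_space: Iterable[int],
) -> Tuple[float, Optional[int]]:
    """Return max_h w . Upsilon(video, h) and the maximizing latent configuration h."""
    best_score, best_h = float("-inf"), None
    for h in latent_space:
        score = sum(wi * fi for wi, fi in zip(w, upsilon(video, h)))
        if score > best_score:
            best_score, best_h = score, h
    return best_score, best_h


def two_segment_features(video: Sequence[float], t: int) -> Features:
    """Hypothetical Upsilon: latent h is a boundary t; features are the means of the two segments."""
    first, second = video[:t], video[t:]
    return (sum(first) / len(first), sum(second) / len(second))


# Toy per-frame scalar features: a "high" sub-activity followed by a "low" one.
video = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]
w = (1.0, -1.0)  # hand-picked weights rewarding "high, then low"
score, boundary = latent_score(w, two_segment_features, video, range(1, len(video)))
print(boundary)  # 3
```

Because the score is a maximum of linear functions of $\omega$, it is convex in $\omega$. That term enters the loss with a negative sign for positive samples, so the overall objective is non-convex. Fixing $h$ for the positives makes it convex again.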
1418 | Object detection with discriminatively trained part based models
- Felzenszwalb, Girshick, et al.
Citation context: ...our model flexible and effective in dealing with the duration variations and temporal displacements of sub-activities. We formulate the learning and inference problem of LHM in the latent SVM framework [16]. Since LHM has a deeper structure with more latent variables, it is infeasible to traverse all possible configurations of sub-activities during classification. We develop a cascade inference ...
1001 | Visual categorization with bags of keypoints
- Csurka, Dance, et al.
- 2004
Citation context: ...ly is composed of two steps: (i) encoding of the local features, and (ii) feature pooling and normalization. There is a large body of research on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss ...
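Hard and soft encoding differ in how a local feature's weight is spread across codewords. The sketch below shows VQ (one-hot on the nearest codeword) and a Gaussian-kernel soft assignment. The codebook, the smoothing parameter `beta`, and the helper names are illustrative assumptions, not the settings used in the cited works.

```python
import math
from typing import List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def vq_encode(x: Sequence[float], codebook: Sequence[Sequence[float]]) -> List[float]:
    """Vector quantization: a one-hot code on the nearest codeword."""
    d = [sq_dist(x, c) for c in codebook]
    k = d.index(min(d))
    return [1.0 if i == k else 0.0 for i in range(len(codebook))]


def soft_assign_encode(
    x: Sequence[float], codebook: Sequence[Sequence[float]], beta: float = 1.0
) -> List[float]:
    """Soft assignment: weights proportional to exp(-beta * ||x - c_k||^2), summing to 1."""
    d = [sq_dist(x, c) for c in codebook]
    d_min = min(d)  # shift by the minimum distance for numerical stability
    e = [math.exp(-beta * (di - d_min)) for di in d]
    z = sum(e)
    return [ei / z for ei in e]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
x = [0.9, 0.8]
print(vq_encode(x, codebook))  # [0.0, 1.0, 0.0]
print(soft_assign_encode(x, codebook))
```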
757 | Recognizing human actions: A local SVM approach
- Schuldt, Laptev, et al.
- 2004
Citation context: ...$\beta_2$ ($\beta_2 = 1.3$ in experiments) over positive training samples. Note that the deformation term is usually negative. V. EXPERIMENTS. We first conduct experiments on three public action datasets: the KTH [48], the Hollywood2 [49], and the Olympic Sports Dataset [12]. We then explore some important aspects of LHM. For all three datasets, we use the LIBSVM package [50] to solve the standard SVM problem ...
735 | Learning realistic human actions from movies
- Laptev, Marszalek, et al.
- 2008
716 | Behavior recognition via sparse spatio-temporal features
- Dollár, Rabaud, et al.
- 2005
Citation context: ...es turn out to be effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. Then, a local 3D region is extracted around each interest point and a histogram descriptor is computed to capture appearance and motion information. Typical des...
537 | A Bayesian computer vision system for modeling human interactions
- Oliver, Rosario, et al.
Citation context: ...onds to a relatively simple sub-activity, and there is a temporal order among these phases. The importance of temporal structure in activity classification has been demonstrated in previous works [9]–[15]. However, effectively modeling temporal structure remains challenging because of two problems. The first problem is that a “sub-activity” usually has no precise definition given a ...
491 | Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification
- Yang, Yu, et al.
- 2009
Citation context: ...nd normalization. There is a large body of research on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss and improving encoding efficiency. For pooling, there are usually two typical m...
436 | Locality-constrained linear coding for image classification
- Wang, Yang, et al.
- 2010
Citation context: ...arches on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss and improving encoding efficiency. For pooling, there are usually two typical methods, sum pooling [27] and max pooling [30], an...
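Sum and max pooling differ only in how the per-feature codes are aggregated into one video-level vector. Below is a minimal sketch with hypothetical hard-assignment codes, followed by L2 normalization. The helper names are illustrative.

```python
import math
from typing import List, Sequence


def sum_pool(codes: Sequence[Sequence[float]]) -> List[float]:
    """Sum pooling: add the codes of all local features, dimension by dimension."""
    return [sum(col) for col in zip(*codes)]


def max_pool(codes: Sequence[Sequence[float]]) -> List[float]:
    """Max pooling: keep the strongest response in each dimension."""
    return [max(col) for col in zip(*codes)]


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


# Codes of four local features over a 3-word codebook (e.g. from hard VQ).
codes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
print(sum_pool(codes))  # [3.0, 1.0, 0.0]
print(max_pool(codes))  # [1.0, 1.0, 0.0]
```

With hard VQ codes, sum pooling yields the classic word-count histogram. Max pooling only records whether each word fired.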
292 | Action recognition by dense trajectories
- Wang, Kläser, et al.
- 2011
Citation context: ...e effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. Then, a local 3D region is extracted around each interest point and a histogram descriptor is computed to capture appearance and motion information. Typical descriptors include...
274 | Evaluation of Local Spatio-Temporal Features for Action Recognition
- Wang, Ullah, et al.
- 2009
Citation context: ...refer to [1]–[3] for good surveys. Video Representation. Video representation has been a central issue in activity recognition. Low-level local features have proven effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. Then, a local 3D region is extra...
233 | A spatio-temporal descriptor based on 3D-gradients
- Klaser, Marszalek, et al.
- 2008
Citation context: ...d motion information. Typical descriptors include Histogram of Oriented Gradients and Histogram of Optical Flow (HOG/HOF) [23], Motion Boundary Histogram (MBH) [22], 3D Histogram of Oriented Gradients (HOG3D) [24], Extended SURF (ESURF) [20], the co-occurrence descriptor [25], and so on. Finally, a global representation is obtained for each video clip via a statistical model. Among these statistical models, Bag of...
222 | Automatic soccer video analysis and summarization
- Ekin, Tekalp, et al.
Citation context: ... vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from many aspects. First, there are always large intra-class appearance and motion variations within ... (Manuscript received February 22, 2013; revised...)
218 | Machine Recognition of Human Activities: A Survey
- Turaga, Chellappa
- 2008
214 | Human activity analysis: A review
- Aggarwal, Ryoo
Citation context: ...ctivity classification, hierarchical model, deep structure, latent learning, cascade inference. I. INTRODUCTION. Human activity classification is an important yet difficult problem in computer vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and co...
170 | Action bank: A high-level representation of activity in video
- Sadanand, Corso
- 2012
Citation context: ... to these low-level local features and the BoVW representation, some research works have studied mid-level and high-level representations such as motionlets [32], motion atoms and phrases [15], action bank [33], and so on. Temporal Structure. The importance of temporal structure in recognizing human activity has been studied in previous research [9]–[15] and [34]. Probabilistic graphical models were usu...
168 | Cascade object detection with deformable part models
- Felzenszwalb, Girshick, et al.
- 2010
Citation context: ...nt variables for each segment, so it is very time-consuming to calculate the appearance similarities of all possible configurations in advance. Inspired by the cascade object detection method in [47], we design a cascade inference algorithm for LHM. The core idea of our algorithm is to use dynamic programming and pruning techniques to constrain the search space of h and accelerate the proce...
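The general idea of combining dynamic programming with pruning can be sketched for a flat, single-layer segmentation. A DP runs over segment boundaries, and partial hypotheses whose score falls below a per-stage threshold are discarded and never extended. This is not LHM's actual cascade algorithm. The segment score, the thresholds, and the function names are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

NEG_INF = float("-inf")


def segment_dp(
    n: int,
    k: int,
    seg_score: Callable[[int, int], float],
    thresholds: Optional[Sequence[float]] = None,
) -> Tuple[float, List[int]]:
    """Best split of frames [0, n) into k contiguous segments, maximizing the sum of seg_score(s, t).

    If thresholds is given, a partial hypothesis covering j segments is pruned when its score
    falls below thresholds[j - 1] (a hypothetical cascade-style rejection rule).
    """
    best = [[NEG_INF] * (n + 1) for _ in range(k + 1)]
    back = [[-1] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for j in range(1, k + 1):
        for t in range(j, n + 1):
            for s in range(j - 1, t):
                if best[j - 1][s] == NEG_INF:
                    continue  # unreachable or pruned prefix: never extended
                cand = best[j - 1][s] + seg_score(s, t)
                if cand > best[j][t]:
                    best[j][t], back[j][t] = cand, s
            if thresholds is not None and best[j][t] < thresholds[j - 1]:
                best[j][t] = NEG_INF  # reject this partial hypothesis
    if best[k][n] == NEG_INF:
        return NEG_INF, []
    bounds, t = [n], n
    for j in range(k, 0, -1):  # follow back-pointers to recover segment boundaries
        t = back[j][t]
        bounds.append(t)
    return best[k][n], bounds[::-1]


video = [0.0, 0.0, 0.0, 5.0, 5.0, 5.0]


def homogeneity(s: int, t: int) -> float:
    """Toy segment score: negative squared deviation from the segment mean."""
    seg = video[s:t]
    m = sum(seg) / len(seg)
    return -sum((x - m) ** 2 for x in seg)


print(segment_dp(len(video), 2, homogeneity))  # (0.0, [0, 3, 6])
print(segment_dp(len(video), 2, homogeneity, thresholds=[-5.0, -5.0]))  # (0.0, [0, 3, 6])
```

With these loose thresholds the pruned search returns the same segmentation while extending fewer first-stage hypotheses. Thresholds that are too strict can reject the best configuration, in which case no segmentation is returned.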
167 | An efficient dense and scale-invariant spatio-temporal interest point detector
- Willems, Tuytelaars, et al.
- 2008
Citation context: ...l local features have proven effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. Then, a local 3D region is extracted around each interest point and a histogram descriptor is computed to capture appearance and motion information. Typical...
157 | Modeling Temporal Structure of Decomposable Motion Segments for Activity Classification (ECCV)
- Niebles, Chen, et al.
- 2010
Citation context: ... V). II. RELATED WORK. Human activity classification has been studied extensively in recent years. In this paper, complex activities are those with long temporal structures, such as sports actions [12], composite cooking actions [17], and so on. Here we review only a few related works; readers can refer to [1]–[3] for good surveys. Video Representation. Video representation has been a central ...
144 | Learning a hierarchy of discriminative spacetime neighborhood features
- Kovashka, Grauman
- 2010
Citation context: ...gically inspired by the brain architecture and the visual system [36], [37]. It has been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hier...
125 | Kernel codebooks for scene categorization
- Gemert, Geusebroek, et al.
Citation context: ...oding of the local features, and (ii) feature pooling and normalization. There is a large body of research on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss and improving encoding efficiency. F...
122 | Dense trajectories and motion boundary descriptors for action recognition
- Wang, Klaser, et al.
- 2013
Citation context: ...− 3 − 9 structure is shown in Fig. 7. For the top layer, we need to calculate the same number of segm... (Table IV caption: Results of LHM with dense trajectory on the Olympic Sports dataset and the Hollywood2 dataset. We compare our results with those of the state-of-the-art approach [53]. Bold font indicates the best performance. Fig. 7 caption: Comparison of efficiency between inference with and without cascade.)
119 | Multimodal Human Computer Interaction: A Survey
- Jaimes, Sebe
- 2007
Citation context: ...ficult problem in computer vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from many aspects. First, there are always large intra-class appearance and motio...
105 | Hidden conditional random fields for gesture recognition
- Wang, Quattoni, et al.
- 2006
Citation context: ... human activity or motion trajectories, such as Hidden Markov Models (HMMs) [5], [9], Hidden Conditional Random Fields (HCRFs) [10], [11], and Dynamic Bayesian Networks (DBNs) [4], [34]. Learning and inference in graphical models were usually conducted with approximate methods such as Expectation Maximization, Variational Method... (Page header: 812 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 2, FEBRUARY 2014)
98 | Understanding Videos, Constructing Plots: Learning a Visually Grounded Storyline Model from Annotated Videos
- Gupta, Srinivasan, et al.
- 2009
Citation context: ...ision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with deep structure for object detection. In their method, an object was represented by a mixtu...
91 | Image classification with the Fisher vector: theory and practice
- Sánchez, Perronnin, et al.
- 2013
Citation context: ...es, and (ii) feature pooling and normalization. There is a large body of research on encoding methods such as Vector Quantization (VQ) [27], Soft-assignment Encoding (SA) [28], Fisher Vector (FV) [29], Sparse Coding (SPC) [30], Locality-constrained Linear Encoding (LLC) [31], and so on. These methods focus on minimizing information loss and improving encoding efficiency. For pooling, there ...
86 | Latent Hierarchical Structural Learning for Object Detection (CVPR)
- Zhu, Chen, et al.
- 2010
Citation context: ... architecture and the visual system [36], [37]. It has been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with deep stru...
75 | Learning latent temporal structure for complex event detection
- Tang, Li, et al.
- 2012
Citation context: ...g Methods [35]. The learning process is complex and usually needs a large amount of data to avoid overfitting. In addition to graphical models, some research works resorted to Max-Margin Methods [12], [14]. They formulated the learning problem using Latent SVM [16], which has been shown to be effective in object detection. These methods used Latent SVM to estimate the model parameters and condu...
74 | Convolutional learning of spatiotemporal features
- Taylor, Fergus, et al.
- 2010
Citation context: ...ompared with the KTH dataset. Comparison with Other Methods. We compare our method with three other methods: the BoVW model [19] (baseline), the context model [49], and the Convolutional Gated RBM (GRBM) [51]. The BoVW model uses the same features and codebook size, and the context model exploits static scenes as a cue for action recognition. The Convolutional Gated RBM aims to learn the features direc...
68 | Computational studies of human motion: part 1, tracking and motion
- Forsyth, Arikan, et al.
- 2005
Citation context: ...ty classification, hierarchical model, deep structure, latent learning, cascade inference. I. INTRODUCTION. Human activity classification is an important yet difficult problem in computer vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content...
65 | Detecting activities of daily living in first-person camera views
- Pirsiavash, Ramanan
- 2012
Citation context: ...e settings and their influence on final recognition performance. Second, we investigate the effectiveness of latent variables by comparing the recognition performance of LHM with temporal pyramids [52]. Temporal pyramids decompose each video into segments of equal duration, while LHM automatically aligns each video by efficient search in the latent variable space. Then, we investigate the inference efficien...
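A temporal pyramid fixes segment boundaries purely by duration, whereas LHM treats them as latent variables to be searched. Below is a minimal sketch, assuming scalar per-frame features, average pooling, and 2**l segments at level l. The pyramid layout and pooling used in [52] may differ.

```python
from typing import List, Sequence


def temporal_pyramid(frames: Sequence[float], levels: int) -> List[float]:
    """Average-pool per-frame features over equal-duration segments.

    Level l splits the video into 2**l segments of (nearly) equal length, regardless of
    content; the pooled values of all levels are concatenated.
    """
    n = len(frames)
    pooled = []
    for level in range(levels):
        parts = 2 ** level
        for p in range(parts):
            seg = frames[p * n // parts:(p + 1) * n // parts]
            pooled.append(sum(seg) / len(seg) if seg else 0.0)
    return pooled


print(temporal_pyramid([1.0, 1.0, 3.0, 3.0], levels=2))  # [2.0, 1.0, 3.0]
```

If the same sub-activities start at different times in another video, the fixed boundaries no longer line up with them. Latent alignment is meant to address exactly this.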
59 | Object trajectory-based activity classification and recognition using hidden Markov models
- Bashir, Khokhar, et al.
- 2007
Citation context: ...cation is an important yet difficult problem in computer vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from many aspects. First, there are always large in...
45 | A quantitative theory of immediate visual recognition
- Serre, Kreiman, et al.
Citation context: ...l structure plays an important role in improving recognition performance. Hierarchical Model. Hierarchical tree-structured models are biologically inspired by the architecture of the brain and the visual system [36], [37]. They have been widely used in computer vision and have achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42],...
41 | Leveraging temporal, contextual and ordering constraints for recognizing complex activities
- Laxton, Lim, et al.
- 2007
Citation context: ...motion atoms and phrases [15], action bank [33], and so on. Temporal Structure. The importance of temporal structure in recognizing human activity has been studied in previous research [9]–[15] and [34]. Probabilistic graphical models were usually adopted to model the temporal structure of human activity or motion trajectories, such as Hidden...
39 | Hidden part models for human action recognition: Probabilistic versus max margin
- Wang, Mori
- 2011
Citation context: ... activity or motion trajectories, such as Hidden Markov Models (HMMs) [5], [9], Hidden Conditional Random Fields (HCRFs) [10], [11], and Dynamic Bayesian Networks (DBNs) [4], [34]. Learning and inference in graphical models were usually conducted with approximate methods such as Expectation Maximization, Variational Methods, and...
36 | On space-time interest points
- Laptev
- 2005
Citation context: ...gnition. Low-level local features have proven effective in action recognition [18]. In recent years, researchers have developed many different spatiotemporal detectors for video, such as 3D-Harris [19], 3D-Hessian [20], Cuboids [21], and Dense [22]. Then, a local 3D region is extracted around each interest point and a histogram descriptor is computed to capture appearance and motion informati...
27 | A comparative study of encoding, pooling and normalization methods for action recognition
- Wang, Wang, et al.
- 2012
Citation context: ...nd so on. Finally, a global representation is obtained for each video clip via a statistical model. Among these statistical models, Bag of Visual Words (BoVW) is a common choice in action recognition [26]. Based on local features, BoVW construction usually consists of two steps: (i) encoding of the local features, and (ii) feature pooling and normalization. There is a large body of research on the ...
21 | Motionlets: mid-level 3D parts for human motion recognition
- Wang, Qiao, et al.
Citation context: ...tion, and power normalization [29]. In addition to these low-level local features and the BoVW representation, some research works have studied mid-level and high-level representations such as motionlets [32], motion atoms and phrases [15], action bank [33], and so on. Temporal Structure. The importance of temporal structure in recognizing human activity has been studied in previous research [9]–[15] an...
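Power normalization, as used with Fisher vectors, is usually applied per dimension as sign(z)·|z|^α before L2 normalization. A short sketch follows. α = 0.5 is a common choice but is an assumption here, and the helper names are illustrative.

```python
import math
from typing import List, Sequence


def power_normalize(v: Sequence[float], alpha: float = 0.5) -> List[float]:
    """Signed power normalization: z -> sign(z) * |z|**alpha, applied per dimension."""
    return [math.copysign(abs(z) ** alpha, z) for z in v]


def l2_normalize(v: Sequence[float]) -> List[float]:
    """Scale v to unit Euclidean norm (a zero vector is returned unchanged)."""
    n = math.sqrt(sum(z * z for z in v))
    return [z / n for z in v] if n > 0 else list(v)


v = [4.0, -9.0, 0.0]
print(power_normalize(v))  # [2.0, -3.0, 0.0]
print(l2_normalize(power_normalize(v)))
```

The square root compresses large components relative to small ones, so a few dominant dimensions weigh less in the final vector.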
21 | An implicit shape model for combined object categorization and segmentation
- Leibe, Leonardis, et al.
- 2006
Citation context: ...ub-activity can be further decomposed recursively until reaching a leaf node, which represents an atomic activity (e.g., start run, speed up, jump up, rolling). In essence, LHM is a generalization of the STAR model [45] under the independence assumption that child nodes are placed independently in a coordinate system determined by their parent node. This generalization provides more descriptive capacity to LHM and ye...
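The independence assumption means that, given a parent's position, each child's placement can be optimized separately. Below is a brute-force sketch of this recursion. It assumes a hypothetical node structure with a per-position appearance score, an anchor offset in the parent's coordinate system, and a quadratic deformation penalty, so the deformation contribution is never positive. LHM's actual features and parameterization are not reproduced.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical sub-activity node.

    appearance(pos) scores the node at temporal position pos; anchor is the node's expected
    offset in its parent's coordinate system; deform weights a quadratic displacement penalty.
    """
    appearance: Callable[[int], float]
    anchor: int = 0
    deform: float = 0.1
    children: List["Node"] = field(default_factory=list)


def node_score(node: Node, pos: int, max_shift: int = 2) -> float:
    """Score of node at pos plus, for each child, its best displaced placement minus deformation."""
    total = node.appearance(pos)
    for child in node.children:
        # Children are independent given the parent, so each child's displacement d is maximized separately.
        total += max(
            node_score(child, pos + child.anchor + d, max_shift) - child.deform * d * d
            for d in range(-max_shift, max_shift + 1)
        )
    return total


def peak_at(target: int) -> Callable[[int], float]:
    """Toy appearance score: 1.0 at one temporal position, 0.0 elsewhere."""
    return lambda pos: 1.0 if pos == target else 0.0


root = Node(appearance=lambda pos: 0.0, children=[
    Node(appearance=peak_at(1), anchor=0),  # expected at root + 0, observed at 1
    Node(appearance=peak_at(5), anchor=4),  # expected at root + 4, observed at 5
])
print(round(node_score(root, 0), 6))  # 1.8
```

Each child gains 1.0 from its appearance peak but pays 0.1 for being displaced by one frame, giving 0.9 per child. The recursion enumerates every displacement at every level, so its cost grows quickly with depth.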
20 | Action Recognition with Multiscale Spatio-Temporal Contexts
- Wang, Chen, et al.
- 2011
19 | Learning deep architectures for AI
- Bengio
- 2009
Citation context: ... biologically inspired by the brain architecture and the visual system [36], [37]. It has been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed ...
17 | Active mask hierarchies for object detection
- Chen, Zhu, et al.
- 2010
Citation context: ...tecture and the visual system [36], [37]. It has been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with deep structure ...
16 | Mining motion atoms and phrases for complex action recognition
- Wang, Qiao, et al.
- 2013
Citation context: ... to a relatively simple sub-activity, and there is a temporal order among these phases. The importance of temporal structure in activity classification has been demonstrated in previous works [9]–[15]. However, effectively modeling temporal structure remains challenging because of two problems. The first problem is that a “sub-activity” usually has no precise definition given a compl...
13 | Max Margin AND/OR Graph learning for parsing the human body
- Zhu, Chen, et al.
- 2008
Citation context: ... [36], [37]. It has been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with deep structure for object detection. In ...
12 | Script data for attribute-based recognition of composite activities
- Rohrbach, Regneri, et al.
- 2012
11 | Deep hierarchies in the primate visual cortex: What can we learn for computer vision?
- Krüger, Janssen, et al.
- 2013
Citation context: ...cture plays an important role in improving recognition performance. Hierarchical Model. Hierarchical tree-structured models are biologically inspired by the architecture of the brain and the visual system [36], [37]. They have been widely used in computer vision and have achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image...
10 | Concept-based video retrieval
- Snoek, Worring
- 2009
Citation context: ...mine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from many aspects. First, there are always large intra-class appearance and motion variations within ...
9 | A stochastic grammar of images
- Zhu, Mumford
- 2006
Citation context: ...been widely used in computer vision and has achieved success on various tasks, such as learning feature hierarchies [38], [39], object detection [16], [40], [41], human body parsing [42], image parsing [43], and video understanding [44]. Our model is partially inspired by the work of [40], in which Zhu et al. developed a hierarchical model with deep structure for object detection. In their method, an obj...
8 | Exploring motion boundary based sampling and spatial-temporal context descriptors for action recognition
- Peng, Qiao, et al.
- 2013
Citation context: ...uch as Histogram of Oriented Gradients and Histogram of Optical Flow (HOG/HOF) [23], Motion Boundary Histogram (MBH) [22], 3D Histogram of Oriented Gradients (HOG3D) [24], Extended SURF (ESURF) [20], the co-occurrence descriptor [25], and so on. Finally, a global representation is obtained for each video clip via a statistical model. Among these statistical models, Bag of Visual Words (BoVW) is a common choice in action recogniti...
5 | Trajectory classification using switched dynamical hidden Markov models
- Nascimento, Figueiredo, et al.
- 2010
Citation context: ...ssification is an important yet difficult problem in computer vision [1]–[3], whose aim is to determine what people are doing in an observed video. It has wide applications in video surveillance [4], [5], human-computer interfaces [6], sports video analysis [7], and content-based video retrieval [8]. The challenges of activity classification come from many aspects. First, there are always lar...