Results 1 - 10 of 214
Action Recognition with Improved Trajectories
In IEEE International Conference on Computer Vision (ICCV), 2013
"... HAL is a multi-disciplinary open access archive for the deposit and dissemination of sci-entific research documents, whether they are pub-lished or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L’archive ouverte p ..."
Cited by 109 (11 self)
Learning spatiotemporal graphs of human activities
In ICCV, 2011
"... Complex human activities occurring in videos can be defined in terms of temporal configurations of primitive actions. Prior work typically hand-picks the primitives, their total number, and temporal relations (e.g., allow only followed-by), and then only estimates their relative significance for act ..."
Cited by 64 (0 self)
Complex human activities occurring in videos can be defined in terms of temporal configurations of primitive actions. Prior work typically hand-picks the primitives, their total number, and temporal relations (e.g., allow only followed-by), and then only estimates their relative significance for activity recognition. We advance prior work by learning what activity parts and their spatiotemporal relations should be captured to represent the activity, and how relevant they are for enabling efficient inference in realistic videos. We represent videos by spatiotemporal graphs, where nodes correspond to multiscale video segments, and edges capture their hierarchical, temporal, and spatial relationships. Access to video segments is provided by our new multiscale segmenter. Given a set of training spatiotemporal graphs, we learn their archetype graph, and pdfs associated with model nodes and edges. The model adaptively learns from data the relevant video segments and their relations, addressing the “what” and the “how.” Inference and learning are formulated within the same framework, that of a robust least-squares optimization, which is invariant to arbitrary permutations of nodes in spatiotemporal graphs. The model is used for parsing new videos in terms of detecting and localizing relevant activity parts. We outperform the state of the art on the benchmark Olympic and UT human-interaction datasets, under a favorable complexity-vs.-accuracy trade-off.
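To make the representation concrete, here is a minimal Python sketch of a spatiotemporal graph over multiscale video segments with the three edge types named in the abstract. The class and field names are my own illustration, not the authors' code; the archetype-graph learning and least-squares inference are omitted.

```python
# Illustrative sketch (not the authors' code): a minimal spatiotemporal
# graph over multiscale video segments, with typed edges as described above.
from dataclasses import dataclass, field

@dataclass
class Segment:
    """A multiscale video segment (graph node)."""
    node_id: int
    scale: int                  # level in the segmentation hierarchy
    t_start: int                # first frame covered by the segment
    t_end: int                  # last frame covered by the segment
    descriptor: list = field(default_factory=list)  # appearance/motion features

@dataclass
class SpatioTemporalGraph:
    nodes: dict = field(default_factory=dict)       # node_id -> Segment
    edges: list = field(default_factory=list)       # (src, dst, relation)

    def add_segment(self, seg: Segment) -> None:
        self.nodes[seg.node_id] = seg

    def relate(self, src: int, dst: int, relation: str) -> None:
        # relation is one of the three edge types from the abstract
        assert relation in {"hierarchical", "temporal", "spatial"}
        self.edges.append((src, dst, relation))
```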
Learning human activities and object affordances from RGB-D videos
In IJRR, 2013 (arXiv:1210.1207v2)
"... such as making cereal and arranging objects in a room (see Fig. 9). For example, the making cereal activity consists of around 12 sub-activities on average, which includes reaching the pitcher, moving the pitcher to the bowl, and then pouring the milk into the bowl. This proves to be a very challeng ..."
Cited by 59 (16 self)
such as making cereal and arranging objects in a room (see Fig. 9). For example, the making-cereal activity consists of around 12 sub-activities on average, which include reaching the pitcher, moving the pitcher to the bowl, and then pouring the milk into the bowl. This proves to be a very challenging task given the variability across individuals in performing each sub-activity, and other environment-induced conditions such as cluttered backgrounds and viewpoint changes. (See Fig. 2 for some examples.) In most previous works, object detection and activity recognition have been addressed as separate tasks. Only recently have some works shown that modeling mutual context is beneficial (Gupta et al., 2009; Yao and Fei-Fei, 2010). The key idea in our work is to note that, in activity detection, it is sometimes more informative to know how an object is being used (its associated affordances; Gibson, 1979) rather than what the object is (i.e., the object category). For example, both a chair and a sofa might be categorized as ‘sittable,’ and a cup might be categorized as both ‘drinkable’ and ‘pourable.’ Note that the affordances of an object change over time depending on its use; e.g., a pitcher may first be reachable, then movable and finally pourable. In addition to helping activity recognition, recognizing object affordances is important in itself because of its use in robotic applications (e.g., Kormushev et al., 2010; Jiang et al., 2012a; Jiang and Saxena, 2012). We propose a method to learn human activities by mod…
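As a toy illustration of the pitcher example above (reachable, then movable, then pourable), one might store an object's time-varying affordances as labeled temporal segments. The names and frame numbers below are hypothetical, not from the paper:

```python
# Illustrative sketch (hypothetical names and frame numbers): an object's
# affordances as labels that change across temporally segmented sub-activities.
from dataclasses import dataclass

@dataclass
class AffordanceSegment:
    t_start: int
    t_end: int
    affordance: str            # e.g. "reachable", "movable", "pourable"

pitcher_track = [
    AffordanceSegment(0, 40, "reachable"),   # person reaches for the pitcher
    AffordanceSegment(40, 90, "movable"),    # pitcher is carried to the bowl
    AffordanceSegment(90, 140, "pourable"),  # milk is poured into the bowl
]

def affordance_at(track: list, t: int) -> str:
    """Look up the active affordance label at frame t."""
    for seg in track:
        if seg.t_start <= t < seg.t_end:
            return seg.affordance
    return "stationary"        # default when no sub-activity is active
```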
View invariant human action recognition using histograms of 3D joints
In Proc. of Workshop on Human Activity Understanding from 3D Data, 2012
"... In this paper, we present a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures. We extract the 3D skeletal joint locations from Kinect depth maps using Shotton et al.’s method [6]. The HOJ3D computed from the action depth ..."
Cited by 58 (3 self)
In this paper, we present a novel approach for human action recognition with histograms of 3D joint locations (HOJ3D) as a compact representation of postures. We extract the 3D skeletal joint locations from Kinect depth maps using Shotton et al.'s method [6]. The HOJ3D descriptors computed from the action depth sequences are reprojected using LDA and then clustered into k posture visual words, which represent the prototypical poses of actions. The temporal evolutions of those visual words are modeled by discrete hidden Markov models (HMMs). In addition, due to the design of our spherical coordinate system and the robust 3D skeleton estimation from Kinect, our method demonstrates significant view invariance on our 3D action dataset. Our dataset is composed of 200 3D sequences of 10 indoor activities performed by 10 individuals in varied views. Our method runs in real time and achieves superior results on this challenging 3D action dataset. We also tested our algorithm on the MSR Action3D dataset, where it outperforms Li et al. [25] in most cases.
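A minimal sketch of the core binning idea, under assumptions of my own (a hip-centered spherical binning; the paper's skeleton-aligned coordinate frame for view invariance, the LDA reprojection, and the HMM modeling are omitted):

```python
# Illustrative sketch (assumed details, not the paper's implementation):
# bin 3D joint positions, expressed in spherical coordinates about a
# reference joint, into a fixed histogram for a HOJ3D-style posture vector.
import numpy as np

def hoj3d(joints: np.ndarray, n_azimuth: int = 12, n_inclination: int = 6) -> np.ndarray:
    """joints: (J, 3) joint positions; row 0 is assumed to be the hip center."""
    rel = joints - joints[0]                    # center on the reference joint
    rel = rel[1:]                               # drop the reference joint itself
    azimuth = np.arctan2(rel[:, 1], rel[:, 0])  # angle in the xy-plane, [-pi, pi]
    r = np.linalg.norm(rel, axis=1) + 1e-9
    inclination = np.arccos(np.clip(rel[:, 2] / r, -1.0, 1.0))  # [0, pi]
    hist, _, _ = np.histogram2d(
        azimuth, inclination,
        bins=[n_azimuth, n_inclination],
        range=[[-np.pi, np.pi], [0.0, np.pi]],
    )
    h = hist.ravel().astype(float)
    return h / max(h.sum(), 1.0)                # L1-normalize the posture histogram
```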
A database for fine grained activity detection of cooking activities
In CVPR, 2012
"... While activity recognition is a current focus of research the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body mot ..."
Cited by 40 (5 self)
While activity recognition is a current focus of research, the challenging problem of fine-grained activity recognition is largely overlooked. We thus propose a novel database of 65 cooking activities, continuously recorded in a realistic setting. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients. We benchmark two approaches on our dataset, one based on articulated pose tracks and the second using holistic video features. While the holistic approach outperforms the pose-based approach, our evaluation suggests that fine-grained activities are more difficult to detect and that the body model can help in those cases. By providing high-resolution videos as well as an intermediate pose representation, we hope to foster research in fine-grained activity recognition.
Human activity prediction: Early recognition of ongoing activities from streaming videos
In IEEE International Conference on Computer Vision, 2011
"... In this paper, we present a novel approach of human activity prediction. Human activity prediction is a proba-bilistic process of inferring ongoing activities from videos only containing onsets (i.e. the beginning part) of the activ-ities. The goal is to enable early recognition of unfinished activi ..."
Cited by 37 (2 self)
In this paper, we present a novel approach to human activity prediction. Human activity prediction is a probabilistic process of inferring ongoing activities from videos containing only onsets (i.e., the beginning part) of the activities. The goal is to enable early recognition of unfinished activities, as opposed to the after-the-fact classification of completed activities. Activity prediction methodologies are particularly necessary for surveillance systems that are required to prevent crimes and dangerous activities from occurring. We probabilistically formulate the activity prediction problem and introduce new methodologies designed for the prediction. We represent an activity as an integral histogram of spatio-temporal features, efficiently modeling how feature distributions change over time. We develop a new recognition methodology named dynamic bag-of-words, which considers the sequential nature of human activities while maintaining the advantages of the bag-of-words to handle noisy observations. Our experiments confirm that our approach reliably recognizes ongoing activities from streaming videos with high accuracy.
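The integral histogram mentioned above is a cumulative sum of per-frame feature histograms, so the feature distribution of any temporal prefix or interval can be read off with one subtraction. A minimal sketch, with assumed array shapes:

```python
# Illustrative sketch (assumed shapes, not the paper's code): an "integral
# histogram" over time, i.e. a cumulative sum of per-frame feature histograms,
# so the histogram of any temporal interval [t0, t1) is one subtraction.
import numpy as np

def integral_histogram(frame_hists: np.ndarray) -> np.ndarray:
    """frame_hists: (T, K) histogram of K visual words for each of T frames."""
    cum = np.cumsum(frame_hists, axis=0)
    return np.vstack([np.zeros((1, frame_hists.shape[1])), cum])  # (T+1, K)

def interval_histogram(integral: np.ndarray, t0: int, t1: int) -> np.ndarray:
    """Histogram of features observed in frames t0..t1-1, in O(K) time."""
    return integral[t1] - integral[t0]
```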
A comparative study of encoding, pooling and normalization methods for action recognition
In ACCV, 2012
"... Abstract. Bag of visual words (BoVW) models have been widely and successfully used in video based action recognition. One key step in con-structing BoVW representation is to encode feature with a codebook. Re-cently, a number of new encodingmethods have been developed to improve the performance of B ..."
Cited by 27 (14 self)
Bag-of-visual-words (BoVW) models have been widely and successfully used in video-based action recognition. One key step in constructing a BoVW representation is to encode features with a codebook. Recently, a number of new encoding methods have been developed to improve the performance of BoVW-based object recognition and scene classification, such as soft-assignment encoding [1], sparse encoding [2], locality-constrained linear encoding [3] and Fisher kernel encoding [4]. However, their effects on action recognition are still unknown. The main objective of this paper is to evaluate and compare these new encoding methods in the context of video-based action recognition. We also analyze and evaluate the combination of encoding methods with different pooling and normalization strategies. We carry out experiments on the KTH dataset [5] and the HMDB51 dataset [6]. The results show that the new encoding methods can significantly improve the recognition accuracy compared with classical VQ. Among them, Fisher kernel encoding and sparse encoding have the best performance. By properly choosing pooling and normalization methods, we achieve state-of-the-art performance on HMDB51.
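As a toy illustration of the contrast being evaluated, the sketch below encodes local descriptors against a codebook with either hard vector quantization (classical VQ) or Gaussian soft assignment, then applies sum pooling and L2 normalization. It is my own simplification, not the paper's evaluation code; the sparse, locality-constrained, and Fisher kernel encoders are omitted.

```python
# Illustrative sketch (my own toy example): hard VQ vs. soft-assignment
# encoding of local descriptors against a codebook, followed by sum
# pooling and L2 normalization into a single video-level vector.
import numpy as np

def encode(descriptors: np.ndarray, codebook: np.ndarray,
           soft: bool = False, beta: float = 1.0) -> np.ndarray:
    """descriptors: (N, D); codebook: (K, D). Returns a K-dim video vector."""
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    if soft:
        w = np.exp(-beta * d2)
        assign = w / w.sum(axis=1, keepdims=True)     # soft assignment weights
    else:
        assign = np.zeros_like(d2)
        assign[np.arange(len(d2)), d2.argmin(axis=1)] = 1.0  # hard VQ
    pooled = assign.sum(axis=0)                        # sum pooling over features
    return pooled / (np.linalg.norm(pooled) + 1e-9)    # L2 normalization
```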
Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera
In CVPR, 2013
"... Local spatio-temporal interest points (STIPs) and the re-sulting features from RGB videos have been proven success-ful at activity recognition that can handle cluttered back-grounds and partial occlusions. In this paper, we propose its counterpart in depth video and show its efficacy on ac-tivity re ..."
Cited by 27 (0 self)
Local spatio-temporal interest points (STIPs) and the resulting features from RGB videos have proven successful for activity recognition that can handle cluttered backgrounds and partial occlusions. In this paper, we propose their counterpart in depth videos and show its efficacy for activity recognition. We present a filtering method to extract STIPs from depth videos (called DSTIPs) that effectively suppresses noisy measurements. Further, we build a novel depth cuboid similarity feature (DCSF) to describe the local 3D depth cuboid around each DSTIP with an adaptable support size. We test this feature on the activity recognition task using the public MSRAction3D and MSRDailyActivity3D datasets and our own dataset. Experimental evaluation shows that the proposed approach outperforms state-of-the-art activity recognition algorithms on depth videos, and the framework is more widely applicable than existing approaches. We also give detailed comparisons with other features and analyze the choice of parameters as guidance for applications.
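For orientation, extracting the raw spatio-temporal depth cuboid around an interest point might look like the sketch below; the fixed support size is an assumption of mine, whereas the paper's DCSF uses an adaptable support size and a similarity-based descriptor computed on top of the cuboid.

```python
# Illustrative sketch (simplified, assumed fixed support size): crop a
# spatio-temporal depth cuboid around an interest point (x, y, t), the raw
# material on which a DCSF-style local descriptor would be computed.
import numpy as np

def extract_cuboid(depth_video: np.ndarray, x: int, y: int, t: int,
                   half: int = 8, half_t: int = 4) -> np.ndarray:
    """depth_video: (T, H, W) depth frames; returns the cuboid around (x, y, t)."""
    T, H, W = depth_video.shape
    t0, t1 = max(t - half_t, 0), min(t + half_t, T)   # clamp to video bounds
    y0, y1 = max(y - half, 0), min(y + half, H)
    x0, x1 = max(x - half, 0), min(x + half, W)
    return depth_video[t0:t1, y0:y1, x0:x1]
```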
Discriminative virtual views for cross-view action recognition
In CVPR, 2012
"... Abstract We propose an approach for cross-view action recognition by way of 'virtual views' that connect the action descriptors extracted from one (source) view to those extracted from another (target) view. Each virtual view is associated with a linear transformation of the action descri ..."
Cited by 26 (0 self)
We propose an approach for cross-view action recognition by way of 'virtual views' that connect the action descriptors extracted from one (source) view to those extracted from another (target) view. Each virtual view is associated with a linear transformation of the action descriptor, and the sequence of transformations arising from the sequence of virtual views aims at bridging the source and target views while preserving discrimination among action categories. Our approach is capable of operating without access to labeled action samples in the target view and without access to corresponding action instances in the two views, yet it also naturally incorporates and exploits corresponding instances or partial labeling in the target view when they are available. The proposed approach achieves improved or competitive performance relative to existing methods when instance correspondences or target labels are available, and it goes beyond the capabilities of these methods by providing some level of discrimination even when neither correspondences nor target labels exist.
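The sketch below illustrates only the geometric idea of a sequence of virtual views as linear transformations that interpolate between a source-view and a target-view projection of a descriptor; the linear interpolation scheme and the function names are my own assumptions, whereas the paper learns the transformations discriminatively.

```python
# Illustrative sketch (not the authors' formulation): a path of "virtual
# views" as linear transformations interpolating between a source-view and
# a target-view projection, and a descriptor augmented across the path.
import numpy as np

def virtual_view_path(A_source: np.ndarray, A_target: np.ndarray,
                      n_views: int = 5) -> list:
    """Linearly interpolate between two (d, D) descriptor transformations."""
    alphas = np.linspace(0.0, 1.0, n_views)
    return [(1 - a) * A_source + a * A_target for a in alphas]

def augmented_descriptor(x: np.ndarray, path: list) -> np.ndarray:
    """Concatenate the D-dim descriptor x as seen from every virtual view."""
    return np.concatenate([A @ x for A in path])
```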
Sequence of the Most Informative Joints (SMIJ): A New Representation for Human Skeletal Action Recognition
"... Much of the existing work on action recognition combines simple features (e.g., joint angle trajectories, optical flow, spatio-temporal video features) with somewhat complex classifiers or dynamical models (e.g., kernel SVMs, HMMs, LDSs, deep belief networks). Although successful, these approaches r ..."
Cited by 24 (2 self)
Much of the existing work on action recognition combines simple features (e.g., joint angle trajectories, optical flow, spatio-temporal video features) with somewhat complex classifiers or dynamical models (e.g., kernel SVMs, HMMs, LDSs, deep belief networks). Although successful, these approaches represent an action with a set of parameters that usually do not have any physical meaning. As a consequence, such approaches do not provide any qualitative insight that relates an action to the actual motion of the body or its parts. For example, it is not necessarily the case that clapping can be correlated to hand motion or that walking can be correlated to a specific combination of motions from the feet, arms and body. In this paper, we propose a new representation of human actions called Sequence of the Most Informative Joints (SMIJ), which is extremely easy to interpret. At each time instant, we automatically select a few skeletal joints that are deemed to be the most informative for performing the current action. The selection of joints is based on highly interpretable measures, such as the mean or variance of joint angles, the maximum angular velocity of joints, etc. We then represent an action as a sequence of these most informative joints. Our experiments on multiple databases show that the proposed representation is very discriminative for the task of human action recognition and performs better than several state-of-the-art algorithms.
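A minimal sketch of an SMIJ-style encoding under assumed details (windowed variance of joint angles as the informativeness measure, with my own window and k parameters):

```python
# Illustrative sketch (assumed details): pick the k most informative joints
# per temporal window, here by the variance of their joint angles, and
# represent the action as the resulting sequence of joint-index tuples.
import numpy as np

def smij(joint_angles: np.ndarray, window: int = 15, k: int = 3) -> list:
    """joint_angles: (T, J) joint-angle time series. Returns one k-tuple of
    joint indices per non-overlapping window of `window` frames."""
    sequence = []
    for t0 in range(0, joint_angles.shape[0] - window + 1, window):
        var = joint_angles[t0:t0 + window].var(axis=0)   # per-joint variance
        top = tuple(np.argsort(var)[::-1][:k])           # k most informative joints
        sequence.append(top)
    return sequence
```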