M.J.: Towards understanding action recognition. In: ICCV (2013)
Abstract - Cited by 21 (5 self)
Although action recognition in videos is widely studied, current methods often fail on real-world datasets. Many recent approaches improve accuracy and robustness to cope with challenging video sequences, but it is often unclear what affects the results most. This paper attempts to provide insights based on a systematic performance evaluation using thoroughly-annotated data of human actions. We annotate human joints for the HMDB dataset (J-HMDB). This annotation can be used to derive ground truth optical flow and segmentation. We evaluate current methods using this dataset and systematically replace the output of various algorithms with ground truth. This enables us to discover what is important: for example, should we work on improving flow algorithms, estimating human bounding boxes, or enabling pose estimation? In summary, we find that high-level pose features greatly outperform low/mid level features; in particular, pose over time is critical. While current pose estimation algorithms are far from perfect, features extracted from estimated pose on a subset of J-HMDB, in which the full body is visible, outperform low/mid-level features. We also find that the accuracy of the action recognition framework can be greatly increased by refining the underlying low/mid level features; this suggests it is important to improve optical flow and human detection algorithms. Our analysis and J-HMDB dataset should facilitate a deeper understanding of action recognition algorithms.
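The ground-truth substitution experiment this abstract describes can be sketched as a tiny ablation harness. This is a hypothetical illustration only: the module names and the pipeline interface are invented for the sketch, not the paper's code.

```python
# Hypothetical sketch of the ablation idea: run a recognition pipeline,
# but swap one module's output for its ground-truth annotation to see
# how much that module's errors cost the overall result.
def run_pipeline(video, modules, ground_truth, replace=None):
    """modules: dict name -> function(video); ground_truth: dict name -> value.
    If `replace` names a module, its ground-truth value is used instead."""
    outputs = {}
    for name, fn in modules.items():
        outputs[name] = ground_truth[name] if name == replace else fn(video)
    return outputs

# toy usage with stand-in modules
modules = {"flow": lambda v: "estimated_flow", "pose": lambda v: "estimated_pose"}
gt = {"flow": "gt_flow", "pose": "gt_pose"}
ablated = run_pipeline("video.avi", modules, gt, replace="flow")
print(ablated)  # {'flow': 'gt_flow', 'pose': 'estimated_pose'}
```

Comparing the downstream accuracy with and without each substitution then attributes errors to individual components, as the paper does for flow, bounding boxes, and pose.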
A survey on human motion analysis from depth data. In: Time-of-Flight and Depth Imaging: Sensors, Algorithms, and Applications, 2013
Abstract - Cited by 9 (2 self)
Human pose estimation has been actively studied for decades. While traditional approaches rely on 2d data like images or videos, the development of Time-of-Flight cameras and other depth sensors created new opportunities to advance the field. We give an overview of recent approaches that perform human motion analysis, which includes depth-based and skeleton-based activity recognition, head pose estimation, facial feature detection, facial performance capture, hand pose estimation and hand gesture recognition. While the focus is on approaches using depth data, we also discuss traditional image-based methods to provide a broad overview of recent developments in these areas.
W.: Unified face analysis by iterative multi-output random forests. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014
Abstract - Cited by 5 (3 self)
In this paper, we present a unified method for joint face image analysis, i.e., simultaneously estimating head pose, facial expression and landmark positions in real-world face images. To achieve this goal, we propose a novel iterative Multi-Output Random Forests (iMORF) algorithm, which explicitly models the relations among multiple tasks and iteratively exploits such relations to boost the performance of all tasks. Specifically, a hierarchical face analysis forest is learned to perform classification of pose and expression at the top level, while performing landmark position regression at the bottom level. On one hand, the estimated pose and expression provide a strong shape prior to constrain the variation of landmark positions. On the other hand, more discriminative shape-related features can be extracted from the estimated landmark positions to further improve the predictions of pose and expression. This relatedness of face analysis tasks is iteratively exploited through several cascaded hierarchical face analysis forests until convergence. Experiments conducted on publicly available real-world face datasets demonstrate that the performance of all individual tasks is significantly improved by the proposed iMORF algorithm. In addition, our method outperforms the state of the art for all three face analysis tasks.
Leveraging Hierarchical Parametric Networks for Skeletal Joints Based Action Segmentation and Recognition
Abstract - Cited by 4 (1 self)
Over the last few years, with the immense popularity of the Kinect, there has been renewed interest in developing methods for human gesture and action recognition from 3D skeletal data. A number of approaches have been proposed to extract representative features from 3D skeletal data, most commonly hard-wired geometric or bio-inspired shape context features. We propose a hierarchical dynamic framework that first extracts high-level skeletal joint features and then uses the learned representation for estimating emission probabilities to infer action sequences. Currently, Gaussian mixture models are the dominant technique for modeling the emission distribution of hidden Markov models. We show that better action recognition using skeletal features can be achieved by replacing Gaussian mixture models with deep neural networks that contain many layers of features to predict probability distributions over states of hidden Markov models. The framework can be easily extended to include an ergodic state to segment and recognize actions simultaneously.
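The hybrid DNN/HMM idea in this abstract can be sketched in a few lines of numpy: a network produces per-frame state posteriors, which are converted into scaled emission likelihoods by dividing by the state priors, then decoded with Viterbi. This is a generic illustration of the hybrid trick under invented dimensions, with a random linear layer standing in for a trained deep network; it is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_feats, T = 4, 10, 25

# Stand-in for a trained deep network: a single random linear layer
# followed by softmax, producing state posteriors p(state | frame).
W = rng.normal(size=(n_feats, n_states))
def posteriors(frames):
    logits = frames @ W
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

frames = rng.normal(size=(T, n_feats))
post = posteriors(frames)                       # (T, n_states)

# Hybrid trick: scaled likelihood p(frame | state) is proportional to
# p(state | frame) / p(state), so divide posteriors by state priors.
priors = np.full(n_states, 1.0 / n_states)
log_emission = np.log(post) - np.log(priors)

# Viterbi decoding with a uniform transition matrix for the sketch
logA = np.log(np.full((n_states, n_states), 1.0 / n_states))
delta = log_emission[0].copy()
back = np.zeros((T, n_states), dtype=int)
for t in range(1, T):
    scores = delta[:, None] + logA              # best score into each state
    back[t] = scores.argmax(axis=0)
    delta = scores.max(axis=0) + log_emission[t]
path = [int(delta.argmax())]
for t in range(T - 1, 0, -1):                   # backtrack the best path
    path.append(int(back[t, path[-1]]))
path.reverse()
print(len(path))  # 25
```

In the paper's setting the network would be trained on frame/state alignments and the transitions estimated from data; here both are placeholders to keep the decoding logic visible.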
Efficient Pose-based Action Recognition
Abstract - Cited by 2 (0 self)
Action recognition from 3d pose data has gained increasing attention since the data is readily available for depth or RGB-D videos. The most successful approaches so far perform an expensive feature selection or mining approach for training. In this work, we introduce an algorithm that is very efficient for training and testing. The main idea is that rich structured data like 3d pose does not require sophisticated feature modeling or learning. Instead, we reduce pose data over time to histograms of relative location, velocity, and their correlations, and use partial least squares to learn a compact and discriminative representation from it. Despite its efficiency, our approach achieves state-of-the-art accuracy on four different benchmarks. We further investigate differences between 2d and 3d pose data for action recognition.
Joint action recognition and pose estimation from video. In: Proc. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015
Abstract - Cited by 1 (0 self)
Action recognition and pose estimation from video are closely related tasks for understanding human motion; most methods, however, learn separate models and combine them sequentially. In this paper, we propose a framework to integrate training and testing of the two tasks. A spatial-temporal And-Or graph model is introduced to represent action at three scales. Specifically, the action is decomposed into poses, which are further divided into mid-level ST-parts and then parts. The hierarchical structure of our model captures the geometric and appearance variations of pose at each frame, and lateral connections between ST-parts at adjacent frames capture the action-specific motion information. The model parameters for the three scales are learned discriminatively, and action labels and poses are efficiently inferred by dynamic programming. Experiments demonstrate that our approach achieves state-of-the-art accuracy in action recognition while also improving pose estimation.
Spatio-temporal Matching for Human Detection in Video
Abstract - Cited by 1 (0 self)
Detecting and tracking humans in videos have been long-standing problems in computer vision. Most successful approaches (e.g., deformable parts models) rely heavily on discriminative models to build appearance detectors for body joints and generative models to constrain possible body configurations (e.g., trees). While these 2D models have been successfully applied to images (and with less success to videos), a major challenge is to generalize these models to cope with changes in camera view. In order to achieve view-invariance, these 2D models typically require a large amount of training data across views that is difficult to gather and time-consuming to label. Unlike existing 2D models, this paper formulates the problem of human detection in videos as spatio-temporal matching (STM) between a 3D motion capture model and trajectories in videos. Our algorithm estimates the camera view and selects a subset of tracked trajectories that matches the motion of the 3D model. The STM is efficiently solved with linear programming, and it is robust to tracking mismatches, occlusions and outliers. To the best of our knowledge, this is the first paper that solves the correspondence between video and 3D motion capture data for human pose detection. Experiments on the Human3.6M and Berkeley MHAD databases illustrate the benefits of our method over state-of-the-art approaches.
Cross-view action modeling, learning and recognition, 2014
Abstract - Cited by 1 (0 self)
Existing methods on video-based action recognition are generally view-dependent, i.e., performing ...
Spatial, Temporal and Spatio-Temporal Correspondence for Computer Vision Problems, 2014
Abstract
Many computer vision problems, such as object classification, motion estimation or shape registration, rely on solving the correspondence problem. Existing algorithms to solve spatial or temporal correspondence problems are usually NP-hard, difficult to approximate, and lack flexible models and mechanisms for feature weighting. This proposal addresses the correspondence problem in computer vision, and proposes two new spatio-temporal correspondence problems and three algorithms to solve spatial, temporal and spatio-temporal matching between video and other sources. The main contributions of the thesis are: (1) Factorized graph matching (FGM). FGM extends existing work on graph matching (GM) by finding an exact factorization of the affinity matrix. Four benefits follow from this factorization: (a) there is no need to compute the costly (in space and time) pairwise affinity matrix; (b) it provides a unified framework that reveals commonalities and differences between GM methods. Moreover, the factorization provides a clean connection with other matching algorithms such as ...
Affective Scenes Influence Fear Perception of Individual Body Expressions, 2012
Abstract
In natural viewing conditions, different stimulus categories such as people, objects, and natural scenes carry relevant affective information that is usually processed simultaneously. But these different signals may not always have the same affective meaning. Using body-scene compound stimuli, we investigated how the brain processes fearful signals conveyed by either a body in the foreground or scenes in the background and the interaction between foreground body and background scene. The results showed that left and right extrastriate body areas (EBA) responded more to fearful than to neutral bodies. More interestingly, a threatening background scene compared to a neutral one showed increased activity in bilateral EBA and right-posterior parahippocampal place area (PPA) and decreased activity in right retrosplenial cortex (RSC) and left-anterior PPA. The emotional scene effect in EBA was only present when the foreground body was neutral and not when the body posture expressed fear (significant emotion-by-category interaction effect), consistent with behavioral ratings. The results provide evidence for emotional influence of the background scene on the processing of body expressions.