Results 11 - 20 of 57
Talking heads: Detecting humans and recognizing their interactions
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014
"... The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard ar-rangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contribution ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context-aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration; second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin learning framework; and third, we show that the configuration model substantially outperforms a Deformable Part Model (DPM) for predicting upper body locations in video frames, even when the DPM is equipped with the context of other upper bodies. Experiments are performed over two datasets: the TV Human Interaction dataset, and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.
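The abstract states that inference over the configuration of upper bodies is solved globally with dynamic programming. As a rough illustration only, here is a minimal Viterbi-style sketch over an assumed left-to-right ordering of people; the candidate boxes, unary scores and pairwise compatibility function are placeholders, not the paper's learned model:

```python
import numpy as np

def best_configuration(candidates, unary, pairwise):
    """Viterbi-style DP over an ordered set of 'slots' (people left to right).

    candidates[i]   : list of candidate upper-body boxes for slot i
    unary[i][a]     : detector score of candidate a in slot i
    pairwise(b1, b2): compatibility of adjacent boxes (e.g. similar scale,
                      plausible horizontal spacing) -- an assumed stand-in
                      for the paper's learned context term.
    Returns the globally best choice of one candidate per slot.
    """
    n = len(candidates)
    # score[i][a] = best total score of slots 0..i ending with candidate a
    score = [np.array(unary[0], dtype=float)]
    back = []
    for i in range(1, n):
        prev = score[-1]
        cur = np.empty(len(candidates[i]))
        arg = np.empty(len(candidates[i]), dtype=int)
        for a, box in enumerate(candidates[i]):
            trans = [prev[b] + pairwise(candidates[i - 1][b], box)
                     for b in range(len(candidates[i - 1]))]
            arg[a] = int(np.argmax(trans))
            cur[a] = trans[arg[a]] + unary[i][a]
        score.append(cur)
        back.append(arg)
    # backtrack the globally optimal path
    path = [int(np.argmax(score[-1]))]
    for arg in reversed(back):
        path.append(int(arg[path[-1]]))
    return list(reversed(path))
```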
Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition
University of Bonn
"... Abstract—Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with v ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
Abstract—Systems based on bag-of-words models, built from image features collected at maxima of sparse interest point operators, have been used successfully for both visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in ‘saccade and fixate’ regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of visual action and scene context recognition tasks. To our knowledge these are the first large human eye tracking datasets collected and made publicly available for video (vision.imar.ro/eyetracking, 497,107 frames, each viewed by 19 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control as well as free-viewing. Second, we introduce novel dynamic consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, and their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted and, when used in an end-to-end automatic system leveraging some of the advanced computer vision practice, can lead to state-of-the-art results. Index Terms—visual action recognition, human eye movements, consistency analysis, saliency prediction, large scale learning
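As a loose illustration of the kind of inter-subject consistency analysis the abstract describes, the sketch below estimates how well one subject's fixations in a frame are predicted by a blurred fixation map built from the remaining subjects; the Gaussian blur, random-location baseline and all parameter values are assumptions, not the paper's actual measures:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def leave_one_out_consistency(fixations_per_subject, frame_shape, sigma=30.0,
                              n_random=1000, rng=None):
    """Crude inter-subject fixation consistency score for one video frame.

    fixations_per_subject: list of (row, col) fixation arrays, one per subject.
    For each held-out subject, a blurred fixation density map is built from
    the other subjects and the map value at the held-out fixations is compared
    against values at random locations (a rough AUC-like estimate).
    """
    rng = rng or np.random.default_rng(0)
    scores = []
    for held_out in range(len(fixations_per_subject)):
        density = np.zeros(frame_shape, dtype=float)
        for s, fixs in enumerate(fixations_per_subject):
            if s == held_out:
                continue
            for r, c in fixs:
                density[int(r), int(c)] += 1.0
        density = gaussian_filter(density, sigma)
        test = np.array([density[int(r), int(c)]
                         for r, c in fixations_per_subject[held_out]])
        rand = density[rng.integers(0, frame_shape[0], n_random),
                       rng.integers(0, frame_shape[1], n_random)]
        # fraction of (fixation, random) pairs where the fixation scores higher
        scores.append(float((test[:, None] > rand[None, :]).mean()))
    return float(np.mean(scores))
```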
Learning semantic relationships for better action retrieval in images
"... Human actions capture a wide variety of interactions between people and objects. As a result, the set of possi-ble actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by lever-aging the r ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
(Show Context)
Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationship between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We need a method which can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationship between actions and uses them for training better action retrieval models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions. We show a significant improvement in mean AP compared to different baseline methods, including the HEX-graph approach from Deng et al. [8].
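To make the notion of relationship cues concrete, here is a hedged toy example of a logical-consistency penalty over per-action scores, assuming implication and exclusion relations are given as pairs; the paper's joint relationship-learning network is considerably richer than this:

```python
def consistency_penalty(scores, implies, excludes, margin=0.0):
    """Toy logical-consistency penalty on per-action scores (sigmoid outputs).

    scores   : dict mapping action name -> predicted probability for one image
    implies  : list of (a, b) pairs meaning "a implies b"
               (e.g. a composite action implies its component action)
    excludes : list of (a, b) pairs of mutually exclusive actions
    Penalizes score(a) > score(b) for implications and score(a) + score(b) > 1
    for exclusions; only illustrates the kind of constraints being encoded.
    """
    penalty = 0.0
    for a, b in implies:
        penalty += max(0.0, scores[a] - scores[b] + margin)
    for a, b in excludes:
        penalty += max(0.0, scores[a] + scores[b] - 1.0)
    return penalty
```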
Scale Coding Bag-of-Words for Action Recognition
"... Abstract—Recognizing human actions in still images is a challenging problem in computer vision due to significant amount of scale, illumination and pose variation. Given the bounding box of a person both at training and test time, the task is to classify the action associated with each bounding box ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Abstract—Recognizing human actions in still images is a challenging problem in computer vision due to the significant amount of scale, illumination and pose variation. Given the bounding box of a person both at training and test time, the task is to classify the action associated with each bounding box in an image. Most state-of-the-art methods use the bag-of-words paradigm for action recognition. The bag-of-words framework employing a dense multi-scale grid sampling strategy is the de facto standard for feature detection. This results in a scale-invariant image representation where all the features at multiple scales are binned in a single histogram. We argue that such a scale-invariant strategy is sub-optimal since it ignores the multi-scale information available with each bounding box of a person. This paper investigates alternative approaches to scale coding for action recognition in still images. We encode multi-scale information explicitly in three different histograms for small, medium and large scale visual words. Our first approach exploits multi-scale information with respect to the image size. In our second approach, we encode multi-scale information relative to the size of the bounding box of a person instance. In each approach, the multi-scale histograms are then concatenated into a single representation for action classification. We validate our approaches on the Willow dataset, which contains seven action categories: interacting with computer, photography, playing music, riding bike, riding horse, running and walking. Our results clearly suggest that the proposed scale coding approaches outperform the conventional scale-invariant technique. Moreover, we show that our approach obtains promising results compared to more complex state-of-the-art methods.
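The second scale-coding approach described above (binning descriptors into small, medium and large histograms relative to the person bounding box and concatenating them) can be sketched as follows; the band thresholds, vocabulary assignment function and normalisation are illustrative assumptions rather than the paper's exact settings:

```python
import numpy as np

def scale_coded_bow(features, scales, box_height, vocab_assign, vocab_size,
                    thresholds=(0.5, 1.5)):
    """Scale-coded bag-of-words for one person bounding box (hedged sketch).

    features     : iterable of local descriptors sampled on a multi-scale grid
    scales       : patch scale (in pixels) of each descriptor
    box_height   : height of the person bounding box (the reference size)
    vocab_assign : function mapping a descriptor to its visual-word index
    Instead of pooling all scales into one histogram, descriptors are binned
    into small / medium / large histograms relative to the box size, and the
    three histograms are concatenated.
    """
    hists = np.zeros((3, vocab_size))
    for f, s in zip(features, scales):
        rel = s / float(box_height)              # scale relative to the person
        band = 0 if rel < thresholds[0] else (1 if rel < thresholds[1] else 2)
        hists[band, vocab_assign(f)] += 1.0
    hists /= max(hists.sum(), 1.0)               # joint L1 normalisation
    return hists.ravel()                         # concatenated representation
```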
Exemplar-based Recognition of Human-Object Interactions
"... This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI
Phrasal recognition
IEEE Trans. PAMI
"... Abstract—In this paper, we introduce visual phrases, complex visual composites like “a person riding a horse. ” Visual phrases often display significantly reduced visual complexity compared to their component objects because the appearance of those objects can change profoundly when they participate ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract—In this paper, we introduce visual phrases, complex visual composites like “a person riding a horse.” Visual phrases often display significantly reduced visual complexity compared to their component objects because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multiclass detection system must decode detector outputs to produce final results; this is usually done with non-maximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference problems. We show this decoding procedure outperforms the state of the art. Finally, we show that decoding a combination of phrasal and object detectors produces real improvements in detector results. Index Terms—visual phrase, phrasal recognition, visual composites, object recognition, object interactions, scene understanding, single image activity recognition, object subcategories
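As a rough sketch of greedy, context-aware decoding of multiclass detections (as opposed to plain non-maximum suppression), the snippet below repeatedly commits to the top-scoring detection and rescores the remaining ones using an assumed co-occurrence boost; this is not the paper's exact decoding procedure:

```python
def greedy_context_decode(detections, context_boost, overlap, max_keep=50):
    """Greedy decoding of multiclass detections with local context (sketch).

    detections    : list of dicts {"box", "label", "score"}
    context_boost : function(label_kept, label_other) -> additive score change
                    for nearby detections (an assumed stand-in for learned
                    co-occurrence statistics between phrases and objects)
    overlap       : function(box_a, box_b) -> spatial affinity in [0, 1]
    Repeatedly commits to the current highest-scoring detection and rescores
    the rest based on their relation to it.
    """
    remaining = [dict(d) for d in detections]
    kept = []
    while remaining and len(kept) < max_keep:
        remaining.sort(key=lambda d: d["score"], reverse=True)
        best = remaining.pop(0)
        kept.append(best)
        for d in remaining:
            d["score"] += overlap(best["box"], d["box"]) * \
                          context_boost(best["label"], d["label"])
    return kept
```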
Unsupervised Spectral Dual Assignment Clustering of Human Actions in Context
"... A recent trend of research has shown how contex-tual information related to an action, such as a scene or object, can enhance the accuracy of human action recognition systems. However, using context to im-prove unsupervised human action clustering has never been considered before, and cannot be achi ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
A recent trend of research has shown how contextual information related to an action, such as a scene or object, can enhance the accuracy of human action recognition systems. However, using context to improve unsupervised human action clustering has never been considered before, and cannot be achieved using existing clustering methods. To solve this problem, we introduce a novel, general-purpose algorithm, Dual Assignment k-Means (DAKM), which is uniquely capable of performing two co-occurring clustering tasks simultaneously, while exploiting the correlation information to enhance both clusterings. Furthermore, we describe a spectral extension of DAKM (SDAKM) for better performance on realistic data. Extensive experiments on synthetic data and on three realistic human action datasets with scene context show that DAKM/SDAKM can significantly outperform the state-of-the-art clustering methods by taking into account the contextual relationship between actions and scenes.
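A hedged toy version of the dual-clustering idea, alternating k-means-style updates in an "action" view and a "scene" view while biasing assignments by the co-occurrence of clusters across views, is sketched below; the published DAKM/SDAKM objective and its spectral extension differ in the details:

```python
import numpy as np

def dual_assignment_kmeans(X_a, X_s, k_a, k_s, coupling=1.0, n_iter=20, seed=0):
    """Toy dual-assignment clustering of paired action/scene features.

    X_a, X_s : (n, d_a) and (n, d_s) feature matrices describing the same n
               clips in an "action" view and a "scene" view (assumed inputs).
    Alternates k-means-style updates in each view, biasing each sample toward
    clusters that co-occur frequently with its current cluster in the other
    view. Illustration only, not the published DAKM objective.
    """
    rng = np.random.default_rng(seed)
    z_a = rng.integers(0, k_a, len(X_a))
    z_s = rng.integers(0, k_s, len(X_s))
    for _ in range(n_iter):
        for X, z, k, z_other, k_other in ((X_a, z_a, k_a, z_s, k_s),
                                          (X_s, z_s, k_s, z_a, k_a)):
            # co-occurrence of (this view's cluster, other view's cluster)
            co = np.zeros((k, k_other)) + 1e-6
            for i in range(len(X)):
                co[z[i], z_other[i]] += 1.0
            co /= co.sum(axis=0, keepdims=True)
            # centroid update (re-seed empty clusters with a random sample)
            cent = np.stack([X[z == c].mean(axis=0) if np.any(z == c)
                             else X[rng.integers(len(X))] for c in range(k)])
            # assignment update: squared distance minus a co-occurrence bonus
            d = ((X[:, None, :] - cent[None, :, :]) ** 2).sum(-1)
            z[:] = np.argmin(d - coupling * np.log(co[:, z_other]).T, axis=1)
    return z_a, z_s
```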
Recognising Human-Object Interaction via Exemplar based Modelling
"... Human action can be recognised from a single still im-age by modelling Human-object interaction (HOI), which infers the mutual spatial structure information between hu-man and object as well as their appearance. Existing ap-proaches rely heavily on accurate detection of human and object, and estimat ..."
Abstract
- Add to MetaCart
(Show Context)
Human action can be recognised from a single still image by modelling human-object interaction (HOI), which infers the mutual spatial structure information between human and object as well as their appearance. Existing approaches rely heavily on accurate detection of human and object, and estimation of human pose. They are thus sensitive to large variations of human poses, occlusion and unsatisfactory detection of small-size objects. To overcome this limitation, a novel exemplar based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars, which are density functions describing, in a probabilistic way, how a person is interacting with a manipulated object for different activities. A representation based on our HOI exemplars thus has great potential for being robust to errors in human/object detection and pose estimation. A new framework, consisting of the proposed exemplar based HOI descriptor and an activity-specific matching model that learns the parameters, is formulated for robust human activity recognition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-of-the-art performance.
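To illustrate what a spatial interaction exemplar as a density function could look like, the sketch below fits a Gaussian kernel density over a simple relative person-object layout feature and scores candidate pairs with it; the layout parameterisation and the use of SciPy's gaussian_kde are assumptions, not the paper's model:

```python
import numpy as np
from scipy.stats import gaussian_kde

def relative_layout(person_box, object_box):
    """Object position/scale relative to the person box; boxes are (x, y, w, h)."""
    px, py, pw, ph = person_box
    ox, oy, ow, oh = object_box
    return np.array([(ox + ow / 2 - (px + pw / 2)) / pw,       # relative x offset
                     (oy + oh / 2 - (py + ph / 2)) / ph,       # relative y offset
                     np.log(max(ow * oh, 1e-6) / (pw * ph))])  # relative size

def fit_hoi_exemplar(person_boxes, object_boxes):
    """Fit one spatial interaction exemplar as a kernel density (sketch).

    A Gaussian KDE over a 3-D relative-layout feature stands in for the
    paper's learned density functions; needs several training pairs.
    """
    pts = np.stack([relative_layout(p, o)
                    for p, o in zip(person_boxes, object_boxes)], axis=1)
    return gaussian_kde(pts)  # callable: density of a candidate layout

def score_pair(exemplar, person_box, object_box):
    """Density of a candidate human-object pair under one exemplar."""
    return float(exemplar(relative_layout(person_box, object_box)[:, None])[0])
```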
DISS. ETH NO. 20612
2012
"... Modern Computer Vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm allowed the field to tremendously progress in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires ..."
Abstract
- Add to MetaCart
Modern computer vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm has allowed the field to progress tremendously in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires an expensive human annotation effort, preventing systems from scaling from the few dozen visual concepts that work today to thousands. The exponential growth of visual data available on the net represents an invaluable resource for visual learning algorithms and calls for new methods able to exploit this information to learn visual concepts without the need for major human annotation effort. As a first contribution, we introduce an approach for learning human actions as interactions between persons and objects in realistic images. By exploiting the spatial structure of human-object interactions, we are able to learn action models automatically from a set of still images annotated only with the action label (weakly supervised). Extensive experimental evaluation demonstrates that our weakly-supervised approach achieves the same performance as popular fully-supervised methods despite using substantially less supervision.
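As a generic illustration of weakly-supervised learning from image-level action labels, the sketch below alternates between selecting the best-scoring human-object candidate pair in each positive image and retraining a linear classifier (a multiple-instance style loop); the thesis' actual method and features are not reproduced here:

```python
import numpy as np
from sklearn.svm import LinearSVC

def weakly_supervised_hoi(images_pos, images_neg, n_rounds=5):
    """MIL-style alternation for learning one action from image-level labels.

    images_pos / images_neg : lists of images, each given as an array of
        candidate human-object pair features (n_candidates, d); only the
        image-level action label is known (weak supervision).
    Alternates between picking the best candidate pair per positive image
    under the current classifier and retraining the classifier.
    """
    # initialise with an arbitrary candidate per positive image
    chosen = [img[0] for img in images_pos]
    negatives = np.vstack([img for img in images_neg])
    clf = None
    for _ in range(n_rounds):
        X = np.vstack([np.vstack(chosen), negatives])
        y = np.hstack([np.ones(len(chosen)), np.zeros(len(negatives))])
        clf = LinearSVC().fit(X, y)
        # re-select the highest-scoring candidate pair in each positive image
        chosen = [img[np.argmax(clf.decision_function(img))] for img in images_pos]
    return clf
```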