Weakly supervised learning of interactions between humans and objects (2011)

by A Prest, C Schmid, V Ferrari
Results 11 - 20 of 57

Talking heads: Detecting humans and recognizing their interactions

by Minh Hoai, Andrew Zisserman - In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2014
"... The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard ar-rangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contribution ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
The objective of this work is to accurately and efficiently detect configurations of one or more people in edited TV material. Such configurations often appear in standard arrangements due to cinematic style, and we take advantage of this to provide scene context. We make the following contributions: first, we introduce a new learnable context-aware configuration model for detecting sets of people in TV material that predicts the scale and location of each upper body in the configuration; second, we show that inference of the model can be solved globally and efficiently using dynamic programming, and implement a maximum margin learning framework; and third, we show that the configuration model substantially outperforms a Deformable Part Model (DPM) for predicting upper-body locations in video frames, even when the DPM is equipped with the context of other upper bodies. Experiments are performed over two datasets: the TV Human Interaction dataset, and 150 episodes from four different TV shows. We also demonstrate the benefits of the model in recognizing interactions in TV shows.
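
The globally optimal inference mentioned above can be illustrated with a standard chain-structured Viterbi pass over candidate upper-body states, as in the Python sketch below. The unary and pairwise score tables are hypothetical placeholders for the paper's learned configuration model.

```python
# A chain-structured Viterbi sketch of globally optimal configuration
# inference: pick one (location, scale) state per person slot maximising
# unary plus pairwise scores. The score tables are placeholders for the
# paper's learned context-aware model.
import numpy as np

def best_configuration(unary, pairwise):
    """unary[k]: (S_k,) scores for slot k; pairwise[k]: (S_k, S_{k+1})
    compatibility scores between adjacent slots."""
    score = unary[0].astype(float)
    backptr = []
    for k in range(1, len(unary)):
        total = score[:, None] + pairwise[k - 1] + unary[k][None, :]
        backptr.append(total.argmax(axis=0))   # best predecessor per state
        score = total.max(axis=0)
    states = [int(score.argmax())]             # trace back the best path
    for bp in reversed(backptr):
        states.append(int(bp[states[-1]]))
    return list(reversed(states)), float(score.max())
```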

Citation Context

...recognizes people configurations with little additional processing time, relative to an individual-person detector. The output of a human UB detector can be enhanced in a number of ways: Prest et al. [18] combine a face detector with two UB detectors. Patron-Perez et al. [16] use tracking to link upper bodies between consecutive frames, subsequently discarding some false positives. These approaches ar...

Actions in the eye: Dynamic gaze datasets and learnt saliency models for visual recognition

by Stefan Mathe, Cristian Sminchisescu - University of Bonn
"... Abstract—Systems based on bag-of-words models from image features collected at maxima of sparse interest point operators have been used successfully for both computer visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with v ..."
Abstract - Cited by 3 (1 self) - Add to MetaCart
Abstract—Systems based on bag-of-words models, with image features collected at the maxima of sparse interest point operators, have been used successfully for both visual object and action recognition tasks. While the sparse, interest-point based approach to recognition is not inconsistent with visual processing in biological systems that operate in ‘saccade and fixate’ regimes, the methodology and emphasis in the human and the computer vision communities remain sharply distinct. Here, we make three contributions aiming to bridge this gap. First, we complement existing state-of-the-art large-scale dynamic computer vision annotated datasets like Hollywood-2 [1] and UCF Sports [2] with human eye movements collected under the ecological constraints of visual action and scene context recognition tasks. To our knowledge these are the first large human eye tracking datasets for video to be collected and made publicly available (vision.imar.ro/eyetracking; 497,107 frames, each viewed by 19 subjects), unique in terms of their (a) large scale and computer vision relevance, (b) dynamic, video stimuli, and (c) task control as well as free viewing. Second, we introduce novel dynamic consistency and alignment measures, which underline the remarkable stability of patterns of visual search among subjects. Third, we leverage the significant amount of collected data in order to pursue studies and build automatic, end-to-end trainable computer vision systems based on human eye movements. Our studies not only shed light on the differences between computer vision spatio-temporal interest point image sampling strategies and human fixations, as well as their impact on visual recognition performance, but also demonstrate that human fixations can be accurately predicted, and, when used in an end-to-end automatic system leveraging some of the advanced computer vision practice, can lead to state-of-the-art results. Index Terms—visual action recognition, human eye movements, consistency analysis, saliency prediction, large scale learning
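
The fixation-driven sampling behind such end-to-end systems can be illustrated with a short sketch that draws descriptor locations from a predicted saliency map instead of a uniform dense grid; the map and function name are our own illustrative assumptions.

```python
# A sketch of fixation-driven sampling: draw descriptor locations from a
# predicted saliency/fixation map rather than a uniform dense grid. The
# map and function name are illustrative assumptions.
import numpy as np

def sample_fixation_points(saliency, n_points, seed=None):
    """saliency: non-negative (H, W) map; returns (n_points, 2) row/col
    coordinates sampled proportionally to saliency."""
    rng = np.random.default_rng(seed)
    p = saliency.ravel().astype(np.float64)
    p /= p.sum()                               # normalise to a distribution
    idx = rng.choice(p.size, size=n_points, p=p)
    ys, xs = np.unravel_index(idx, saliency.shape)
    return np.stack([ys, xs], axis=1)
```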

Citation Context

...AOIs are semantically meaningful and tend to correspond to physical objects. Interestingly, this supports recent computer vision strategies based on object detectors for action recognition [8], [46], [47], [48]. Automatically Finding AOIs: Defining areas of interest manually is labour intensive, especially in the video domain. Therefore, we introduce an automatic method for determining their locations...

Learning semantic relationships for better action retrieval in images

by Vignesh Ramanathan, Congcong Li, Jia Deng, Wei Han, Zhen Li, Kunlong Gu, Yang Song, Samy Bengio, Chuck Rosenberg, Li Fei-Fei
"... Human actions capture a wide variety of interactions between people and objects. As a result, the set of possi-ble actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by lever-aging the r ..."
Abstract - Cited by 2 (1 self) - Add to MetaCart
Human actions capture a wide variety of interactions between people and objects. As a result, the set of possible actions is extremely large and it is difficult to obtain sufficient training examples for all actions. However, we could compensate for this sparsity in supervision by leveraging the rich semantic relationships between different actions. A single action is often composed of other smaller actions and is exclusive of certain others. We need a method which can reason about such relationships and extrapolate unobserved actions from known actions. Hence, we propose a novel neural network framework which jointly extracts the relationships between actions and uses them for training better action retrieval models. Our model incorporates linguistic, visual and logical consistency based cues to effectively identify these relationships. We train and test our model on a large-scale image dataset of human actions. We show a significant improvement in mean AP compared to different baseline methods including the HEX-graph approach from Deng et al. [8].

Citation Context

...n actions by minimizing a joint objective across all actions, and learn models for action retrieval. Action recognition Action recognition in images has been widely studied in different works such as [15, 27, 32, 41, 42]. They focus on improving performance for a small hand-crafted dataset of mutually exclusive actions such as the PASCAL actions and Stanford 40 actions [10, 43]. Most methods [15, 27, 42] try to impro...

Scale Coding Bag-of-Words for Action Recognition

by Fahad Shahbaz Khan, Joost Van De Weijer, Andrew D. Bagdanov, Michael Felsberg
"... Abstract—Recognizing human actions in still images is a challenging problem in computer vision due to significant amount of scale, illumination and pose variation. Given the bounding box of a person both at training and test time, the task is to classify the action associated with each bounding box ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
Abstract—Recognizing human actions in still images is a challenging problem in computer vision due to the significant amount of scale, illumination and pose variation. Given the bounding box of a person at both training and test time, the task is to classify the action associated with each bounding box in an image. Most state-of-the-art methods use the bag-of-words paradigm for action recognition. The bag-of-words framework employing a dense multi-scale grid sampling strategy is the de facto standard for feature detection. This results in a scale-invariant image representation where all the features at multiple scales are binned in a single histogram. We argue that such a scale-invariant strategy is sub-optimal since it ignores the multi-scale information available with each bounding box of a person. This paper investigates alternative approaches to scale coding for action recognition in still images. We encode multi-scale information explicitly in three different histograms for small, medium and large scale visual words. Our first approach exploits multi-scale information with respect to the image size. In our second approach, we encode multi-scale information relative to the size of the bounding box of a person instance. In each approach, the multi-scale histograms are then concatenated into a single representation for action classification. We validate our approaches on the Willow dataset, which contains seven action categories: interacting with computer, photography, playing music, riding bike, riding horse, running and walking. Our results clearly suggest that the proposed scale coding approaches outperform the conventional scale-invariant technique. Moreover, we show that our approach obtains promising results compared to more complex state-of-the-art methods.
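
The second, bounding-box-relative approach lends itself to a direct sketch: each visual word is routed to a small, medium or large histogram according to its feature scale divided by the person bounding-box height, and the three histograms are concatenated. The thresholds below are illustrative assumptions, not the paper's values.

```python
# A sketch of relative scale coding: visual words are routed into
# small/medium/large histograms by feature scale divided by the person
# bounding-box height, then concatenated. Threshold values are
# illustrative assumptions, not the paper's.
import numpy as np

def scale_coded_bow(words, scales, box_height, vocab_size,
                    thresholds=(0.1, 0.3)):
    """words: (N,) visual-word indices; scales: (N,) feature scales in
    pixels; box_height: person bounding-box height in pixels."""
    rel = np.asarray(scales, dtype=float) / box_height
    bins = np.digitize(rel, thresholds)        # 0=small, 1=medium, 2=large
    hists = np.zeros((3, vocab_size))
    for w, b in zip(words, bins):
        hists[b, w] += 1
    return hists.ravel()                       # 3 * vocab_size representation
```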

Exemplar-based Recognition of Human-Object Interactions

by Jian-fang Hu, Wei-shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang
"... This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI ..."
Abstract - Cited by 2 (2 self) - Add to MetaCart
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI

Citation Context

...on recognition, exemplar modelling I. INTRODUCTION Recently the problem of recognising the action of a person who is manipulating objects from a single image or video has received increasing interest [1], [2], [3], [4]. In this context, the action is regarded as the Human-Object Interaction (HOI). For example, the action “playing a guitar” can be described as a human holding a guitar under some certain po...

Phrasal recognition

by Ali Farhadi, Mohammad Amin Sadeghi - IEEE Trans. PAMI
"... Abstract—In this paper, we introduce visual phrases, complex visual composites like “a person riding a horse. ” Visual phrases often display significantly reduced visual complexity compared to their component objects because the appearance of those objects can change profoundly when they participate ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
Abstract—In this paper, we introduce visual phrases, complex visual composites like “a person riding a horse.” Visual phrases often display significantly reduced visual complexity compared to their component objects because the appearance of those objects can change profoundly when they participate in relations. We introduce a dataset suitable for phrasal recognition that uses familiar PASCAL object categories, and demonstrate significant experimental gains resulting from exploiting visual phrases. We show that a visual phrase detector significantly outperforms a baseline which detects component objects and reasons about relations, even though visual phrase training sets tend to be smaller than those for objects. We argue that any multiclass detection system must decode detector outputs to produce final results; this is usually done with non-maximum suppression. We describe a novel decoding procedure that can account accurately for local context without solving difficult inference problems. We show this decoding procedure outperforms the state of the art. Finally, we show that decoding a combination of phrasal and object detectors produces real improvements in detector results. Index Terms—Visual phrase, phrasal recognition, visual composites, object recognition, object interactions, scene understanding, single image activity recognition, object subcategories
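
The decoding procedure, which accounts for local context without hard non-maximum suppression or difficult inference, can be illustrated with a greedy sketch in which each accepted detection rescores the remaining candidates through a pairwise context term; the pairwise function is a hypothetical stand-in for the paper's learned model.

```python
# A greedy context-aware decoding sketch: accepted detections rescore the
# remaining candidates via a pairwise context term instead of hard
# suppression. `pairwise` is a hypothetical stand-in for a learned
# local-context model.
import numpy as np

def decode(scores, pairwise, accept_thresh=0.0):
    """scores: (N,) detector scores; pairwise(i, j): context adjustment
    that accepted detection i applies to candidate j (negative values
    act like soft suppression of overlapping candidates)."""
    scores = np.asarray(scores, dtype=float).copy()
    remaining = set(range(len(scores)))
    accepted = []
    while remaining:
        i = max(remaining, key=lambda k: scores[k])
        if scores[i] < accept_thresh:
            break
        accepted.append(i)
        remaining.discard(i)
        for j in remaining:                    # rescore rather than delete
            scores[j] += pairwise(i, j)
    return accepted
```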

Unsupervised Spectral Dual Assignment Clustering of Human Actions in Context

by Simon Jones, Ling Shao
"... A recent trend of research has shown how contex-tual information related to an action, such as a scene or object, can enhance the accuracy of human action recognition systems. However, using context to im-prove unsupervised human action clustering has never been considered before, and cannot be achi ..."
Abstract - Cited by 1 (0 self) - Add to MetaCart
A recent trend of research has shown how contextual information related to an action, such as a scene or object, can enhance the accuracy of human action recognition systems. However, using context to improve unsupervised human action clustering has never been considered before, and cannot be achieved using existing clustering methods. To solve this problem, we introduce a novel, general purpose algorithm, Dual Assignment k-Means (DAKM), which is uniquely capable of performing two co-occurring clustering tasks simultaneously, while exploiting the correlation information to enhance both clusterings. Furthermore, we describe a spectral extension of DAKM (SDAKM) for better performance on realistic data. Extensive experiments on synthetic data and on three realistic human action datasets with scene context show that DAKM/SDAKM can significantly outperform the state-of-the-art clustering methods by taking into account the contextual relationship between actions and scenes.
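
A minimal alternating sketch of the dual-assignment idea: two k-means problems over the same samples (for example actions and scenes), coupled through a co-occurrence term so that each assignment also favours cluster pairs that occur together. This is a simplification for illustration, not the authors' exact DAKM update.

```python
# An alternating sketch of dual-assignment clustering: two k-means
# objectives over the same samples, coupled by a co-occurrence prior so
# each assignment also favours action/scene cluster pairs that occur
# together. A simplification for illustration, not the exact DAKM update.
import numpy as np

def dual_kmeans(X, Y, ka, ks, lam=1.0, n_iter=20, seed=0):
    """X: (N, Dx) action features; Y: (N, Dy) scene/context features."""
    rng = np.random.default_rng(seed)
    N = len(X)
    a = rng.integers(ka, size=N)                 # action assignments
    s = rng.integers(ks, size=N)                 # scene assignments
    for _ in range(n_iter):
        # centroid updates (re-seed any empty cluster from a random sample)
        mu = np.stack([X[a == c].mean(0) if (a == c).any()
                       else X[rng.integers(N)] for c in range(ka)])
        nu = np.stack([Y[s == c].mean(0) if (s == c).any()
                       else Y[rng.integers(N)] for c in range(ks)])
        C = np.ones((ka, ks))                    # smoothed co-occurrences
        np.add.at(C, (a, s), 1)
        logC = np.log(C / C.sum())
        da = ((X[:, None] - mu[None]) ** 2).sum(-1)   # (N, ka) distances
        ds = ((Y[:, None] - nu[None]) ** 2).sum(-1)   # (N, ks) distances
        a = (da - lam * logC[:, s].T).argmin(1)  # couple via co-occurrence
        s = (ds - lam * logC[a, :]).argmin(1)
    return a, s
```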

Citation Context

...inbis and Sclaroff [3] go further, combining object, scene and action information in a multiple instances learning framework, to improve the classification performance of YouTube videos. Prest et al. [10] use a weakly supervised framework to learn the interaction between human actions and the objects in the scene, in particular learning the spatial relationship between actions and objects. All of thes...

Recognising Human-Object Interaction via Exemplar based Modelling

by Jian-Fang Hu, Wei-Shi Zheng, Jianhuang Lai, Shaogang Gong, Tao Xiang
"... Human action can be recognised from a single still im-age by modelling Human-object interaction (HOI), which infers the mutual spatial structure information between hu-man and object as well as their appearance. Existing ap-proaches rely heavily on accurate detection of human and object, and estimat ..."
Abstract - Add to MetaCart
Human action can be recognised from a single still image by modelling Human-Object Interaction (HOI), which infers the mutual spatial structure of human and object as well as their appearance. Existing approaches rely heavily on accurate detection of the human and object, and on estimation of human pose. They are thus sensitive to large variations of human pose, occlusion, and unsatisfactory detection of small objects. To overcome this limitation, a novel exemplar-based approach is proposed in this work. Our approach learns a set of spatial pose-object interaction exemplars: density functions describing, probabilistically, how a person interacts spatially with a manipulated object for different activities. A representation based on our HOI exemplars thus has great potential for being robust to errors in human/object detection and pose estimation. A new framework, consisting of the proposed exemplar-based HOI descriptor and an activity-specific matching model that learns the parameters, is formulated for robust human activity recognition. Experiments on two benchmark activity datasets demonstrate that the proposed approach obtains state-of-the-art performance.
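
One exemplar, viewed as a spatial density, can be sketched as a kernel density estimate over the object's centre offset relative to the person box, evaluated as a probability-like score at test time; the box convention and normalisation below are illustrative assumptions, not the paper's exact formulation.

```python
# One exemplar as a spatial density: a Gaussian kernel density estimate
# over the object's centre offset relative to the person box, evaluated
# as a probability-like score at test time. Box convention and
# normalisation are illustrative assumptions.
import numpy as np
from scipy.stats import gaussian_kde

def fit_exemplar(person_boxes, object_boxes):
    """Boxes: (N, 4) arrays of [x, y, w, h] from training images of one
    activity; needs several samples for a stable 2-D density."""
    pc = person_boxes[:, :2] + person_boxes[:, 2:] / 2   # person centres
    oc = object_boxes[:, :2] + object_boxes[:, 2:] / 2   # object centres
    offsets = (oc - pc) / person_boxes[:, 2:]            # size-normalised
    return gaussian_kde(offsets.T)                       # 2-D KDE

def score_interaction(exemplar, person_box, object_box):
    pc = person_box[:2] + person_box[2:] / 2
    oc = object_box[:2] + object_box[2:] / 2
    off = (oc - pc) / person_box[2:]
    return float(exemplar(off))                          # density at offset
```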

Citation Context

...asets demonstrate that the proposed approach obtains state-of-the-art performance. 1. Introduction Recently the problem of recognising human action from a single image has received increasing interest [23, 1, 5, 21]. In this context, action can be defined as the Human-Object Interaction (HOI). Existing approaches focus on modelling the co-occurrence or spatial relationship between human and the manipulated objec...

Grenoble - Rhône-Alpes THEME

by Université Joseph Fourier
"... ..."
Abstract - Add to MetaCart
Abstract not found

Citation Context

...large scale learning. In NIPS, 2007. 1, 4 [9] K. Chatfield, V. Lempitsky, A. Vedaldi, and A. Zisserman. The devil is in the details: an evaluation of recent feature encoding methods. In BMVC, 2011. 1, 5 [10] K. Crammer and Y. Singer. On the algorithmic implementation of multiclass kernel-based vector machines. JMLR, 2001. 1, 3 8 training. In CVPR, 2011. 1, 2, 4, 7 [20] D. Lowe. Distinctive image features f...

DISS. ETH NO. 20612

by Alessandro Prest, 2012
"... Modern Computer Vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm allowed the field to tremendously progress in the last decade, it has now become one of its major bottlenecks. Teaching a new visual concept requires ..."
Abstract - Add to MetaCart
Modern computer vision systems learn visual concepts through examples (i.e. images) which have been manually annotated by humans. While this paradigm has allowed the field to progress tremendously in the last decade, it has now become one of its major bottlenecks: teaching a new visual concept requires an expensive human annotation effort, preventing systems from scaling beyond the few dozen visual concepts that work today to the thousands that are needed. The exponential growth of visual data available on the net represents an invaluable resource for visual learning algorithms and calls for new methods able to exploit this information to learn visual concepts without major human annotation effort. As a first contribution, we introduce an approach for learning human actions as interactions between persons and objects in realistic images. By exploiting the spatial structure of human-object interactions, we are able to learn action models automatically from a set of still images annotated only with the action label (weakly supervised). Extensive experimental evaluation demonstrates that our weakly supervised approach achieves the same performance as popular fully supervised methods despite using substantially less supervision.