Learning Human Activities and Object Affordances from RGB-D Videos (2012)

by Hema Swetha Koppula, Rudhir Gupta, Ashutosh Saxena

Results 1 - 10 of 59

Anticipating human activities using object affordances for reactive robotic response

by Hema S. Koppula, Ashutosh Saxena
"... Abstract—An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities will a human do next (and how) can enable an assistive robot to plan ahead for re ..."
Abstract - Cited by 44 (15 self) - Add to MetaCart
Abstract—An important aspect of human perception is anticipation, which we use extensively in our day-to-day activities when interacting with other humans as well as with our surroundings. Anticipating which activities a human will do next (and how) can enable an assistive robot to plan ahead for reactive responses in human environments. Furthermore, anticipation can even improve the detection accuracy of past activities. The challenge, however, is two-fold: we need to capture the rich context for modeling the activities and object affordances, and we need to anticipate the distribution over a large space of future human activities. In this work, we represent each possible future using an anticipatory temporal conditional random field (ATCRF) that models the rich spatial-temporal relations through object affordances. We then consider each ATCRF as a particle and represent the distribution over the potential futures using a set of particles. In extensive evaluation on the CAD-120 human activity RGB-D dataset, we first show that anticipation improves the state-of-the-art detection results. For new subjects (not seen in the training set), we obtain an activity anticipation accuracy (defined as whether one of the top three predictions actually happened) of 75.4%, 69.2% and 58.1% for anticipation times of 1, 3 and 10 seconds respectively. Finally, we also use our algorithm on a robot for performing a few reactive responses.
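
As a rough illustration of the particle idea in this abstract, here is a minimal Python sketch. It is not the authors' code: the activity list, the per-particle weights, and the scoring are invented stand-ins for ATCRF inference, but it shows the mechanics of representing the distribution over futures with particles and reading out a top-3 anticipation.

```python
# Minimal sketch of the particle-based anticipation idea. Each "particle"
# stands in for one sampled future (an ATCRF instance in the paper); here a
# future is just a hypothetical (activity, weight) pair.
import random
from collections import Counter

ACTIVITIES = ["reaching", "moving", "pouring", "eating", "drinking", "placing"]

def sample_future(rng):
    """Sample one possible future and a plausibility weight for it."""
    activity = rng.choice(ACTIVITIES)
    weight = rng.random()  # stand-in for an ATCRF energy/likelihood
    return activity, weight

def anticipate_top_k(num_particles=100, k=3, seed=0):
    """Represent the distribution over futures with particles, return top-k."""
    rng = random.Random(seed)
    scores = Counter()
    for _ in range(num_particles):
        activity, weight = sample_future(rng)
        scores[activity] += weight  # aggregate particle weights per activity
    return [a for a, _ in scores.most_common(k)]

# The paper's anticipation metric: correct if the observed next activity
# is among the top-3 predictions.
predicted = anticipate_top_k()
print("top-3 anticipated activities:", predicted)
print("correct?", "pouring" in predicted)
```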

Citation Context

...of past activities). There has been a significant amount of work in detecting human activities from 2D RGB videos [35, 29, 27], from inertial/location sensors [21], and more recently from RGB-D videos [19, 34, 25]. The primary approach in these works is to first convert the input sensor stream into a spatio-temporal representation, and then to infer labels over the inputs. These works use different types of in...

Hallucinated Humans as the Hidden Context for Labeling 3D Scenes

by Yun Jiang, Ashutosh Saxena
"... For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated onl ..."
Abstract - Cited by 30 (15 self) - Add to MetaCart
For scene understanding, one popular approach has been to model object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, a monitor and a keyboard are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also to use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state of the art.
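
A toy sketch of the hallucination cue, under heavy simplifying assumptions: the object layout, the preferred human-object distances, and the grid of candidate human positions below are all invented for illustration (the paper's IFTM is a nonparametric topic model, not this grid search).

```python
# Score candidate human placements by how well nearby objects match expected
# human-object distances, then keep the most plausible placement as context.
import math

# (x, y) positions of detected objects in a room, in meters (made up).
objects = {"monitor": (1.0, 2.0), "keyboard": (1.0, 1.6), "mug": (1.4, 1.7)}

# Assumed preferred distance from a seated human to each object type.
preferred_dist = {"monitor": 0.7, "keyboard": 0.3, "mug": 0.4}

def score_human_at(pos):
    """Higher when objects sit near their preferred distance from the human."""
    s = 0.0
    for name, (ox, oy) in objects.items():
        d = math.hypot(ox - pos[0], oy - pos[1])
        s += math.exp(-((d - preferred_dist[name]) ** 2) / 0.1)
    return s

# Hallucinate candidate human positions on a coarse grid and keep the best.
candidates = [(x / 4, y / 4) for x in range(12) for y in range(12)]
best = max(candidates, key=score_human_at)
print("most plausible hallucinated human position:", best)
```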

3D-based reasoning with blocks, support, and stability

by Zhaoyin Jia, Andrew Gallagher, Ashutosh Saxena, Tsuhan Chen - In CVPR, 2013
"... 3D volumetric reasoning is important for truly understanding a scene. Humans are able to both segment each object in an image, and perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other o ..."
Abstract - Cited by 23 (5 self) - Add to MetaCart
3D volumetric reasoning is important for truly understanding a scene. Humans are able both to segment each object in an image and to perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene by jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is one that fits the data well and is a stable, self-supporting (i.e., non-toppling) arrangement of objects. We experiment on several datasets, including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.
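
The stability intuition can be illustrated with a deliberately small sketch: blocks are reduced to 1-D extents, and an arrangement is called stable only if each stacked block's center of mass projects onto its support. All data are invented; the paper's actual method jointly optimizes segmentation, block fitting, support, and stability.

```python
# Toy stability check: a block arrangement is stable only if every stacked
# block's center of mass lies over the block below it (1-D simplification).
def supported(block, support):
    """True if block's center of mass projects inside the support's extent."""
    cx = (block["x0"] + block["x1"]) / 2.0
    return support["x0"] <= cx <= support["x1"]

def scene_stable(blocks):
    """Blocks are listed bottom-up; each must be supported by the one below."""
    for lower, upper in zip(blocks, blocks[1:]):
        if not supported(upper, lower):
            return False
    return True

# A straight stack (stable) vs. an overhanging block (unstable).
stack = [{"x0": 0.0, "x1": 1.0}, {"x0": 0.2, "x1": 0.8}]
overhang = [{"x0": 0.0, "x1": 1.0}, {"x0": 0.9, "x1": 1.9}]
print(scene_stable(stack))     # True
print(scene_stable(overhang))  # False: center of mass at 1.4 is past 1.0
```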

Citation Context

...has shown that integrating depth with color information improves many vision problems, such as segmentation [23], object recognition ([12], [13] and [17]), scene labeling [15], and activity detection [16]. These algorithms usually treat depth as another information channel without explicitly reasoning about the space that an object occupies. For example, if one object is partially observed, it remains...

Representing Videos using Mid-level Discriminative Patches

by Arpit Jain, Abhinav Gupta, Mikel Rodriguez, Larry S. Davis
"... How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatiotemporal patch in the vide ..."
Abstract - Cited by 22 (1 self) - Add to MetaCart
How should a video be represented? We propose a new representation for videos based on mid-level discriminative spatio-temporal patches. These spatio-temporal patches might correspond to a primitive human action, a semantic object, or perhaps a random but informative spatio-temporal patch in the video. What defines these spatio-temporal patches is their discriminative and representative properties. We automatically mine these patches from hundreds of training videos and experimentally demonstrate that these patches establish correspondence across videos and align the videos for label transfer techniques. Furthermore, these patches can be used as a discriminative vocabulary for action classification, where they demonstrate state-of-the-art performance on the UCF50 and Olympics datasets.
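
A much-simplified sketch of the mining step, with invented data: cluster patch descriptors, then rank clusters by how well a linear classifier separates each cluster from the rest, a proxy for "discriminative and representative". The features, thresholds, and classifier choice are assumptions, not the authors' pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
patches = rng.normal(size=(300, 64))  # stand-in spatio-temporal descriptors

# Candidate patch clusters (the paper mines these from hundreds of videos).
clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(patches)

# Score each cluster by how well "this cluster vs. rest" generalizes:
# a discriminative, representative cluster is easy to separate.
scores = {}
for c in range(10):
    members = clusters == c
    if members.sum() < 10:  # too small to be representative
        continue
    y = members.astype(int)
    scores[c] = cross_val_score(LinearSVC(max_iter=5000), patches, y, cv=3).mean()

top3 = sorted(scores, key=scores.get, reverse=True)[:3]
print("most discriminative patch clusters:", top3)
```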

Citation Context

...establishing correspondence. The third class of approaches is structural and decomposes videos into constituent parts. These parts typically correspond to semantic entities such as humans and objects [10, 34, 14]. While these approaches attempt to develop a rich representation and learn the structure of the videos in terms of constituent objects, one of their inherent drawbacks is that they are highly depende...

Deep Learning for Detecting Robotic Grasps

by Ian Lenz, Honglak Lee, Ashutosh Saxena
"... Abstract—We consider the problem of detecting robotic grasps in an RGB-D view of a scene containing objects. In this work, we apply a deep learning approach to solve this problem, which avoids time-consuming hand-design of features. This presents two main challenges. First, we need to evaluate a hug ..."
Abstract - Cited by 22 (6 self) - Add to MetaCart
Abstract—We consider the problem of detecting robotic grasps in an RGB-D view of a scene containing objects. In this work, we apply a deep learning approach to solve this problem, which avoids time-consuming hand-design of features. This presents two main challenges. First, we need to evaluate a huge number of candidate grasps. In order to make detection fast as well as robust, we present a two-step cascaded structure with two deep networks, where the top detections from the first are re-evaluated by the second. The first network has fewer features, is faster to run, and can effectively prune out unlikely candidate grasps. The second, with more features, is slower but needs to run only on the top few detections. Second, we need to handle multimodal inputs well, for which we present a method to apply structured regularization on the weights based on multimodal group regularization. We demonstrate that our method outperforms the previous state-of-the-art methods in robotic grasp detection.
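
The two-step cascade is easy to sketch. In the toy Python below, the "networks" are random linear models standing in for the paper's trained deep networks; only the control flow (a cheap scorer prunes, an expensive scorer re-ranks the survivors) reflects the described method.

```python
# Two-stage cascade: score all candidates with a fast model, then re-score
# only the top few with a slower, more expressive one.
import numpy as np

rng = np.random.default_rng(0)
candidates = rng.normal(size=(5000, 24))  # features of candidate grasp rectangles

W_small = rng.normal(size=24)        # cheap first-stage scorer
W_big = rng.normal(size=(24, 64))    # pricier second-stage "network"
v_big = rng.normal(size=64)

def stage1(X):
    return X @ W_small               # fast: one dot product per candidate

def stage2(X):
    return np.tanh(X @ W_big) @ v_big  # slower: hidden layer per candidate

# Prune with the cheap model, then re-evaluate only the top 100 candidates.
top = np.argsort(stage1(candidates))[-100:]
best = top[np.argmax(stage2(candidates[top]))]
print("selected grasp candidate:", best)
```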

Citation Context

...inexpensive depth sensors, RGB-D data has been a significant research focus in recent years for various applications. For example, Jiang et al. [15] consider robotic placement of objects, Koppula et al. [17] consider human activity detection, and Koppula et al. [16] consider object detection in 3D scenes. Most works with RGB-D data use hand-engineered features such as [32]. The few works that perform fea...

Learning Spatio-Temporal Structure from RGB-D Videos for Human Activity Detection and Anticipation

by Hema S. Koppula, Ashutosh Saxena
"... We consider the problem of detecting past activities as well as anticipating which activity will happen in the future and how. We start by modeling the rich spatio-temporal relations between human poses and objects (called affordances) using a conditional random field (CRF). However, because of the ..."
Abstract - Cited by 18 (5 self) - Add to MetaCart
We consider the problem of detecting past activities as well as anticipating which activity will happen in the future and how. We start by modeling the rich spatio-temporal relations between human poses and objects (called affordances) using a conditional random field (CRF). However, because of the ambiguity in the temporal segmentation of the sub-activities that constitute an activity, in the past as well as in the future, multiple graph structures are possible. In this paper, we reason over these alternate possibilities by considering multiple likely graph structures. We obtain an initial proposal graph by approximating the graph with only additive features, which lends itself to efficient dynamic programming; starting from this proposal graph structure, we then design moves to obtain several other likely graph structures. We show that our approach significantly improves the state of the art for detecting past activities as well as for anticipating future activities, on a dataset of 120 activity videos collected from four subjects.
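
The additive-features observation is what makes dynamic programming work here: if a segmentation's score is a sum of per-segment scores, the best temporal segmentation obeys a classic O(T^2) recurrence. The sketch below uses a made-up per-segment coherence score plus a constant per-segment cost; it illustrates the recurrence, not the paper's CRF terms.

```python
def segment_score(frames, i, j):
    """Toy score for making frames[i:j] one sub-activity: reward homogeneity,
    pay a constant cost per segment so the DP does not over-segment."""
    seg = frames[i:j]
    mean = sum(seg) / len(seg)
    return -sum((f - mean) ** 2 for f in seg) - 0.5

def best_segmentation(frames, max_len=10):
    T = len(frames)
    best = [float("-inf")] * (T + 1)  # best[j] = best score for frames[:j]
    back = [0] * (T + 1)
    best[0] = 0.0
    for j in range(1, T + 1):
        for i in range(max(0, j - max_len), j):
            s = best[i] + segment_score(frames, i, j)
            if s > best[j]:
                best[j], back[j] = s, i
    cuts, j = [], T  # walk backpointers to recover segment boundaries
    while j > 0:
        cuts.append((back[j], j))
        j = back[j]
    return list(reversed(cuts))

frames = [0.1, 0.2, 0.1, 5.0, 5.2, 4.9, 1.0, 1.1]  # fake 1-D feature track
print(best_segmentation(frames))  # splits at the jumps: [(0, 3), (3, 6), (6, 8)]
```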

Infinite latent conditional random fields for modeling environments through humans

by Yun Jiang, Ashutosh Saxena - In RSS, 2013
"... Abstract—Humans cast a substantial influence on their en-vironments by interacting with it. Therefore, even though an environment may physically contain only objects, it cannot be modeled well without considering humans. In this paper, we model environments not only through objects, but also through ..."
Abstract - Cited by 15 (4 self) - Add to MetaCart
Abstract—Humans exert a substantial influence on their environments by interacting with them. Therefore, even though an environment may physically contain only objects, it cannot be modeled well without considering humans. In this paper, we model environments not only through objects, but also through latent human poses and human-object interactions. However, the number of potential human poses is large and unknown, and the human-object interactions vary not only in type but also in which human pose relates to each object. In order to handle such properties, we present Infinite Latent Conditional Random Fields (ILCRFs), which model a scene as a mixture of CRFs generated from Dirichlet processes. Each CRF represents one possible explanation of the scene. In addition to visible object nodes and edges, the model generatively captures the distribution of different CRF structures over the latent human nodes and corresponding edges. We apply the model to the challenging application of robotic scene arrangement. In extensive experiments, we show that our model significantly outperforms state-of-the-art results. We further use our algorithm on a robot for placing objects in a new scene.
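
The Dirichlet-process ingredient can be illustrated with a Chinese Restaurant Process sampler: each new sample joins an existing mixture component (here, standing in for a CRF structure) with probability proportional to its popularity, or opens a new one with probability proportional to alpha. This toy shows only the nonparametric prior, not ILCRF inference.

```python
# Chinese Restaurant Process: the number of mixture components is unbounded
# and adapts to the data, as in a Dirichlet-process mixture.
import random

def chinese_restaurant_process(n_samples, alpha, seed=0):
    rng = random.Random(seed)
    assignments = []   # component index chosen for each sample
    counts = []        # how many samples each component already has
    for _ in range(n_samples):
        total = sum(counts) + alpha
        r = rng.uniform(0, total)
        acc = 0.0
        for k, c in enumerate(counts):
            acc += c
            if r < acc:
                assignments.append(k)
                counts[k] += 1
                break
        else:  # r fell in the alpha mass: create a new component
            assignments.append(len(counts))
            counts.append(1)
    return assignments, counts

assignments, counts = chinese_restaurant_process(20, alpha=1.0)
print("component sizes:", counts)  # distinct "CRF structures" grow with the data
```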

Citation Context

...(Dynamic) CRFs [33] substitute every label with a Markov network structure to allow structured labeling, especially for sequential data (such as labeling objects and actions simultaneously in video sequences [18, 21]). However, the labels and hidden states are discrete and take only a finite number of values. In contemporary work, Bousmalis et al. [4] present a model that shares a name similar to ours, but is quite...

Modeling 4D Human-Object Interactions for Event and Object Recognition

by Ping Wei, Yibiao Zhao, Nanning Zheng, Song-Chun Zhu
"... Recognizing the events and objects in the video sequence are two challenging tasks due to the complex temporal structures and the large appearance variations. In this paper, we propose a 4D human-object interaction model, where the two tasks jointly boost each other. Our human-object interaction is ..."
Abstract - Cited by 12 (2 self) - Add to MetaCart
Recognizing the events and objects in a video sequence poses two challenging tasks due to complex temporal structures and large appearance variations. In this paper, we propose a 4D human-object interaction model in which the two tasks jointly boost each other. Our human-object interaction is defined in 4D space: i) the co-occurrence and geometric constraints of human pose and object in 3D space; ii) the sub-event transitions and object coherence in the 1D temporal dimension. We represent the structure of events, sub-events and objects in a hierarchical graph. For an input RGB-depth video, we design a dynamic programming beam search algorithm to: i) segment the video, ii) recognize the events, and iii) detect the objects simultaneously. For evaluation, we built a large-scale multi-view 3D event dataset which contains 3815 video sequences and 383,036 RGB-D frames captured by Kinect cameras. The experimental results on this dataset show the effectiveness of our method.
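
A skeletal version of a dynamic-programming beam search, with placeholder labels and scores: partial labelings are extended one frame at a time and only the best B hypotheses survive each step. Nothing below is from the paper except the control structure.

```python
import heapq

LABELS = ["reach", "pour", "drink"]

def extend_score(prev_labels, t, label):
    """Toy local score; a real system would score pose/object evidence at t."""
    base = {"reach": 0.3, "pour": 0.2, "drink": 0.1}[label]
    stay_bonus = 0.5 if prev_labels and prev_labels[-1] == label else 0.0
    return base + stay_bonus

def beam_search(num_frames, beam_width=4):
    beam = [(0.0, [])]  # (score, per-frame labels so far)
    for t in range(num_frames):
        expanded = [
            (score + extend_score(labels, t, lab), labels + [lab])
            for score, labels in beam
            for lab in LABELS
        ]
        # Prune: keep only the beam_width highest-scoring partial hypotheses.
        beam = heapq.nlargest(beam_width, expanded, key=lambda h: h[0])
    return beam[0]

score, labeling = beam_search(num_frames=8)
print(round(score, 2), labeling)  # the all-"reach" labeling wins in this toy
```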

Citation Context

...The experimental results demonstrate the strength of our model.

1.1. Related Work

Human-object context. In recent years, many works have applied human-object mutual context to event and object recognition [2, 5, 7, 9, 11, 14, 15, 23, 24]. Gupta et al. [5] combined the spatial and functional constraints between humans and objects to recognize actions and objects. Yao and Fei-Fei [24] modeled the relations between actions, objects, and pos...

Fusing Spatiotemporal Features and Joints for 3D Action Recognition.

by Yu Zhu, Wenbin Chen, Guodong Guo - In CVPRW, 2013
"... Abstract We present a novel approach to 3D ..."
Abstract - Cited by 12 (0 self) - Add to MetaCart
Abstract—We present a novel approach to 3D ...

Citation Context

...ps [27] 91.6% Our approach 94.3%

Table 3. Performance on the CAD-60 dataset using different methods.

Method               Precision / Recall
J. Sung et al. [17]  67.9% / 55.5%
X. Yang et al. [28]  71.9% / 66.6%
Koppula et al. [18]  80.8% / 71.4%
Our approach         93.2% / 84.6%

...process. These two features can be complementary to each other, and an efficient combination of them can improve 3D action recognition accuracy. We have con...

Hierarchical Semantic Labeling for Task-Relevant RGB-D Perception

by Chenxia Wu, Ian Lenz, Ashutosh Saxena - In RSS, 2014
"... Abstract—Semantic labeling of RGB-D scenes is very impor-tant in enabling robots to perform mobile manipulation tasks, but different tasks may require entirely different sets of labels. For example, when navigating to an object, we may need only a single label denoting its class, but to manipulate i ..."
Abstract - Cited by 10 (7 self) - Add to MetaCart
Abstract—Semantic labeling of RGB-D scenes is very important in enabling robots to perform mobile manipulation tasks, but different tasks may require entirely different sets of labels. For example, when navigating to an object, we may need only a single label denoting its class, but to manipulate it, we might need to identify individual parts. In this work, we present an algorithm that produces hierarchical labelings of a scene, following is-part-of and is-type-of relationships. Our model is based on a Conditional Random Field that relates pixel-wise and pair-wise observations to labels. We encode hierarchical labeling constraints into the model while keeping inference tractable. Our model thus predicts different specificities in labeling based on its confidence: if it is not sure whether an object is Pepsi or Sprite, it will predict soda rather than making an arbitrary choice. In extensive experiments, both offline on standard datasets and in online robotic experiments, we show that our model outperforms other state-of-the-art methods in labeling performance as well as in success rate for robotic tasks.
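
The confidence-based specificity rule suggests a small sketch: walk down an is-type-of tree and commit to a child label only while the classifier is confident enough, otherwise stop at the safer parent (soda rather than a brand guess). The tree and confidence values below are invented.

```python
# Pick the most specific label the model is still confident about.
TREE = {"object": ["furniture", "soda"], "soda": ["Pepsi", "Sprite"]}

# Stand-in per-label confidences, as a real model might output per segment.
CONFIDENCE = {"object": 0.99, "furniture": 0.05, "soda": 0.9,
              "Pepsi": 0.45, "Sprite": 0.4}

def most_specific_label(root="object", threshold=0.6):
    label = root
    while label in TREE:
        best_child = max(TREE[label], key=CONFIDENCE.get)
        if CONFIDENCE[best_child] < threshold:
            break  # unsure which child: stay at the more general label
        label = best_child
    return label

print(most_specific_label())  # "soda": confident it's soda, unsure which brand
```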

Citation Context

...Scene understanding from 2D images has been widely explored [41, 43, 12, 8]. Due to the availability of affordable RGB-D sensors, significant effort has recently been put into RGB-D scene understanding [45, 37, 24, 31, 4, 25, 14, 20, 19, 17, 27]. Ren et al. [37] developed Kernel Descriptors, a highly useful RGB-D feature, and used a segmentation tree to obtain contextual information. Gupta et al. [14] generalized 2D gPb-ucm contour detection to...
