Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
"... For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated onl ..."
Abstract
-
Cited by 30 (15 self)
- Add to MetaCart
(Show Context)
For scene understanding, one popular approach has been to model the object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, the objects, monitor and keyboard, are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also use it as a cue for labeling the scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to hallucinate the possible configurations of the humans in the scene parsimoniously. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
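To make the factored-topics idea concrete, here is a minimal generative sketch: humans are assigned to pose topics via a Chinese restaurant process (echoing the "infinite" part), and objects are then placed from pose-relative offset distributions. The `relations` table, the offsets, and all numbers are invented for illustration; this is not the paper's actual IFTM inference.

```python
# Minimal generative sketch of factored topics (assumed, illustrative values).
import numpy as np

rng = np.random.default_rng(0)

def sample_pose_topics(n_humans, alpha=1.0):
    """Chinese-restaurant-process assignment of humans to pose topics."""
    counts, assignments = [], []
    for _ in range(n_humans):
        probs = np.array(counts + [alpha], dtype=float)
        k = rng.choice(len(probs), p=probs / probs.sum())
        if k == len(counts):
            counts.append(1)        # open a new pose topic
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# Human-object relationship "topics": pose-relative offset distributions.
relations = {
    "monitor":  np.array([0.0, 0.6]),
    "keyboard": np.array([0.0, 0.3]),
}

def sample_scene(human_locations, noise=0.05):
    """Place objects around each (hidden) human location."""
    objects = []
    for loc in human_locations:
        for name, offset in relations.items():
            pos = np.asarray(loc) + offset + rng.normal(0, noise, size=2)
            objects.append((name, pos.round(2)))
    return objects

print(sample_pose_topics(5))
print(sample_scene([(1.0, 0.0), (3.0, 0.0)]))
```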
People Watching -- Human Actions as a Cue for Single View Geometry
"... We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene un-derstanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These c ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene un-derstanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene under-standing approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Inter-net. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.
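A hedged sketch of how a single detected pose might translate into geometric constraints, in the spirit of the paper: the `Pose3D` fields, action labels, and thresholds are assumptions for illustration, not the authors' implementation.

```python
# Turning a detected human pose into scene-geometry constraints (sketch).
from dataclasses import dataclass

@dataclass
class Pose3D:
    feet_y: float   # height of the feet (metres, up = +y)
    hips_y: float   # height of the hips
    action: str     # e.g. "standing", "sitting"

def geometry_constraints(pose: Pose3D):
    """Derive support-surface / free-space constraints from one pose."""
    constraints = []
    # Feet rest on a walkable surface -> floor hypothesis at that height.
    constraints.append(("floor_at", pose.feet_y))
    if pose.action == "sitting":
        # A sitting person implies a sittable surface under the hips.
        constraints.append(("support_surface_at", pose.hips_y))
    # The body's volume must be free of other scene geometry.
    constraints.append(("free_space", pose.feet_y, pose.hips_y + 0.9))
    return constraints

print(geometry_constraints(Pose3D(feet_y=0.0, hips_y=0.45, action="sitting")))
```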
Infinite latent conditional random fields for modeling environments through humans
in RSS, 2013
"... Abstract—Humans cast a substantial influence on their en-vironments by interacting with it. Therefore, even though an environment may physically contain only objects, it cannot be modeled well without considering humans. In this paper, we model environments not only through objects, but also through ..."
Abstract
-
Cited by 15 (4 self)
- Add to MetaCart
(Show Context)
Humans cast a substantial influence on their environments by interacting with them. Therefore, even though an environment may physically contain only objects, it cannot be modeled well without considering humans. In this paper, we model environments not only through objects, but also through latent human poses and human-object interactions. However, the number of potential human poses is large and unknown, and the human-object interactions vary not only in type but also in which human pose relates to each object. In order to handle such properties, we present Infinite Latent Conditional Random Fields (ILCRFs) that model a scene as a mixture of CRFs generated from Dirichlet processes. Each CRF represents one possible explanation of the scene. In addition to visible object nodes and edges, it generatively models the distribution of different CRF structures over the latent human nodes and corresponding edges. We apply the model to the challenging application of robotic scene arrangement. In extensive experiments, we show that our model significantly outperforms the state-of-the-art results. We further use our algorithm on a robot for placing objects in a new scene.
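The "mixture of scene explanations" idea can be sketched with a toy Monte Carlo average: each sampled structure places a latent human and scores human-object edges, and the scene score averages over structures. The potentials below are illustrative stand-ins, not the paper's learned ILCRF potentials.

```python
# Scene score as an average over sampled latent-human structures (sketch).
import math, random

random.seed(0)
objects = {"monitor": (0.0, 0.6), "keyboard": (0.0, 0.3)}

def edge_potential(human_xy, obj_xy):
    """Toy pairwise potential: objects prefer to sit near a human."""
    return math.exp(-math.dist(human_xy, obj_xy))

def sample_structure():
    """One candidate explanation: a random latent human placement."""
    return (random.uniform(-1, 1), random.uniform(-1, 1))

def scene_score(n_samples=100):
    total = 0.0
    for _ in range(n_samples):
        h = sample_structure()
        total += sum(edge_potential(h, xy) for xy in objects.values())
    return total / n_samples    # Monte Carlo average over structures

print(scene_score())
```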
Box in the box: Joint 3D layout and object reasoning from single images
2013
"... In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. To-wards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking into account o ..."
Abstract
-
Cited by 11 (3 self)
- Add to MetaCart
(Show Context)
In this paper we propose an approach to jointly infer the room layout as well as the objects present in the scene. Towards this goal, we propose a branch and bound algorithm which is guaranteed to retrieve the global optimum of the joint problem. The main difficulty resides in taking occlusion into account in order not to over-count the evidence. We introduce a new decomposition method, which generalizes integral geometry to triangular shapes, and allows us to bound the different terms in constant time. We exploit both geometric cues and object detectors as image features and show large improvements in 2D and 3D object detection over state-of-the-art deformable part-based models.
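The guarantee comes from the standard branch-and-bound loop: bound, split, prune. The sketch below shows that loop on a 1D hypothesis interval; the paper's actual bounds come from the integral-geometry decomposition over joint layout+object energies, for which the toy `upper_bound` is only a stand-in.

```python
# Best-first branch and bound over an interval (generic sketch).
import heapq

def score(x):
    """Toy objective: peaks at x = 0.37 (stands in for a layout energy)."""
    return -(x - 0.37) ** 2

def upper_bound(lo, hi):
    """Admissible bound: best score any x in [lo, hi] could achieve."""
    best_x = min(max(0.37, lo), hi)   # closest point in the interval to the peak
    return score(best_x)

def branch_and_bound(lo=0.0, hi=1.0, tol=1e-6):
    heap = [(-upper_bound(lo, hi), lo, hi)]   # max-heap via negation
    while heap:
        _neg_ub, lo, hi = heapq.heappop(heap)
        if hi - lo < tol:                     # interval narrow enough:
            return (lo + hi) / 2              # bound meets the optimum
        mid = (lo + hi) / 2                   # branch: split the interval
        for a, b in ((lo, mid), (mid, hi)):
            heapq.heappush(heap, (-upper_bound(a, b), a, b))

print(branch_and_bound())   # ~0.37, returned with a certificate of optimality
```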
Functional Object Descriptors for Human Activity Modeling
"... Abstract — The ability to learn from human demonstration is essential for robots in human environments. The activity models that the robot builds from observation must take both the human motion and the objects involved into account. Object models designed for this purpose should reflect the role of ..."
Abstract
-
Cited by 9 (1 self)
- Add to MetaCart
(Show Context)
The ability to learn from human demonstration is essential for robots in human environments. The activity models that the robot builds from observation must take both the human motion and the objects involved into account. Object models designed for this purpose should reflect the role of the object in the activity – its function, or affordances. The main contribution of this paper is to represent objects directly in terms of their interaction with human hands, rather than in terms of appearance. This enables the direct representation of object affordances/function, while being robust to intraclass differences in appearance. Object hypotheses are first extracted from a video sequence as tracks of associated image segments. The object hypotheses are encoded as strings, where the vocabulary corresponds to different types of interaction with human hands. The similarity between two such object descriptors can be measured using a string kernel. Experiments show these functional descriptors to capture differences and similarities in object affordances/function that are not represented by appearance.
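To make the string-descriptor comparison concrete, here is a p-spectrum kernel (counts of shared length-p substrings), one common choice of string kernel; the paper does not specify this exact kernel, and the single-letter interaction vocabulary below is hypothetical.

```python
# p-spectrum string kernel over "interaction strings" (sketch).
from collections import Counter

def spectrum(s, p):
    """Counts of all contiguous substrings of length p."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def string_kernel(s, t, p=2):
    """Inner product of the two p-spectrum feature vectors."""
    cs, ct = spectrum(s, p), spectrum(t, p)
    return sum(cs[k] * ct[k] for k in cs)

# 'g' = grasp, 'm' = move, 'r' = release, 'i' = idle (illustrative labels).
cup    = "igmmri"
mug    = "igmri"
hammer = "igmgmgmr"
print(string_kernel(cup, mug), string_kernel(cup, hammer))  # mug is more cup-like
```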
Shape2Pose: Human-Centric Shape Analysis
"... As 3D acquisition devices and modeling tools become widely avail-able there is a growing need for automatic algorithms that analyze the semantics and functionality of digitized shapes. Most recent research has focused on analyzing geometric structures of shapes. Our work is motivated by the observat ..."
Abstract
-
Cited by 9 (3 self)
- Add to MetaCart
As 3D acquisition devices and modeling tools become widely available there is a growing need for automatic algorithms that analyze the semantics and functionality of digitized shapes. Most recent research has focused on analyzing geometric structures of shapes. Our work is motivated by the observation that a majority of man-made shapes are designed to be used by people. Thus, in order to fully understand their semantics, one needs to answer a fundamental question: “how do people interact with these objects?” As an initial step towards this goal, we offer a novel algorithm for automatically predicting a static pose that a person would need to adopt in order to use an object. Specifically, given an input 3D shape, the goal of our analysis is to predict a corresponding human pose, including contact points and kinematic parameters. This is especially challenging for man-made objects that commonly exhibit a lot of variance in their geometric structure. We address this challenge by observing that contact points usually share consistent local geometric features related to the anthropometric properties of corresponding parts and that the human body is subject to kinematic constraints and priors. Accordingly, our method effectively combines local region classification and global kinematically-constrained search to successfully predict poses for various objects. We also evaluate our algorithm on six diverse collections of 3D polygonal models (chairs, gym equipment, cockpits, carts, bicycles, and bipedal devices) containing a total of 147 models. Finally, we demonstrate that the poses predicted by our algorithm can be used in several shape analysis problems, such as establishing correspondences between objects, detecting salient regions, finding informative viewpoints, and retrieving functionally-similar shapes.
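The two ingredients named in the abstract (local contact-region scores plus a global kinematic prior) can be sketched as a combined objective searched over pose candidates. Everything below, including the pose format, the seat-height heuristic, and the random search, is an illustrative assumption rather than the authors' method.

```python
# Local contact scores + kinematic prior, combined in a pose search (sketch).
import random
random.seed(1)

surface_points = [(random.random(), random.random(), random.random())
                  for _ in range(200)]

def contact_score(point):
    """Toy stand-in for a learned local-geometry contact classifier."""
    x, y, z = point
    return max(0.0, 1.0 - abs(z - 0.45) * 4)   # favours seat-height points

def kinematic_prior(pose):
    """Toy prior: penalize extreme joint angles."""
    return -sum(a * a for a in pose["joint_angles"])

def pose_score(pose):
    contacts = sum(contact_score(p) for p in pose["contact_points"])
    return contacts + kinematic_prior(pose)

candidates = [{"contact_points": random.sample(surface_points, 4),
               "joint_angles": [random.uniform(-1, 1) for _ in range(6)]}
              for _ in range(500)]
best = max(candidates, key=pose_score)
print(round(pose_score(best), 3))
```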
RoboBrain: Large-scale knowledge engine for robots
2014
"... Abstract-In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, ro ..."
Abstract
-
Cited by 6 (6 self)
- Add to MetaCart
(Show Context)
In this paper we introduce a knowledge engine, which learns and shares knowledge representations, for robots to carry out a variety of tasks. Building such an engine brings with it the challenge of dealing with multiple data modalities including symbols, natural language, haptic senses, robot trajectories, visual features and many others. The knowledge stored in the engine comes from multiple sources including physical interactions that robots have while performing tasks (perception, planning and control), knowledge bases from the WWW, and learned representations from leading robotics research groups. We discuss various technical aspects and associated challenges such as modeling the correctness of knowledge, inferring latent information and formulating different robotic tasks as queries to the knowledge engine. We describe the system architecture and how it supports different mechanisms for users and robots to interact with the engine. Finally, we demonstrate its use in three important research areas: grounding natural language, perception, and planning, which are the key building blocks for many robotic tasks. This knowledge engine is a collaborative effort and we call it RoboBrain.
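The phrase "formulating different robotic tasks as queries" can be illustrated with a toy triple store and a wildcard matcher; the triples and the `query` helper are invented for illustration and are not RoboBrain's actual schema or query language.

```python
# Toy knowledge graph with tasks posed as triple-pattern queries (sketch).
triples = {
    ("cup", "has_affordance", "pourable"),
    ("cup", "stored_in", "cupboard"),
    ("keyboard", "spatially_near", "monitor"),
    ("knife", "has_affordance", "cuttable_with"),
}

def query(subject=None, relation=None, obj=None):
    """Match triples against an (optionally wildcarded) pattern."""
    return [(s, r, o) for (s, r, o) in triples
            if subject in (None, s)
            and relation in (None, r)
            and obj in (None, o)]

# "Where should the robot look for a cup?"
print(query(subject="cup", relation="stored_in"))
# "Which objects afford pouring?"
print(query(relation="has_affordance", obj="pourable"))
```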
Efficient Structured Parsing of Façades Using Dynamic Programming
"... We propose a sequential optimization technique for seg-menting a rectified image of a façade into semantic cate-gories. Our method retrieves a parsing which respects com-mon architectural constraints and also returns a certificate for global optimality. Contrasting the suggested method, the conside ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
We propose a sequential optimization technique for segmenting a rectified image of a façade into semantic categories. Our method retrieves a parsing which respects common architectural constraints and also returns a certificate for global optimality. In contrast, the façade labeling problem is typically tackled as a classification task or as grammar parsing, and neither approach is capable of fully exploiting the regularity of the problem. Our technique therefore very significantly improves the accuracy compared to the state-of-the-art while being an order of magnitude faster. In addition, in 85% of the test images we obtain a certificate for optimality.
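A small dynamic program conveys the flavour of sequential parsing under architectural order constraints: assign image rows to a fixed top-to-bottom sequence of layers so the summed per-row class scores are globally maximal. The layer set and the random scores are illustrative, not the paper's actual energy.

```python
# DP over rows with a monotone layer-order constraint (sketch).
import numpy as np

rng = np.random.default_rng(0)
n_rows, layers = 30, ["sky", "wall", "shop"]
scores = rng.random((n_rows, len(layers)))   # stand-in classifier scores

# dp[i][l] = best total score for the first i rows with row i-1 labeled l.
# The ordering constraint is built in: a row may keep the previous row's
# layer or advance to the next one (a layer may end up empty).
dp = np.full((n_rows + 1, len(layers)), -np.inf)
dp[0][0] = 0.0
for i in range(n_rows):
    for l in range(len(layers)):
        stay = dp[i][l]
        advance = dp[i][l - 1] if l > 0 else -np.inf
        dp[i + 1][l] = max(stay, advance) + scores[i][l]

print("optimal total score:", dp[n_rows].max())
```

Because the DP enumerates all monotone labelings exactly, the returned value is the global optimum, which is the sense in which such a parser can certify optimality.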
Discovering Object Functionality
"... Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recogni-tion by humans. In computer vision, most previous work on functionalit ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
Object functionality refers to the quality of an object that allows humans to perform some specific actions. It has been shown in psychology that functionality (affordance) is at least as essential as appearance in object recognition by humans. In computer vision, most previous work on functionality either assumes exactly one functionality for each object, or requires detailed annotation of human poses and objects. In this paper, we propose a weakly supervised approach to discover all possible object functionalities. Each object functionality is represented by a specific type of human-object interaction. Our method takes any possible human-object interaction into consideration, and evaluates image similarity in 3D rather than 2D in order to cluster human-object interactions more coherently. Experimental results on a dataset of people interacting with musical instruments show the effectiveness of our approach.
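The clustering step can be sketched as grouping interactions by 3D relative-geometry features: below, each interaction is a hand-to-object offset and a plain k-means separates two interaction types. The features, the two synthetic "instrument" blobs, and the deterministic initialization are illustrative, not the paper's pipeline.

```python
# Clustering human-object interactions by 3D relative geometry (sketch).
import numpy as np

rng = np.random.default_rng(0)

# Each row: (hand-to-object offset x, y, z) for one observed interaction.
strumming = rng.normal([0.3, -0.1, 0.0], 0.02, size=(20, 3))
blowing   = rng.normal([0.0,  0.2, 0.1], 0.02, size=(20, 3))
X = np.vstack([strumming, blowing])

def kmeans(X, init_idx, iters=20):
    """Plain k-means with fixed initial centers (kept simple for the demo)."""
    centers = X[init_idx].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        centers = np.array([X[labels == j].mean(axis=0)
                            for j in range(len(centers))])
    return labels

print(kmeans(X, init_idx=[0, -1]))   # two clusters ~ two functionalities
```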
Physically Grounded Spatio-Temporal Object Affordances
"... Abstract. Objects in human environments support various functional-ities which govern how people interact with their environments in order to perform tasks. In this work, we discuss how to represent and learn a functional understanding of an environment in terms of object affor-dances. Such an under ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
(Show Context)
Abstract. Objects in human environments support various functional-ities which govern how people interact with their environments in order to perform tasks. In this work, we discuss how to represent and learn a functional understanding of an environment in terms of object affor-dances. Such an understanding is useful for many applications such as activity detection and assistive robotics. Starting with a semantic notion of affordances, we present a generative model that takes a given envi-ronment and human intention into account, and grounds the affordances in the form of spatial locations on the object and temporal trajectories in the 3D environment. The probabilistic model also allows uncertain-ties and variations in the grounded affordances. We apply our approach on RGB-D videos from Cornell Activity Dataset, where we first show that we can successfully ground the affordances, and we then show that learning such affordances improves performance in the labeling tasks.