Results 1 - 10 of 21
People Watching -- Human Actions as a Cue for Single View Geometry
"... We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene un-derstanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These c ..."
Abstract
-
Cited by 19 (4 self)
- Add to MetaCart
We present an approach which exploits the coupling between human actions and scene geometry to use human pose as a cue for single-view 3D scene understanding. Our method builds upon recent advances in still-image pose estimation to extract functional and geometric constraints on the scene. These constraints are then used to improve single-view 3D scene understanding approaches. The proposed method is validated on monocular time-lapse sequences from YouTube and still images of indoor scenes gathered from the Internet. We demonstrate that observing people performing different actions can significantly improve estimates of 3D scene geometry.
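A minimal sketch of one such geometric constraint (not the authors' code; the height constant and the pose format are illustrative): standing people act as vertical rulers, and the classic single-view cross-ratio turns their foot and head image rows into votes for the camera height above the floor.

```python
import numpy as np

PERSON_HEIGHT_M = 1.7  # assumed average standing height (illustrative)

def camera_height_votes(poses, horizon_y):
    """Each detected standing person votes for the camera height.

    poses: list of (foot_y, head_y) image rows (y grows downward,
    so foot_y > head_y for an upright person).
    """
    votes = []
    for foot_y, head_y in poses:
        if foot_y <= horizon_y:          # feet above the horizon: skip
            continue
        # single-view cross-ratio:
        #   h_cam / h_person = (foot_y - horizon_y) / (foot_y - head_y)
        votes.append(PERSON_HEIGHT_M * (foot_y - horizon_y) / (foot_y - head_y))
    return votes

poses = [(420.0, 250.0), (390.0, 240.0), (450.0, 260.0)]
print(np.median(camera_height_votes(poses, horizon_y=200.0)))
```

With several people in view, the median vote gives a robust camera-height estimate, which in turn constrains the floor plane of the scene.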
3D Scene Understanding by Voxel-CRF
"... Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certain ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model, which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. This model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel, even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
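As a rough illustration of the model class (the paper's actual potentials and inference are more sophisticated), the sketch below builds a toy voxel CRF with per-voxel label costs and a Potts smoothness term over the 6-neighbourhood, and runs plain ICM for inference.

```python
# Toy voxel CRF (assumed form, not the paper's exact model): each voxel
# gets a label; unaries stand in for depth/appearance evidence and a
# Potts pairwise term encourages neighbouring voxels to agree.
import numpy as np

def icm_voxel_crf(unary, pairwise_w=0.5, iters=5):
    """unary: (X, Y, Z, L) array of per-voxel label costs."""
    X, Y, Z, L = unary.shape
    labels = unary.argmin(axis=-1)                      # independent init
    offsets = [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]
    for _ in range(iters):
        for x in range(X):
            for y in range(Y):
                for z in range(Z):
                    cost = unary[x, y, z].copy()
                    for dx, dy, dz in offsets:          # 6-neighbourhood
                        nx, ny, nz = x+dx, y+dy, z+dz
                        if 0 <= nx < X and 0 <= ny < Y and 0 <= nz < Z:
                            # Potts: pay pairwise_w for disagreeing labels
                            cost += pairwise_w * (np.arange(L) != labels[nx, ny, nz])
                    labels[x, y, z] = cost.argmin()
    return labels

rng = np.random.default_rng(0)
labels = icm_voxel_crf(rng.random((8, 8, 8, 3)))        # 3 semantic labels
print(labels.shape, np.bincount(labels.ravel()))
```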
Unfolding an Indoor Origami World
"... Abstract. In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporat-ing these and other ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporating these and other constraints. Our method takes a variety of cues and uses them to infer a consistent interpretation of the scene. We demonstrate improvements over the state of the art and produce interpretations of the scene that link large planar surfaces.
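A toy rendition of such a constraint-scoring framework (the orientations, edge potential, and weights are invented for illustration): each surface is assigned a canonical orientation, fold edges between surfaces demand non-parallel normals, and we brute-force the highest-scoring interpretation.

```python
# Illustrative constraint scoring; the paper's potentials are richer.
import itertools
import numpy as np

NORMALS = {                      # canonical indoor surface orientations
    "left":    np.array([ 1.0, 0.0,  0.0]),
    "right":   np.array([-1.0, 0.0,  0.0]),
    "floor":   np.array([ 0.0, 1.0,  0.0]),
    "frontal": np.array([ 0.0, 0.0, -1.0]),
}

def interpretation_score(assign, edges):
    """assign: orientation name per surface; edges: (i, j, kind) tuples."""
    score = 0.0
    for i, j, kind in edges:
        ni, nj = NORMALS[assign[i]], NORMALS[assign[j]]
        # A fold edge of either kind is only consistent with non-parallel
        # surfaces; concave folds get a lower weight (toy choice).
        fold = np.linalg.norm(np.cross(ni, nj))
        score += fold if kind == "convex" else 0.5 * fold
    return score

edges = [(0, 1, "convex"), (1, 2, "concave")]
best = max(itertools.product(NORMALS, repeat=3),
           key=lambda a: interpretation_score(a, edges))
print(best, interpretation_score(best, edges))
```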
FPM: Fine Pose Parts-based Model with 3D CAD Models
"... Abstract. We introduce a novel approach to the problem of localizing objects in an image and estimating their fine-pose. Given exact CAD models, and a few real training images with aligned models, we propose to leverage the geometric information from CAD models and appearance information from real i ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
We introduce a novel approach to the problem of localizing objects in an image and estimating their fine pose. Given exact CAD models, and a few real training images with aligned models, we propose to leverage the geometric information from CAD models and appearance information from real images to learn a model that can accurately estimate fine pose in real images. Specifically, we propose FPM, a fine pose parts-based model, that combines geometric information in the form of shared 3D parts in deformable part-based models, and appearance information in the form of objectness, to achieve both fast and accurate fine pose estimation. Our method significantly outperforms current state-of-the-art algorithms in both accuracy and speed.
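A hedged sketch of the scoring structure this suggests (assumed form with illustrative weights, not the released FPM model): a fine-pose hypothesis combines part appearance, a deformation penalty between projected CAD parts and detected parts, and an objectness term.

```python
# Assumed-shape hypothesis score; all weights are illustrative.
import numpy as np

def score_pose(part_appearance, projected_xy, detected_xy, objectness,
               w_app=1.0, w_def=0.1, w_obj=0.5):
    """part_appearance: per-part filter responses; *_xy: (P, 2) arrays of
    projected CAD part locations vs. detected part locations."""
    deformation = np.sum((projected_xy - detected_xy) ** 2)
    return (w_app * np.sum(part_appearance)
            - w_def * deformation
            + w_obj * objectness)

proj = np.array([[10.0, 12.0], [30.0, 40.0]])
det  = np.array([[11.0, 13.0], [28.0, 41.0]])
print(score_pose(np.array([0.8, 0.6]), proj, det, objectness=0.7))
```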
3DNN: Viewpoint Invariant 3D Geometry Matching for Scene Understanding
"... We present a new algorithm 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D mode ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
We present a new algorithm, 3DNN (3D Nearest-Neighbor), which is capable of matching an image with 3D data, independently of the viewpoint from which the image was captured. By leveraging rich annotations associated with each image, our algorithm can automatically produce precise and detailed 3D models of a scene from a single image. Moreover, we can transfer information across images to accurately label and segment objects in a scene. The true benefit of 3DNN compared to a traditional 2D nearest-neighbor approach is that by generalizing across viewpoints, we free ourselves from the need to have training examples captured from all possible viewpoints. Thus, we are able to achieve comparable results using orders of magnitude less data, and recognize objects from never-before-seen viewpoints. In this work, we describe the 3DNN algorithm and rigorously evaluate its performance for the tasks of geometry estimation and object detection/segmentation. By decoupling the viewpoint and the geometry of an image, we develop a scene matching approach which is truly 100% viewpoint invariant, yielding state-of-the-art performance on challenging data.
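A conceptual sketch of the decoupling (the renderer below is a stand-in you would supply): estimate the query's camera, render every candidate 3D scene from that same camera, and compare geometry rather than 2D appearance, which makes the comparison itself viewpoint invariant.

```python
# Assumed pipeline, not the authors' code; render_geometry() is a
# placeholder for a real geometric renderer (e.g. a surface-normal map).
import numpy as np

def render_geometry(scene, camera):
    """Placeholder renderer: fakes a geometry map with a seeded RNG so
    the demo is deterministic within one run."""
    rng = np.random.default_rng(abs(hash((scene, camera))) % (2**32))
    return rng.random((60, 80))

def match_3dnn(query_geom, scenes, camera):
    # Because every candidate is rendered under the *query's* viewpoint,
    # the distance computation never compares across viewpoints.
    dists = [np.mean((render_geometry(s, camera) - query_geom) ** 2)
             for s in scenes]
    return scenes[int(np.argmin(dists))]

camera = ("azimuth=30", "elevation=10")
query = render_geometry("bedroom_12", camera)   # pretend query geometry
print(match_3dnn(query, ["bedroom_12", "office_3", "kitchen_7"], camera))
```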
Looking Beyond the Visible Scene
"... A common thread that ties together many prior works in scene understanding is their focus on the aspects directly present in a scene such as its categorical classification or the set of objects. In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is n ..."
Abstract
-
Cited by 5 (0 self)
- Add to MetaCart
(Show Context)
A common thread that ties together many prior works in scene understanding is their focus on the aspects directly present in a scene, such as its categorical classification or the set of objects. In this work, we propose to look beyond the visible elements of a scene; we demonstrate that a scene is not just a collection of objects and their configuration or the labels assigned to its pixels; it is so much more. From a simple observation of a scene, we can tell a lot about the environment surrounding the scene, such as the potential establishments near it, the potential crime rate in the area, or even the economic climate. Here, we explore several of these aspects from both the human perception and computer vision perspective. Specifically, we show that it is possible to predict the distance of surrounding establishments such as McDonald’s or hospitals even by using scenes located far from them. We go a step further to show that both humans and computers perform well at navigating the environment based only on visual cues from scenes. Lastly, we show that it is possible to predict the crime rates in an area simply by looking at a scene without any real-time criminal activity. Simply put, here, we illustrate that it is possible to look beyond the visible scene.
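A minimal stand-in for the prediction setup (features, labels, and the ridge regressor are all illustrative; the paper's features and learners may differ): regress a non-visible attribute, such as log-distance to the nearest establishment, from pooled scene descriptors.

```python
# Synthetic demo of "features -> non-visible attribute" regression.
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((200, 64))                     # stand-in scene descriptors
w_true = rng.normal(size=64)
y = X @ w_true + 0.1 * rng.normal(size=200)   # stand-in log-distances

lam = 1.0                                     # ridge strength
# Closed-form ridge regression: (X'X + lam I) w = X'y
w = np.linalg.solve(X.T @ X + lam * np.eye(64), X.T @ y)
print("train RMSE:", np.sqrt(np.mean((X @ w - y) ** 2)))
```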
Creating Consistent Scene Graphs Using a Probabilistic Grammar
"... Figure 1: Our algorithm processes raw scene graphs with possible over-segmentation (a), obtained from repositories such as the Trimble Warehouse, into consistent hierarchies capturing semantic and functional groups (b,c). The hierarchies are inferred by parsing the scene geometry with a probabilisti ..."
Abstract
-
Cited by 4 (3 self)
- Add to MetaCart
Figure 1: Our algorithm processes raw scene graphs with possible over-segmentation (a), obtained from repositories such as the Trimble Warehouse, into consistent hierarchies capturing semantic and functional groups (b, c). The hierarchies are inferred by parsing the scene geometry with a probabilistic grammar learned from a set of annotated examples. Apart from generating meaningful groupings at multiple scales, our algorithm also produces object labels with higher accuracy compared to alternative approaches.

Growing numbers of 3D scenes in online repositories provide new opportunities for data-driven scene understanding, editing, and synthesis. Despite the plethora of data now available online, most of it cannot be effectively used for data-driven applications because it lacks consistent segmentations, category labels, and/or functional groupings required for co-analysis. In this paper, we develop algorithms that infer such information via parsing with a probabilistic grammar learned from examples. First, given a collection of scene graphs with consistent hierarchies and labels, we train a probabilistic hierarchical grammar to represent the distributions of shapes …
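The parsing step can be pictured with a toy PCFG-style grammar (the rule set and probabilities below are invented; the real grammar is learned from annotated scene graphs): a hierarchy's log-probability is the sum of log rule probabilities over its internal nodes.

```python
# Toy probabilistic grammar over scene-graph hierarchies.
import math

# P(label -> tuple-of-child-labels); illustrative values only.
RULES = {
    ("room",          ("sleeping_area", "storage")): 0.6,
    ("room",          ("sleeping_area",)):           0.4,
    ("sleeping_area", ("bed", "nightstand")):        0.7,
    ("sleeping_area", ("bed",)):                     0.3,
    ("storage",       ("wardrobe",)):                1.0,
}

def log_prob(node):
    """node: (label, [children]) for internal nodes, plain str for leaves."""
    if isinstance(node, str):
        return 0.0                              # leaves carry no rule cost
    label, children = node
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULES.get((label, child_labels), 1e-9)  # unseen rule -> tiny prob
    return math.log(p) + sum(log_prob(c) for c in children)

parse = ("room", [("sleeping_area", ["bed", "nightstand"]),
                  ("storage", ["wardrobe"])])
print(log_prob(parse))   # log(0.6) + log(0.7) + log(1.0)
```

Choosing among candidate hierarchies then amounts to picking the parse with the highest log-probability.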
PanoContext: A Whole-room 3D Context Model for Panoramic Scene Understanding
"... Abstract. The field-of-view of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360◦ full-view panoramas in scene understanding, and propose a whole- ..."
Abstract
-
Cited by 4 (1 self)
- Add to MetaCart
The field-of-view of standard cameras is very small, which is one of the main reasons that contextual information is not as useful as it should be for object detection. To overcome this limitation, we advocate the use of 360° full-view panoramas in scene understanding, and propose a whole-room context model in 3D. For an input panorama, our method outputs 3D bounding boxes of the room and all major objects inside, together with their semantic categories. Our method generates 3D hypotheses based on contextual constraints and ranks the hypotheses holistically, combining both bottom-up and top-down context information. To train our model, we construct an annotated panorama dataset and reconstruct the 3D model from a single view using manual annotation. Experiments show that, based solely on 3D context without any image-based object detector, we can achieve performance comparable to a state-of-the-art object detector. This demonstrates that when the FOV is large, context is as powerful as object appearance. All data and source code are available online.
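A skeletal generate-and-rank loop in the spirit of the abstract (the sampler and scorer are toy stand-ins for the paper's learned context terms): sample whole-room hypotheses, score each holistically, keep the best.

```python
# Illustrative generate-and-rank; real hypotheses come from contextual
# constraints and the score from learned bottom-up/top-down context.
import numpy as np

rng = np.random.default_rng(1)

def sample_hypothesis():
    # A "hypothesis" = room footprint + a few object centres (toy).
    room = rng.uniform(3.0, 6.0, size=2)            # room width, depth (m)
    objects = rng.uniform(0.0, 1.0, size=(3, 2)) * room
    return room, objects

def holistic_score(room, objects):
    # Toy context term: indoor furniture tends to hug the walls, so
    # reward hypotheses whose objects sit near a room boundary.
    wall_dist = np.minimum(objects, room - objects).min(axis=1)
    return -wall_dist.sum()

hypotheses = [sample_hypothesis() for _ in range(500)]
best = max(hypotheses, key=lambda h: holistic_score(*h))
print("best room size:", best[0])
```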
Layout Estimation of Highly Cluttered Indoor Scenes Using Geometric and Semantic Cues
"... Abstract. Recovering the spatial layout of cluttered indoor scenes is a challenging problem. Current methods generate layout hypotheses from vanishing point estimates produced using 2D image features. This method fails in highly cluttered scenes in which most of the image features come from clutter ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
(Show Context)
Recovering the spatial layout of cluttered indoor scenes is a challenging problem. Current methods generate layout hypotheses from vanishing point estimates produced using 2D image features. These methods fail in highly cluttered scenes in which most of the image features come from clutter instead of the room’s geometric structure. In this paper, we propose to use human detections as cues to more accurately estimate the vanishing points. Our method builds on the fact that people are often the focus of indoor scenes, and that the scene and the people within it should have consistent geometric configurations in 3D space. We contribute a new dataset of highly cluttered indoor scenes containing people, on which we provide baselines and evaluate our method. This evaluation shows that our approach improves 3D interpretation of scenes.
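The core cue can be sketched directly (this formulation is assumed, not taken from the paper's code): standing people are vertical in 3D, so their foot-to-head image segments should all meet at the vertical vanishing point, recoverable as a least-squares intersection in homogeneous coordinates.

```python
# Vertical vanishing point from people detections (illustrative).
import numpy as np

def vertical_vp(foot_head_pairs):
    """foot_head_pairs: (N, 2, 2) array of (foot_xy, head_xy) per person."""
    lines = []
    for foot, head in foot_head_pairs:
        p1 = np.array([*foot, 1.0])
        p2 = np.array([*head, 1.0])
        lines.append(np.cross(p1, p2))     # homogeneous line through both
    # The VP v satisfies l . v = 0 for every line l: take the null
    # space of the stacked line matrix via SVD.
    _, _, vt = np.linalg.svd(np.array(lines))
    v = vt[-1]
    return v[:2] / v[2]

people = np.array([[[400., 600.], [405., 300.]],
                   [[800., 620.], [790., 310.]],
                   [[600., 580.], [600., 290.]]])
print("vertical VP:", vertical_vp(people))
```

The recovered vertical vanishing point then anchors the search for the remaining horizontal vanishing points and the room layout.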
Joint Embeddings of Shapes and Images via CNN Image Purification
ACM Trans. Graph., 2015
"... Both 3D models and 2D images contain a wealth of information about everyday objects in our environment. However, it is difficult to semantically link together these two media forms, even when they feature identical or very similar objects. We propose a joint embedding space populated by both 3D shap ..."
Abstract
-
Cited by 2 (0 self)
- Add to MetaCart
Both 3D models and 2D images contain a wealth of information about everyday objects in our environment. However, it is difficult to semantically link together these two media forms, even when they feature identical or very similar objects. We propose a joint embedding space populated by both 3D shapes and 2D images of objects, where the distances between embedded entities reflect similarity between the underlying objects. This joint embedding space facilitates comparison between entities of either form, and allows for cross-modality retrieval. We construct the embedding space using a 3D shape similarity measure, as 3D shapes are more pure and complete than their appearance in images, leading to more robust distance metrics. We then employ a Convolutional Neural Network (CNN) to “purify” images by muting distracting factors. The CNN is trained to map an image to a point in the embedding space, so that it is close to the point attributed to a 3D model of an object similar to the one depicted in the image. This purifying capability of the CNN is accomplished with the help of a large amount of training data consisting of images synthesized from 3D shapes. Our joint embedding allows cross-view image retrieval, image-based shape retrieval, as well as shape-based image retrieval. We evaluate our method on these retrieval tasks and show that it consistently outperforms state-of-the-art methods, and we demonstrate the usability of a joint embedding in a number of additional applications.
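One plausible concretization of the construction step (classical MDS on a shape-distance matrix; the paper's exact construction may differ), together with the regression target the purifying CNN would be trained against:

```python
# Classical MDS: turn pairwise shape distances into embedding points.
import numpy as np

def mds_embedding(D, dim=2):
    """Distance matrix D -> points whose Euclidean distances approximate D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                  # double-centred Gram matrix
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:dim]              # top eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Toy 4-shape distance matrix (two chairs, two tables).
D = np.array([[0.0, 0.2, 1.0, 1.1],
              [0.2, 0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0, 0.25],
              [1.1, 1.0, 0.25, 0.0]])
pts = mds_embedding(D)
print(np.round(pts, 2))
# The CNN would then be trained with an L2 loss ||f(image) - pts[shape]||^2
# so that images of a shape land near that shape's embedded point.
```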