Results 1 - 10 of 159
Perceptual Organization and Recognition of Indoor Scenes from RGB-D Images
Cited by 48 (3 self)
We address the problems of contour detection, bottom-up grouping and semantic segmentation using RGB-D data. We focus on the challenging setting of cluttered indoor scenes, and evaluate our approach on the recently introduced NYU-Depth V2 (NYUD2) dataset [27]. We propose algorithms for object boundary detection and hierarchical segmentation that generalize the gPb-ucm approach of [2] by making effective use of depth information. We show that our system can label each contour with its type (depth, normal or albedo). We also propose a generic method for long-range amodal completion of surfaces and show its effectiveness in grouping. We then turn to the problem of semantic segmentation and propose a simple approach that classifies superpixels into the 40 dominant object categories in NYUD2. We use both generic and class-specific features to encode the appearance and geometry of objects. We also show how our approach can be used for scene classification, and how this contextual information in turn improves object recognition. In all of these tasks, we report significant improvements over the state-of-the-art.
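To make the final superpixel-classification step concrete, here is a minimal Python sketch, with toy stand-in features (per-superpixel mean and standard deviation of intensity and depth) and a linear SVM; the paper's actual generic and class-specific features are much richer, so treat this only as the shape of the pipeline.

    # Toy sketch of superpixel labeling (not the paper's features or classifier):
    # describe each superpixel with simple appearance + depth statistics, then
    # classify it into one of the 40 dominant NYUD2 object categories.
    import numpy as np
    from sklearn.svm import LinearSVC

    def superpixel_features(img, depth, seg):
        """One feature row per superpixel: mean/std of intensity and depth."""
        rows = []
        for s in range(seg.max() + 1):
            m = seg == s
            rows.append([img[m].mean(), img[m].std(), depth[m].mean(), depth[m].std()])
        return np.asarray(rows)

    rng = np.random.default_rng(0)
    img = rng.random((240, 320))            # grayscale stand-in for the RGB image
    depth = rng.random((240, 320))          # metric depth map
    seg = rng.integers(0, 50, (240, 320))   # toy oversegmentation: 50 superpixels

    X = superpixel_features(img, depth, seg)
    y = rng.integers(0, 40, len(X))         # toy per-superpixel category labels
    clf = LinearSVC().fit(X, y)
    print(clf.predict(X[:5]))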
Microsoft COCO: Common Objects in Context
Cited by 43 (3 self)
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in precise object localization. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
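For readers who want to inspect the per-instance segmentations, a short example with pycocotools, the dataset's companion Python API; the annotation-file path below is illustrative and assumes the annotations have already been downloaded.

    # Load COCO annotations and turn one image's instance annotations into
    # binary masks. Requires: pip install pycocotools.
    from pycocotools.coco import COCO

    coco = COCO('annotations/instances_val2014.json')   # illustrative path
    cat_ids = coco.getCatIds(catNms=['person'])
    img_id = coco.getImgIds(catIds=cat_ids)[0]
    ann_ids = coco.getAnnIds(imgIds=img_id, catIds=cat_ids, iscrowd=None)
    for ann in coco.loadAnns(ann_ids):
        mask = coco.annToMask(ann)                      # per-instance binary mask
        print(ann['category_id'], int(mask.sum()), 'labeled pixels')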
Fully convolutional networks for semantic segmentation
, 2014
Cited by 37 (0 self)
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [17], the VGG net [28], and GoogLeNet [29]) into fully convolutional networks and transfer their learned representations by fine-tuning [2] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (a 20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
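A minimal PyTorch sketch of the two ideas the abstract names: an all-convolutional network accepts input of arbitrary size, and a skip connection fuses an upsampled deep, coarse score map with scores from a shallower, finer layer. Layer names, channel counts, and depths here are illustrative, not the paper's actual architecture.

    import torch
    import torch.nn as nn

    class TinyFCN(nn.Module):
        """Toy fully convolutional net with one coarse-to-fine skip fusion."""
        def __init__(self, num_classes=21):
            super().__init__()
            self.shallow = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                         nn.MaxPool2d(2))   # 1/2 resolution
            self.deep = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                      nn.MaxPool2d(2))      # 1/4 resolution
            self.score_shallow = nn.Conv2d(32, num_classes, 1)
            self.score_deep = nn.Conv2d(64, num_classes, 1)
            self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)
            self.up_final = nn.ConvTranspose2d(num_classes, num_classes, 2, stride=2)

        def forward(self, x):
            fine = self.shallow(x)
            coarse = self.deep(fine)
            # skip connection: fuse fine appearance with upsampled coarse semantics
            fused = self.score_shallow(fine) + self.up2(self.score_deep(coarse))
            return self.up_final(fused)                     # back to input size

    logits = TinyFCN()(torch.randn(1, 3, 240, 320))         # any input size works
    print(logits.shape)                                     # (1, 21, 240, 320)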
Enhanced computer vision with Microsoft Kinect sensor: A review
- IEEE Transactions on Cybernetics
, 2013
Cited by 31 (2 self)
With the invention of the low-cost Microsoft Kinect sensor, high-resolution depth and visual (RGB) sensing has become available for widespread use. The complementary nature of the depth and visual information provided by the Kinect sensor opens up new opportunities to solve fundamental problems in computer vision. This paper presents a comprehensive review of recent Kinect-based computer vision algorithms and applications. The reviewed approaches are classified according to the type of vision problems that can be addressed or enhanced by means of the Kinect sensor. The covered topics include preprocessing, object tracking and recognition, human activity analysis, hand gesture analysis, and indoor 3-D mapping. For each category of methods, we outline their main algorithmic contributions and summarize their advantages/differences compared to their RGB counterparts. Finally, we give an overview of the challenges in this field and future research trends. This paper is expected to serve as a tutorial and source of references for Kinect-based computer vision researchers.
Learning Rich Features from RGB-D Images for Object Detection and Segmentation: Supplementary Material
Cited by 31 (2 self)
In this subsection, we present precision-recall curves on the NYUD2 test set, comparing the output from our object detectors with that from RGB DPMs [1], ...
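As a generic reference for how such curves are computed (this is not the paper's evaluation code), a short numpy sketch: sort detections by confidence, sweep the threshold, and accumulate true and false positives against the total number of ground-truth instances.

    import numpy as np

    def pr_curve(scores, matched, num_gt):
        """scores: detector confidences; matched: 1 if the detection was matched
        to a ground-truth box, else 0; num_gt: total ground-truth instances."""
        order = np.argsort(-np.asarray(scores, dtype=float))
        hits = np.asarray(matched, dtype=float)[order]
        tp, fp = np.cumsum(hits), np.cumsum(1.0 - hits)
        return tp / (tp + fp), tp / num_gt                  # precision, recall

    p, r = pr_curve([0.9, 0.8, 0.6, 0.4], [1, 0, 1, 1], num_gt=5)
    print(np.round(p, 2), np.round(r, 2))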
Hallucinated Humans as the Hidden Context for Labeling 3D Scenes
Cited by 30 (15 self)
For scene understanding, one popular approach has been to model object-object relationships. In this paper, we hypothesize that such relationships are only an artifact of certain hidden factors, such as humans. For example, a monitor and a keyboard are strongly spatially correlated only because a human types on the keyboard while watching the monitor. Our goal is to learn this hidden human context (i.e., the human-object relationships), and also to use it as a cue for labeling scenes. We present the Infinite Factored Topic Model (IFTM), where we consider a scene as being generated from two types of topics: human configurations and human-object relationships. This enables our algorithm to parsimoniously hallucinate the possible configurations of humans in the scene. Given only a dataset of scenes containing objects but not humans, we show that our algorithm can recover the human-object relationships. We then test our algorithm on the task of attribute and object labeling in 3D scenes and show consistent improvements over the state-of-the-art.
Indoor semantic segmentation using depth information
- ICLR
Cited by 28 (3 self)
This work addresses multi-class segmentation of indoor scenes with RGB-D inputs. While this area of research has gained much attention recently, most works still rely on hand-crafted features. In contrast, we apply a multiscale convolutional network to learn features directly from the images and the depth information. We obtain state-of-the-art performance on the NYU-v2 depth dataset with an accuracy of 64.5%. We illustrate the labeling of indoor scenes in video sequences that could be processed in real-time using appropriate hardware such as an FPGA.
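A hedged sketch of the multiscale idea, assuming one shared convolutional feature extractor applied over an image pyramid, with each scale's feature map upsampled back to full resolution and concatenated so every pixel sees both local detail and wider context; the scales and channel counts are illustrative.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    # Shared feature extractor; 4 input channels for RGB-D (color + depth).
    features = nn.Sequential(nn.Conv2d(4, 16, 7, padding=3), nn.ReLU())

    def multiscale_features(x, scales=(1.0, 0.5, 0.25)):
        h, w = x.shape[-2:]
        maps = []
        for s in scales:
            xs = x if s == 1.0 else F.interpolate(x, scale_factor=s, mode='bilinear',
                                                  align_corners=False)
            f = features(xs)                    # same weights at every scale
            maps.append(F.interpolate(f, size=(h, w), mode='bilinear',
                                      align_corners=False))
        return torch.cat(maps, dim=1)           # (N, 16 * len(scales), H, W)

    out = multiscale_features(torch.randn(1, 4, 240, 320))
    print(out.shape)                            # (1, 48, 240, 320)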
Depth map prediction from a single image using a multi-scale deep network
- NIPS
Cited by 26 (2 self)
Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.
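The scale-invariant error referred to above is defined in the paper over log depths: with d_i = log y_i - log y_i*, the error is (1/n) sum_i d_i^2 - (lambda/n^2)(sum_i d_i)^2, where lambda = 1 makes it fully invariant to a global scale and the paper trains with lambda = 0.5. A small numpy sketch:

    import numpy as np

    def scale_invariant_error(pred, gt, lam=1.0):
        """Scale-invariant log-depth error; lam=1.0 ignores global scale."""
        d = np.log(pred) - np.log(gt)
        n = d.size
        return (d ** 2).mean() - lam * d.sum() ** 2 / n ** 2

    gt = np.random.rand(100) + 0.5
    pred = 2.0 * gt                              # off by a global scale factor
    print(scale_invariant_error(pred, gt))       # ~0: scale error is forgiven
    print(scale_invariant_error(pred, gt, 0.5))  # > 0 with partial invariance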
3D-based reasoning with blocks, support, and stability
- CVPR
, 2013
Cited by 23 (5 self)
3D volumetric reasoning is important for truly understanding a scene. Humans are able both to segment each object in an image and to perceive a rich 3D interpretation of the scene, e.g., the space an object occupies, which objects support other objects, and which objects would, if moved, cause other objects to fall. We propose a new approach for parsing RGB-D images using 3D block units for volumetric reasoning. The algorithm fits image segments with 3D blocks, and iteratively evaluates the scene based on block interaction properties. We produce a 3D representation of the scene based on jointly optimizing over segmentations, block fitting, supporting relations, and object stability. Our algorithm incorporates the intuition that a good 3D representation of the scene is one that fits the data well and is a stable, self-supporting (i.e., one that does not topple) arrangement of objects. We experiment on several datasets, including controlled and real indoor scenarios. Results show that our stability-reasoning framework improves RGB-D segmentation and scene volumetric representation.
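A one-dimensional toy illustration of the stability intuition (far simpler than the paper's joint optimization): a resting block is stable when the horizontal projection of its center of mass falls inside its support region.

    def is_stable(com_x, support):
        """com_x: x-coordinate of the block's center of mass;
        support: (x_min, x_max) extent of the contact region below it."""
        x_min, x_max = support
        return x_min <= com_x <= x_max

    print(is_stable(0.4, (0.0, 1.0)))   # True: center of mass over the support
    print(is_stable(1.3, (0.0, 1.0)))   # False: the block would topple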
SUN3D: A database of big spaces reconstructed using SfM and object labels
- ICCV
, 2013
Cited by 17 (8 self)
Existing scene understanding datasets contain only a limited set of views of a place, and they lack representations of complete 3D spaces. In this paper, we introduce SUN3D, a large-scale RGB-D video database with camera pose and object labels, capturing the full 3D extent of many places. The tasks that go into constructing such a dataset are difficult in isolation: hand-labeling videos is painstaking, and structure from motion (SfM) is unreliable for large spaces. But if we combine them together, we make the dataset construction task much easier. First, we introduce an intuitive labeling tool that uses a partial reconstruction to propagate labels from one frame to another. Then we use the object labels to fix errors in the reconstruction. For this, we introduce a generalization of bundle adjustment that incorporates object-to-object correspondences. This algorithm works by constraining points for the same object from different frames to lie inside a fixed-size bounding box, parameterized by its rotation and translation. The SUN3D database, the source code for the generalized bundle adjustment, and the web-based 3D annotation tool are all available at
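One way to read the bounding-box constraint as a least-squares residual; this is a sketch of my reading of the abstract, not the authors' released code. Points labeled as the same object in different frames are mapped into the box's local frame via its rotation R and translation t, and any excursion beyond the fixed half-extents is penalized.

    import numpy as np

    def box_residual(points, R, t, half_size):
        """points: (N, 3) triangulated points for one object across frames;
        R (3x3), t (3,): the box's pose; half_size: (3,) fixed half-extents."""
        local = (points - t) @ R                            # world -> box frame
        return np.maximum(np.abs(local) - half_size, 0.0)   # zero inside the box

    pts = np.array([[0.2, 0.1, 0.0], [1.5, 0.0, 0.0]])
    print(box_residual(pts, np.eye(3), np.zeros(3), np.ones(3)))
    # first point: zero residual; second sticks out 0.5 along x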