Results 1 - 10 of 48
Fully convolutional networks for semantic segmentation, 2014
"... Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolu-tional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmen-tation. Our key insight is to build “fully convolutional” networks that take ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
(Show Context)
Convolutional networks are powerful visual models that yield hierarchies of features. We show that convolutional networks by themselves, trained end-to-end, pixels-to-pixels, exceed the state-of-the-art in semantic segmentation. Our key insight is to build “fully convolutional” networks that take input of arbitrary size and produce correspondingly-sized output with efficient inference and learning. We define and detail the space of fully convolutional networks, explain their application to spatially dense prediction tasks, and draw connections to prior models. We adapt contemporary classification networks (AlexNet [17], the VGG net [28], and GoogLeNet [29]) into fully convolutional networks and transfer their learned representations by fine-tuning [2] to the segmentation task. We then define a novel architecture that combines semantic information from a deep, coarse layer with appearance information from a shallow, fine layer to produce accurate and detailed segmentations. Our fully convolutional network achieves state-of-the-art segmentation of PASCAL VOC (20% relative improvement to 62.2% mean IU on 2012), NYUDv2, and SIFT Flow, while inference takes one third of a second for a typical image.
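Read as a recipe, the key architectural move is simple: reuse a classification trunk, replace its classifier with 1x1 convolutions, and fuse a coarse deep prediction with a finer shallow one before upsampling. A minimal PyTorch sketch of that idea follows; it is not the authors' released code, and the torchvision VGG16 layer indices, channel counts, and class count are assumptions for illustration.

```python
# Minimal sketch (not the authors' code): turning a classification backbone into a
# fully convolutional segmenter with one coarse-to-fine skip connection.
import torch
import torch.nn as nn
from torchvision.models import vgg16

class ToyFCN(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        features = vgg16(weights=None).features
        self.to_pool4 = features[:24]   # conv blocks up to pool4 (stride 16), assumed indexing
        self.to_pool5 = features[24:]   # remaining block up to pool5 (stride 32)
        self.score5 = nn.Conv2d(512, num_classes, 1)   # coarse prediction from the deep layer
        self.score4 = nn.Conv2d(512, num_classes, 1)   # finer prediction from the shallow layer
        self.up2 = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up16 = nn.ConvTranspose2d(num_classes, num_classes, 32, stride=16, padding=8)

    def forward(self, x):
        p4 = self.to_pool4(x)                 # appearance detail (stride 16)
        p5 = self.to_pool5(p4)                # semantics (stride 32)
        fused = self.up2(self.score5(p5)) + self.score4(p4)  # skip connection fuses the two
        return self.up16(fused)               # upsample back to input resolution

logits = ToyFCN()(torch.randn(1, 3, 224, 224))  # -> (1, 21, 224, 224)
```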
Learning Rich Features from RGB-D Images for Object Detection and Segmentation: Supplementary Material
"... In this subsection, we present the Precision Recall curves on the NYUD2 test set, comparing the output from our object detectors with that from RGB DPMs [1], ..."
Abstract
-
Cited by 31 (2 self)
- Add to MetaCart
(Show Context)
In this subsection, we present the precision-recall curves on the NYUD2 test set, comparing the output from our object detectors with that from RGB DPMs [1], ...
Parsing IKEA Objects: Fine Pose Estimation
"... 3D Model Original Image Fine-pose Estimation We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidat ..."
Abstract
-
Cited by 14 (1 self)
- Add to MetaCart
(Show Context)
[Teaser figure: 3D model, original image, fine-pose estimation] We address the problem of localizing and estimating the fine-pose of objects in the image with exact 3D models. Our main focus is to unify contributions from the 1970s with recent advances in object detection: use local keypoint detectors to find candidate poses and score global alignment of each candidate pose to the image. Moreover, we also provide a new dataset containing fine-aligned objects with their exactly matched 3D models, and a set of models for widely used objects. We also evaluate our algorithm both on object detection and fine pose estimation, and show that our method outperforms state-of-the-art algorithms.
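The two-stage recipe in the abstract, propose candidate poses from local evidence and then score each by global alignment to the image, can be sketched as below. This is only an illustration of the scoring loop: the model points, candidate poses, edge map, and camera intrinsics are placeholders, not the paper's data or features.

```python
# Illustrative pose-scoring loop: project sparse 3D model points under each candidate
# pose and measure how much of the projection lands on image edges.
import numpy as np

rng = np.random.default_rng(0)
model_points = rng.uniform(-1, 1, size=(50, 3))            # stand-in points on a CAD model
edge_map = rng.random((480, 640)) < 0.05                   # stand-in for an image edge map

def project(points, pose, f=500.0, cx=320.0, cy=240.0):
    R, t = pose
    cam = points @ R.T + t                                  # rigid transform into camera frame
    return np.stack([f * cam[:, 0] / cam[:, 2] + cx,
                     f * cam[:, 1] / cam[:, 2] + cy], axis=1)

def alignment_score(pose):
    uv = np.round(project(model_points, pose)).astype(int)
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < 640) & (uv[:, 1] >= 0) & (uv[:, 1] < 480)
    uv = uv[inside]
    return edge_map[uv[:, 1], uv[:, 0]].mean() if len(uv) else 0.0

candidate_poses = [(np.eye(3), np.array([0.0, 0.0, 3.0 + i])) for i in range(5)]
best_pose = max(candidate_poses, key=alignment_score)       # keep the best-aligned candidate
```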
Sliding Shapes for 3D Object Detection in Depth Images
"... Abstract. The depth information of RGB-D sensors has greatly simplified some common challenges in computer vision and enabled breakthroughs for several tasks. In this paper, we propose to use depth maps for object detection and de-sign a 3D detector to overcome the major difficulties for recognition ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
(Show Context)
The depth information of RGB-D sensors has greatly simplified some common challenges in computer vision and enabled breakthroughs for several tasks. In this paper, we propose to use depth maps for object detection and design a 3D detector to overcome the major difficulties for recognition, namely the variations of texture, illumination, shape, viewpoint, clutter, occlusion, self-occlusion and sensor noise. We take a collection of 3D CAD models and render each CAD model from hundreds of viewpoints to obtain synthetic depth maps. For each depth rendering, we extract features from the 3D point cloud and train an Exemplar-SVM classifier. During testing and hard-negative mining, we slide a ...
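The training recipe described here, one linear classifier per rendered CAD exemplar scored against many negative 3D windows, boils down to an Exemplar-SVM. A hedged sketch using scikit-learn follows; the feature extractor is stubbed out with random vectors, and the feature dimension, C value, and class weights are assumptions rather than the paper's settings.

```python
# Sketch of one Exemplar-SVM: a single positive (features of one rendered CAD view)
# against many negative 3D windows, with a heavily weighted positive class.
import numpy as np
from sklearn.svm import LinearSVC

feature_dim = 512
rng = np.random.default_rng(0)

exemplar_feature = rng.normal(size=feature_dim)            # one rendered CAD view (stubbed)
negative_features = rng.normal(size=(2000, feature_dim))   # negative 3D windows (stubbed)

X = np.vstack([exemplar_feature[None, :], negative_features])
y = np.array([1] + [0] * len(negative_features))
svm = LinearSVC(C=0.1, class_weight={1: 50.0, 0: 0.01}).fit(X, y)

# "Sliding" at test time: score every candidate 3D window with the learned weights.
candidate_windows = rng.normal(size=(100, feature_dim))
scores = candidate_windows @ svm.coef_.ravel() + svm.intercept_[0]
print("best window:", scores.argmax(), "score:", scores.max())
```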
3D Scene Understanding by Voxel-CRF
"... Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certain ..."
Abstract
-
Cited by 13 (1 self)
- Add to MetaCart
(Show Context)
Scene understanding is an important yet very challenging problem in computer vision. In the past few years, researchers have taken advantage of the recent diffusion of depth-RGB (RGB-D) cameras to help simplify the problem of inferring scene semantics. However, while the added 3D geometry is certainly useful to segment out objects with different depth values, it also adds complications in that the 3D geometry is often incorrect because of noisy depth measurements and the actual 3D extent of the objects is usually unknown because of occlusions. In this paper we propose a new method that allows us to jointly refine the 3D reconstruction of the scene (raw depth values) while accurately segmenting out the objects or scene elements from the 3D reconstruction. This is achieved by introducing a new model which we call Voxel-CRF. The Voxel-CRF model is based on the idea of constructing a conditional random field over a 3D volume of interest which captures the semantic and 3D geometric relationships among different elements (voxels) of the scene. Such a model allows us to jointly estimate (1) a dense voxel-based 3D reconstruction and (2) the semantic labels associated with each voxel even in the presence of partial occlusions, using an approximate yet efficient inference strategy. We evaluated our method on the challenging NYU Depth dataset (Versions 1 and 2). Experimental results show that our method achieves competitive accuracy in inferring scene semantics and visually appealing results in improving the quality of the 3D reconstruction. We also demonstrate an interesting application of object removal and scene completion from RGB-D images.
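The core object is a conditional random field over voxels: a unary cost per voxel-label pair plus pairwise terms that couple neighbouring voxels. The toy sketch below illustrates that structure with a Potts pairwise term and ICM as a stand-in for the paper's approximate inference; the grid size, label count, and weights are made up.

```python
# Toy voxel-level CRF: unary costs per voxel/label plus a Potts smoothness term over
# the 6-connected neighbourhood, minimised with a few ICM sweeps.
import numpy as np

rng = np.random.default_rng(0)
L, D, H, W = 3, 8, 8, 8                   # labels and voxel grid dimensions (assumed)
unary = rng.random((D, H, W, L))          # stand-in negative log-probabilities
pairwise_weight = 0.5
labels = unary.argmin(axis=-1)            # initialise with the unary-only solution

def neighbours(z, y, x):
    for dz, dy, dx in [(1,0,0), (-1,0,0), (0,1,0), (0,-1,0), (0,0,1), (0,0,-1)]:
        nz, ny, nx = z + dz, y + dy, x + dx
        if 0 <= nz < D and 0 <= ny < H and 0 <= nx < W:
            yield nz, ny, nx

for _ in range(5):                        # a few ICM sweeps
    for z in range(D):
        for y in range(H):
            for x in range(W):
                costs = unary[z, y, x].copy()
                for nz, ny, nx in neighbours(z, y, x):
                    costs += pairwise_weight * (np.arange(L) != labels[nz, ny, nx])
                labels[z, y, x] = costs.argmin()
```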
A multi-world approach to question answering about real-world scenes based on uncertain input, 2014
"... We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world i ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
(Show Context)
We propose a method for automatically answering questions about images by bringing together recent advances from natural language processing and computer vision. We combine discrete reasoning with uncertain predictions by a multi-world approach that represents uncertainty about the perceived world in a Bayesian framework. Our approach can handle human questions of high complexity about realistic scenes and replies with a range of answers such as counts, object classes, instances, and lists of them. The system is directly trained from question-answer pairs. We establish a first benchmark for this task that can be seen as a modern attempt at a visual Turing test.
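The multi-world mechanism can be illustrated independently of the paper's actual perception and parsing pipeline: sample several plausible interpretations of the scene, answer the question deterministically in each, and aggregate the answers into a distribution. The sketch below does exactly that with a fabricated detection list and a counting question; none of the names or probabilities come from the paper.

```python
# Marginalising an answer over sampled "worlds": each world is one possible
# interpretation of an uncertain scene, and the answer distribution is estimated
# by answering the same question in many sampled worlds.
import collections
import random

random.seed(0)

def sample_world():
    # A "world" here is just a list of object classes whose presence is uncertain.
    detections = [("chair", 0.9), ("chair", 0.6), ("table", 0.8)]
    return [cls for cls, p in detections if random.random() < p]

def answer(world, question_class="chair"):
    return sum(obj == question_class for obj in world)   # deterministic reasoning per world

votes = collections.Counter(answer(sample_world()) for _ in range(1000))
total = sum(votes.values())
for count, n in sorted(votes.items()):
    print(f"P(answer={count}) ~ {n / total:.2f}")
```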
Hierarchical semantic labeling for task-relevant RGB-D perception. In: RSS (2014)
"... Abstract—Semantic labeling of RGB-D scenes is very impor-tant in enabling robots to perform mobile manipulation tasks, but different tasks may require entirely different sets of labels. For example, when navigating to an object, we may need only a single label denoting its class, but to manipulate i ..."
Abstract
-
Cited by 10 (7 self)
- Add to MetaCart
(Show Context)
Semantic labeling of RGB-D scenes is very important in enabling robots to perform mobile manipulation tasks, but different tasks may require entirely different sets of labels. For example, when navigating to an object, we may need only a single label denoting its class, but to manipulate it, we might need to identify individual parts. In this work, we present an algorithm that produces hierarchical labelings of a scene, following is-part-of and is-type-of relationships. Our model is based on a Conditional Random Field that relates pixel-wise and pair-wise observations to labels. We encode hierarchical labeling constraints into the model while keeping inference tractable. Our model thus predicts different specificities in labeling based on its confidence: if it is not sure whether an object is Pepsi or Sprite, it will predict soda rather than making an arbitrary choice. In extensive experiments, both offline on standard datasets as well as in online robotic experiments, we show that our model outperforms other state-of-the-art methods in labeling performance as well as in success rate for robotic tasks.
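The behaviour described at the end, backing off from Pepsi/Sprite to soda when the model is unsure, can be mimicked without the CRF by climbing an is-type-of hierarchy until enough probability mass is accumulated. The snippet below is that toy version only; the hierarchy, the leaf probabilities, and the 0.6 threshold are assumptions, not the paper's model.

```python
# Toy hierarchical back-off: report the most specific label whose accumulated
# probability clears a confidence threshold, otherwise move to its parent.
hierarchy = {"pepsi": "soda", "sprite": "soda", "soda": "drinkable", "mug": "drinkable"}

def hierarchical_label(leaf_probs, threshold=0.6):
    label, p = max(leaf_probs.items(), key=lambda kv: kv[1])
    while p < threshold and label in hierarchy:
        parent = hierarchy[label]
        # Probability of the parent is the total mass of the leaves beneath it.
        p = sum(q for leaf, q in leaf_probs.items()
                if leaf == parent or hierarchy.get(leaf) == parent)
        label = parent
    return label

print(hierarchical_label({"pepsi": 0.45, "sprite": 0.40, "mug": 0.15}))  # -> "soda"
```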
Unfolding an Indoor Origami World
"... Abstract. In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporat-ing these and other ..."
Abstract
-
Cited by 8 (1 self)
- Add to MetaCart
(Show Context)
In this work, we present a method for single-view reasoning about 3D surfaces and their relationships. We propose the use of mid-level constraints for 3D scene understanding in the form of convex and concave edges and introduce a generic framework capable of incorporating these and other constraints. Our method takes a variety of cues and uses them to infer a consistent interpretation of the scene. We demonstrate improvements over the state-of-the-art and produce interpretations of the scene that link large planar surfaces.
Estimating the 3D layout of indoor scenes and its clutter from depth sensors. In: ICCV (2013)
"... In this paper we propose an approach to jointly estimate the layout of rooms as well as the clutter present in the scene using RGB-D data. Towards this goal, we propose an effec-tive model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our ap-proach ..."
Abstract
-
Cited by 8 (2 self)
- Add to MetaCart
(Show Context)
In this paper we propose an approach to jointly estimate the layout of rooms as well as the clutter present in the scene using RGB-D data. Towards this goal, we propose an effective model that is able to exploit both depth and appearance features, which are complementary. Furthermore, our approach is efficient as we exploit the inherent decomposition of additive potentials. We demonstrate the effectiveness of our approach on the challenging NYU v2 dataset and show that employing depth reduces the layout error by 6% and the clutter estimation by 13%.
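One way to read "decomposition of additive potentials" is that a hypothesis' score is a sum of per-pixel terms over the regions it induces, so precomputing an integral image lets each candidate region be scored in constant time. The sketch below shows that trick in 2D; it is my reading of the efficiency claim, not the authors' implementation, and the score map and candidate regions are synthetic.

```python
# Additive potentials + integral image: score many candidate regions without
# re-summing per-pixel terms for each hypothesis.
import numpy as np

rng = np.random.default_rng(0)
per_pixel_score = rng.random((480, 640))            # e.g. "this pixel looks like floor" (synthetic)
integral = per_pixel_score.cumsum(0).cumsum(1)      # inclusive 2D prefix sums

def region_score(top, left, bottom, right):
    """Sum of per-pixel scores inside [top:bottom, left:right), in constant time."""
    total = integral[bottom - 1, right - 1]
    if top > 0:
        total -= integral[top - 1, right - 1]
    if left > 0:
        total -= integral[bottom - 1, left - 1]
    if top > 0 and left > 0:
        total += integral[top - 1, left - 1]
    return total

# Score many candidate floor regions cheaply and keep the best one.
best_region = max(((t, l, t + 100, l + 200)
                   for t in range(0, 380, 20) for l in range(0, 440, 20)),
                  key=lambda r: region_score(*r))
```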
Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture, arXiv preprint arXiv:1411.4734, 2014
"... In this paper we address three different computer vision tasks using a single multiscale convolutional network archi-tecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifica-tions, regre ..."
Abstract
-
Cited by 7 (1 self)
- Add to MetaCart
(Show Context)
In this paper we address three different computer vision tasks using a single multiscale convolutional network architecture: depth prediction, surface normal estimation, and semantic labeling. The network that we develop is able to adapt naturally to each task using only small modifications, regressing from the input image to the output map directly. Our method progressively refines predictions using a sequence of scales, and captures many image details without any superpixels or low-level segmentation. We achieve state-of-the-art performance on benchmarks for all three tasks.
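The architecture sketched in the abstract, a coarse global prediction refined by a finer scale that also sees the raw image, can be compressed into a few layers as below. Channel counts, kernel sizes, and the bilinear upsampling step are my choices for illustration; the paper's actual networks are larger and trained per task.

```python
# Coarse-to-fine sketch: a coarse network predicts a low-resolution map from the whole
# image; a finer network refines it from the upsampled coarse map plus the raw image.
import torch
import torch.nn as nn

class CoarseToFine(nn.Module):
    def __init__(self, out_channels=1):                 # 1 for depth, 3 for normals, C for labels
        super().__init__()
        self.coarse = nn.Sequential(
            nn.Conv2d(3, 32, 7, stride=4, padding=3), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(64, out_channels, 3, padding=1),
        )
        self.fine = nn.Sequential(
            nn.Conv2d(3 + out_channels, 64, 5, padding=2), nn.ReLU(),
            nn.Conv2d(64, out_channels, 5, padding=2),
        )

    def forward(self, image):
        coarse = self.coarse(image)                     # global, low-resolution prediction
        coarse_up = nn.functional.interpolate(coarse, size=image.shape[-2:],
                                              mode="bilinear", align_corners=False)
        return self.fine(torch.cat([image, coarse_up], dim=1))  # refine with image detail

pred = CoarseToFine(out_channels=1)(torch.randn(1, 3, 240, 320))  # -> (1, 1, 240, 320)
```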