Results 1–5 of 5
Towards Unified Depth and Semantic Prediction from a Single Image
Abstract

Cited by 1 (1 self)
Depth estimation and semantic segmentation are two fundamental problems in image understanding. While the two tasks are strongly correlated and mutually beneficial, they are usually solved separately or sequentially. Motivated by the complementary properties of the two tasks, we propose a unified framework for joint depth and semantic prediction. Given an image, we first use a trained Convolutional Neural Network (CNN) to jointly predict a global layout composed of pixel-wise depth values and semantic labels. By allowing for interactions between the depth and semantic information, the joint network provides more accurate depth prediction than a state-of-the-art CNN trained solely for depth prediction [6]. To further obtain fine-level details, the image is decomposed into local segments for region-level depth and semantic prediction under the guidance of the global layout. Utilizing the pixel-wise global prediction and region-wise local prediction, we formulate the inference problem in a two-layer Hierarchical Conditional Random Field (HCRF) to produce the final depth and semantic map. As demonstrated in the experiments, our approach effectively leverages the advantages of both tasks and provides state-of-the-art results.
Deep Hierarchical Parsing for Semantic Segmentation
, 2015
Abstract

Cited by 1 (0 self)
This paper proposes a learning-based approach to scene parsing inspired by the deep Recursive Context Propagation Network (RCPN). RCPN is a deep feed-forward neural network that utilizes the contextual information from the entire image, through bottom-up followed by top-down context propagation via random binary parse trees. This improves the feature representation of every superpixel in the image for better classification into semantic categories. We analyze RCPN and propose two novel contributions to further improve the model. We first analyze the learning of RCPN parameters and discover the presence of bypass error paths in the computation graph of RCPN that can hinder contextual propagation. We propose to tackle this problem by including the classification loss of the internal nodes of the random parse trees in the original RCPN loss function. Secondly, we use an MRF on the parse-tree nodes to model the hierarchical dependency present in the output. Both modifications provide performance boosts over the original RCPN, and the new system achieves state-of-the-art performance on the Stanford Background, SIFT Flow and Daimler urban datasets.
Learning to Segment Under Various Forms of Weak Supervision Supplementary Material
Abstract
In this section we prove Proposition 3.1 in the main paper. We start by presenting some preliminaries that are necessary for the proof. We refer the reader to [3] for more details.
Definition 1.1. [3] A matrix A is totally unimodular (TU) iff the determinants of all square submatrices of A are either −1, 0, or 1.
Theorem 1.2. [3] A (0, +1, −1) matrix A is totally unimodular if both of the following conditions are satisfied:
• Each column contains at most two nonzero elements.
• The rows of A can be partitioned into two sets A1 and A2 such that the two nonzero entries in a column are in the same set of rows if they have different signs, and in different sets of rows if they have the same sign.
Corollary 1.3. [3] A (0, +1, −1) matrix A is totally unimodular if it contains no more than one +1 and no more than one −1 in each column.
Theorem 1.4. [4, 1, 2] If A is totally unimodular and b is integral, then linear programs of the form {min c^T x | Ax = b, x ≥ 0} have integral optima, for any c.
The main idea of our proof is to show that the matrix describing our linear constraints is totally unimodular. Employing Theorem 1.4, we then know that the LP relaxation gives integral optima, since the right-hand side is integral in our optimization problem (Eq. (8) in the main paper). Given that our inference problem is fully decomposable with respect to images, we first decompose it into small LPs, one for each image. More formally, for each image i, we have, max
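Definition 1.1 above is directly checkable by brute force: enumerate every square submatrix and test whether its determinant lies in {−1, 0, 1}. The sketch below (not from the paper; the function name and example matrix are illustrative) does exactly that, and demonstrates Corollary 1.3 on the incidence matrix of a directed 3-cycle, which has at most one +1 and one −1 per column and is therefore TU:

```python
import itertools
import numpy as np

def is_totally_unimodular(A):
    """Brute-force check of Definition 1.1: every square submatrix
    of A must have determinant in {-1, 0, 1}. Exponential in the
    matrix size, so only suitable for small examples."""
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    for k in range(1, min(m, n) + 1):
        for rows in itertools.combinations(range(m), k):
            for cols in itertools.combinations(range(n), k):
                d = round(np.linalg.det(A[np.ix_(rows, cols)]))
                if d not in (-1, 0, 1):
                    return False
    return True

# Corollary 1.3: at most one +1 and one -1 per column
# (here: the incidence matrix of a directed 3-cycle) implies TU.
A = np.array([[ 1,  0, -1],
              [-1,  1,  0],
              [ 0, -1,  1]])
print(is_totally_unimodular(A))   # True

# A counterexample: the full determinant is 2, so not TU.
B = np.array([[1, 1],
              [-1, 1]])
print(is_totally_unimodular(B))   # False
```

By Theorem 1.4, verifying this property for the constraint matrix is what licenses solving the LP relaxation and still recovering integral optima.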
Image Parsing with a Wide Range of Classes and Scene-Level Context
Abstract
This paper presents a nonparametric scene parsing approach that improves the overall accuracy, as well as the coverage of foreground classes, in scene images. We first improve the label likelihood estimates at superpixels by merging likelihood scores from different probabilistic classifiers. This boosts the classification performance and enriches the representation of less-represented classes. Our second contribution consists of incorporating semantic context in the parsing process through global label costs. Our method does not rely on image retrieval sets but rather assigns a global likelihood estimate to each label, which is plugged into the overall energy function. We evaluate our system on two large-scale datasets, SIFT Flow and LMSun. We achieve state-of-the-art performance on the SIFT Flow dataset and near-record results on LMSun.
Sensor Fusion for Semantic Segmentation of Urban Scenes
Abstract
Abstract—Semantic understanding of environments is an important problem in robotics in general and intelligent autonomous systems in particular. In this paper, we propose a semantic segmentation algorithm which effectively fuses information from images and 3D point clouds. The proposed method incorporates information from multiple scales in an intuitive and effective manner. A late-fusion architecture is proposed to maximally leverage the training data in each modality. Finally, a pairwise Conditional Random Field (CRF) is used as a post-processing step to enforce spatial consistency in the structured prediction. The proposed algorithm is evaluated on the publicly available KITTI dataset [1] [2], augmented with additional pixel- and point-wise semantic labels for building, sky, road, vegetation, sidewalk, car, pedestrian, cyclist, sign/pole, and fence regions. A per-pixel accuracy of 89.3% and an average class accuracy of 65.4% are achieved, well above the current state-of-the-art [3].
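The late-fusion idea in the abstract above can be sketched in a few lines: each modality is trained separately to produce per-pixel class probabilities, and the scores are only combined afterwards, so each branch can exploit training data the other modality lacks. The function name, weight, and toy scores below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def late_fusion(image_scores, cloud_scores, w=0.5):
    """Hypothetical late-fusion step: weighted average of per-pixel
    class probabilities from two independently trained branches
    (image CNN and 3D point-cloud classifier)."""
    return w * image_scores + (1.0 - w) * cloud_scores

# Toy example: 2 pixels, 3 classes (e.g. road, car, sky).
img = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.8, 0.1]])
pcd = np.array([[0.5, 0.4, 0.1],
                [0.3, 0.3, 0.4]])

fused = late_fusion(img, pcd)       # [[0.6, 0.3, 0.1], [0.2, 0.55, 0.25]]
labels = fused.argmax(axis=1)       # per-pixel label before CRF smoothing
```

In the paper's pipeline a pairwise CRF would then smooth these per-pixel decisions; here `labels` is simply the arg-max of the fused scores.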