Results 1 - 10
of
96
LabelMe: A Database and Web-Based Tool for Image Annotation
, 2008
"... We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sha ..."
Abstract
-
Cited by 232 (37 self)
- Add to MetaCart
We seek to build a large collection of images with ground truth labels to be used for object detection and recognition research. Such data is useful for supervised learning and quantitative evaluation. To achieve this, we developed a web-based tool that allows easy image annotation and instant sharing of such annotations. Using this annotation tool, we have collected a large dataset that spans many object categories, often containing multiple instances over a wide variety of images. We quantify the contents of the dataset and compare against existing state of the art datasets used for object recognition and detection. Also, we show how to extend the dataset to automatically enhance object labels with WordNet, discover object parts, recover a depth ordering of objects in a scene, and increase the number of labels using minimal user supervision and images from the web.
Object Detection with Discriminatively Trained Part Based Models
"... We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their ..."
Abstract
-
Cited by 171 (14 self)
- Add to MetaCart
We describe an object detection system based on mixtures of multiscale deformable part models. Our system is able to represent highly variable object classes and achieves state-of-the-art results in the PASCAL object detection challenges. While deformable part models have become quite popular, their value had not been demonstrated on difficult benchmarks such as the PASCAL datasets. Our system relies on new methods for discriminative training with partially labeled data. We combine a margin-sensitive approach for data-mining hard negative examples with a formalism we call latent SVM. A latent SVM is a reformulation of MI-SVM in terms of latent variables. A latent SVM is semi-convex and the training problem becomes convex once latent information is specified for the positive examples. This leads to an iterative training algorithm that alternates between fixing latent values for positive examples and optimizing the latent SVM objective function.
Discriminative models for multi-class object layout
"... Many state-of-the-art approaches for object recognition reduce the problem to a 0-1 classification task. Such reductions allow one to leverage sophisticated classifiers for learning. These models are typically trained independently for each class using positive and negative examples cropped from ima ..."
Abstract
-
Cited by 51 (5 self)
- Add to MetaCart
Many state-of-the-art approaches for object recognition reduce the problem to a 0-1 classification task. Such reductions allow one to leverage sophisticated classifiers for learning. These models are typically trained independently for each class using positive and negative examples cropped from images. At test-time, various post-processing heuristics such as non-maxima suppression (NMS) are required to reconcile multiple detections within and between different classes for each image. Though crucial to good performance on benchmarks, this post-processing is usually defined heuristically. We introduce a unified model for multi-class object recognition that casts the problem as a structured prediction task. Rather than predicting a binary label for each image
Photo clip art
- ACM Transactions on Graphics (SIGGRAPH
, 2007
"... Figure 1: Starting with a present day photograph of the famous Abbey Road in London (left), a person using our system was easily able to make the scene much more lively. There are 4 extra objects in the middle image, and 17 extra in the right image. Can you spot them all? We present a system for ins ..."
Abstract
-
Cited by 47 (15 self)
- Add to MetaCart
Figure 1: Starting with a present day photograph of the famous Abbey Road in London (left), a person using our system was easily able to make the scene much more lively. There are 4 extra objects in the middle image, and 17 extra in the right image. Can you spot them all? We present a system for inserting new objects into existing photographs by querying a vast image-based object library, precomputed using a publicly available Internet object database. The central goal is to shield the user from all of the arduous tasks typically involved in image compositing. The user is only asked to do two simple things: 1) pick a 3D location in the scene to place a new object; 2) select an object to insert using a hierarchical menu. We pose the problem of object insertion as a data-driven, 3D-based, context-sensitive object retrieval task. Instead of trying to manipulate the object to change its orientation, color distribution, etc. to fit the new image, we simply retrieve an object of a specified class that has all the required properties (camera pose, lighting, resolution, etc) from our large object library. We present new automatic algorithms for improving object segmentation and blending, estimating true 3D object size and orientation, and estimating scene lighting conditions. We also present an intuitive user interface that makes object insertion fast and simple even for the artistically challenged.
Auto-context and its Application to High-level Vision Tasks
- In Proc. CVPR
"... The notion of using context information for solving high-level vision and medical image segmentation problems has been increasingly realized in the field. However, how to learn an effective and efficient context model, together with an image appearance model, remains mostly unknown. The current lite ..."
Abstract
-
Cited by 40 (1 self)
- Add to MetaCart
The notion of using context information for solving high-level vision and medical image segmentation problems has been increasingly realized in the field. However, how to learn an effective and efficient context model, together with an image appearance model, remains mostly unknown. The current literature using Markov Random Fields (MRFs) and Conditional Random Fields (CRFs) often involves specific algorithm design, in which the modeling and computing stages are studied in isolation. In this paper, we propose the auto-context algorithm. Given a set of training images and their corresponding label maps, we first learn a classifier on local image patches. The discriminative probability (or classification confidence) maps created by the learned classifier are then used as context information, in addition to the original image patches, to train a new classifier. The algorithm then iterates until convergence. Auto-context integrates low-level and context information by fusing a large number of low-level appearance features with context and implicit shape information. The resulting discriminative algorithm is general and easy to implement. Under nearly the same parameter settings in training, we apply the algorithm to three challenging vision applications: foreground/background segregation, human body configuration estimation, and scene region labeling. Moreover, context also plays a very important role in medical/brain images where the anatomical structures are mostly constrained to relatively fixed positions. With only some slight changes resulting from using 3D instead of 2D features, the auto-context algorithm applied to brain MRI image segmentation is shown to outperform state-of-the-art algorithms specifically designed for this domain. Furthermore, the scope of the proposed algorithm goes beyond image analysis and it has the potential to be used for a wide variety of problems in multi-variate labeling.
3-D depth reconstruction from a single still image
, 2006
"... We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc ..."
Abstract
-
Cited by 38 (12 self)
- Add to MetaCart
We consider the task of 3-d depth estimation from a single still image. We take a supervised learning approach to this problem, in which we begin by collecting a training set of monocular images (of unstructured indoor and outdoor environments which include forests, sidewalks, trees, buildings, etc.) and their corresponding ground-truth depthmaps. Then, we apply supervised learning to predict the value of the depthmap as a function of the image. Depth estimation is a challenging problem, since local features alone are insufficient to estimate depth at a point, and one needs to consider the global context of the image. Our model uses a hierarchical, multiscale Markov Random Field (MRF) that incorporates multiscale local- and global-image features, and models the depths and the relation between depths at different points in the image. We show that, even on unstructured scenes, our algorithm is frequently able to recover fairly accurate depthmaps. We further propose a model that incorporates both monocular cues and stereo (triangulation) cues, to obtain significantly more accurate depth estimates than is possible using either monocular or stereo cues alone.
Learning Spatial Context: Using Stuff to Find Things
"... Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally cl ..."
Abstract
-
Cited by 35 (1 self)
- Add to MetaCart
Abstract. The sliding window approach of detecting rigid objects (such as cars) is predicated on the belief that the object can be identified from the appearance in a small region around the object. Other types of objects of amorphous spatial extent (e.g., trees, sky), however, are more naturally classified based on texture or color. In this paper, we seek to combine recognition of these two types of objects into a system that leverages “context ” toward improving detection. In particular, we cluster image regions based on their ability to serve as context for the detection of objects. Rather than providing an explicit training set with region labels, our method automatically groups regions based on both their appearance and their relationships to the detections in the image. We show that our things and stuff (TAS) context model produces meaningful clusters that are readily interpretable, and helps improve our detection ability over state-of-the-art detectors. We also present a method for learning the active set of relationships for a particular dataset. We present results on object detection in images from the PASCAL VOC 2005/2006 datasets and on the task of overhead car detection in satellite images, demonstrating significant improvements over state-of-the-art detectors. 1
Make3D: Learning 3D Scene Structure from a Single Still Image
"... We consider the problem of estimating detailed 3-d structure from a single still image of an unstructured environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. For each small homogeneous patch in the image, we use a Markov Random Field (M ..."
Abstract
-
Cited by 30 (8 self)
- Add to MetaCart
We consider the problem of estimating detailed 3-d structure from a single still image of an unstructured environment. Our goal is to create 3-d models which are both quantitatively accurate as well as visually pleasing. For each small homogeneous patch in the image, we use a Markov Random Field (MRF) to infer a set of “plane parameters” that capture both the 3-d location and 3-d orientation of the patch. The MRF, trained via supervised learning, models both image depth cues as well as the relationships between different parts of the image. Other than assuming that the environment is made up of a number of small planes, our model makes no explicit assumptions about the structure of the scene; this enables the algorithm to capture much more detailed 3-d structure than does prior art, and also give a much richer experience in the 3-d flythroughs created using image-based rendering, even for scenes with significant non-vertical structure. Using this approach, we have created qualitatively correct 3-d models for 64.9 % of 588 images downloaded from the internet. We have also extended our model to produce large scale 3d models from a few images.
Segmentation and recognition using structure from motion point clouds
- In ECCV
, 2008
"... Abstract. We propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. We motivate five simple cues designed to model specific patterns of motion and 3D world structure that vary with object category. We introduce features that project the 3D cues back to the 2 ..."
Abstract
-
Cited by 29 (8 self)
- Add to MetaCart
Abstract. We propose an algorithm for semantic segmentation based on 3D point clouds derived from ego-motion. We motivate five simple cues designed to model specific patterns of motion and 3D world structure that vary with object category. We introduce features that project the 3D cues back to the 2D image plane while modeling spatial layout and context. A randomized decision forest combines many such features to achieve a coherent 2D segmentation and recognize the object categories present. Our main contribution is to show how semantic segmentation is possible based solely on motion-derived 3D world structure. Our method works well on sparse, noisy point clouds, and unlike existing approaches, does not need appearance-based descriptors. Experiments were performed on a challenging new video database containing sequences filmed from a moving car in daylight and at dusk. The results confirm that indeed, accurate segmentation and recognition are possible using only motion and 3D world structure. Further, we show that the motion-derived information complements an existing state-of-the-art appearance-based method, improving both qualitative and quantitative performance. input video frame reconstructed 3D point cloud automatic segmentation Fig. 1. The proposed algorithm uses 3D point clouds estimated from videos such as the pictured driving sequence (with ground truth inset). Having trained on point clouds from other driving sequences, our new motion and structure features, based purely on the point cloud, perform 11-class semantic segmentation of each test frame. The colors in the ground truth and inferred segmentation indicate category labels. 2 1
Recovering Occlusion Boundaries from a Single Image
"... Occlusion reasoning, necessary for tasks such as navigation and object search, is an important aspect of everyday life and a fundamental problem in computer vision. We believe that the amazing ability of humans to reason about occlusions from one image is based on an intrinsically 3D interpretation. ..."
Abstract
-
Cited by 28 (6 self)
- Add to MetaCart
Occlusion reasoning, necessary for tasks such as navigation and object search, is an important aspect of everyday life and a fundamental problem in computer vision. We believe that the amazing ability of humans to reason about occlusions from one image is based on an intrinsically 3D interpretation. In this paper, our goal is to recover the occlusion boundaries and depth ordering of free-standing structures in the scene. Our approach is to learn to identify and label occlusion boundaries using the traditional edge and region cues together with 3D surface and depth cues. Since some of these cues require good spatial support (i.e., a segmentation), we gradually create larger regions and use them to improve inference over the boundaries. Our experiments demonstrate the power of a scene-based approach to occlusion reasoning. 1.

