Results 1 - 10 of 165
Rich feature hierarchies for accurate object detection and semantic segmentation
Cited by 251 (23 self)
Object detection performance, as measured on the canonical PASCAL VOC dataset, has plateaued in the last few years. The best-performing methods are complex ensemble systems that typically combine multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm that improves mean average precision (mAP) by more than 30% relative to the previous best result on VOC 2012, achieving a mAP of 53.3%. Our approach combines two key insights: (1) one can apply high-capacity convolutional neural networks (CNNs) to bottom-up region proposals in order to localize and segment objects and (2) when labeled training data is scarce, supervised pre-training for an auxiliary task, followed by domain-specific fine-tuning, yields a significant performance boost. Since we combine region proposals with CNNs, we call our method R-CNN: Regions with CNN features. Source code for the complete system is available at
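As a quick check on the figures in this abstract: a relative gain of more than 30% that ends at 53.3% mAP implies the previous best result was below roughly 41% mAP. The Python sketch below only illustrates that arithmetic; the function name is hypothetical and nothing here comes from the R-CNN code.

```python
def max_previous_score(new_score: float, relative_gain: float) -> float:
    """Largest baseline for which `new_score` is at least `relative_gain`
    (a fraction, e.g. 0.30 for 30%) above that baseline."""
    return new_score / (1.0 + relative_gain)


# Figures taken from the abstract: 53.3% mAP, more than 30% relative gain.
bound = max_previous_score(53.3, 0.30)
print(f"previous best mAP was below about {bound:.1f}%")
```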
Caffe: Convolutional architecture for fast feature embedding
, 2014
Cited by 192 (8 self)
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs by CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (≈ 2.5 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
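The two throughput figures in this abstract can be related with a short conversion. The Python sketch below assumes a single GPU processing images back to back for a full 24-hour day with no idle time; the helper names are hypothetical and are not part of Caffe. Under that assumption, 40 million images a day corresponds to about 2.16 ms per image, and exactly 2.5 ms per image corresponds to about 34.6 million images a day, so the abstract's rounded figures agree only approximately.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def ms_per_image(images_per_day: float) -> float:
    """Average time budget per image, in milliseconds, for a daily volume."""
    return SECONDS_PER_DAY * 1000.0 / images_per_day


def images_per_day(ms_per_img: float) -> float:
    """Daily volume reachable at a fixed per-image latency in milliseconds."""
    return SECONDS_PER_DAY * 1000.0 / ms_per_img


print(f"{ms_per_image(40e6):.2f} ms per image at 40 million images/day")
print(f"{images_per_day(2.5) / 1e6:.2f} million images/day at 2.5 ms per image")
```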
Overfeat: Integrated recognition, localization and detection using convolutional networks
- http://arxiv.org/abs/1312.6229
Going deeper with convolutions
, 2014
Cited by 65 (2 self)
We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.
Multiscale combinatorial grouping
- In: CVPR (2014)
Cited by 56 (11 self)
We propose a unified approach for bottom-up hierarchical image segmentation and object candidate generation for recognition, called Multiscale Combinatorial Grouping (MCG). For this purpose, we first develop a fast normalized cuts algorithm. We then propose a high-performance hierarchical segmenter that makes effective use of multiscale information. Finally, we propose a grouping strategy that combines our multiscale regions into highly accurate object candidates by efficiently exploring their combinatorial space. We conduct extensive experiments on both the BSDS500 and the PASCAL 2012 segmentation datasets, showing that MCG produces state-of-the-art contours, hierarchical regions and object candidates.
Blocks that shout: Distinctive parts for scene classification
, 2013
Cited by 52 (1 self)
The automatic discovery of distinctive parts for an object or scene class is challenging since it requires simultaneously learning the part appearance and identifying the part occurrences in images. In this paper, we propose a simple, efficient, and effective method to do so. We address this problem by learning parts incrementally, starting from a single part occurrence with an Exemplar SVM. In this manner, additional part instances are discovered and aligned reliably before being considered as training examples. We also propose entropy-rank curves as a means of evaluating the distinctiveness of parts shareable between categories and use them to select useful parts out of a set of candidates. We apply the new representation to the task of scene categorisation on the MIT Scene 67 benchmark. We show that our method can learn parts which are significantly more informative and for a fraction of the cost, compared to previous part-learning methods such as Singh et al. [28]. We also show that a well constructed bag of words or Fisher vector model can substantially outperform the previous state-of-the-art classification performance on this data.
Spatial pyramid pooling in deep convolutional networks for visual recognition
- In ECCV
Cited by 52 (5 self)
Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to eliminate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test images, our method computes convolutional features 30-170× faster than the recent leading method R-CNN (and 24-64× faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.
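To illustrate how pooling over a pyramid of bins can yield a fixed-length vector from feature maps of different sizes, here is a minimal NumPy sketch. It is not the SPP-net implementation: the pyramid levels (1, 2, 4), the use of max pooling, and the floor/ceil bin-edge rule are assumptions chosen for the example, and the function name is hypothetical.

```python
import math

import numpy as np


def spatial_pyramid_pool(feature_map: np.ndarray, levels=(1, 2, 4)) -> np.ndarray:
    """Max-pool a (C, H, W) feature map into n x n bins for each n in
    `levels` and concatenate the results.

    The output length is C * sum(n * n for n in levels), independent of H and W.
    """
    channels, height, width = feature_map.shape
    pooled = []
    for n in levels:
        for i in range(n):
            # Floor for the start and ceil for the end, so the bins cover the
            # whole map and are never empty, even when H or W is smaller than n.
            top = (i * height) // n
            bottom = max(top + 1, math.ceil((i + 1) * height / n))
            for j in range(n):
                left = (j * width) // n
                right = max(left + 1, math.ceil((j + 1) * width / n))
                region = feature_map[:, top:bottom, left:right]
                pooled.append(region.max(axis=(1, 2)))  # one value per channel
    return np.concatenate(pooled)


rng = np.random.default_rng(0)
small = spatial_pyramid_pool(rng.random((8, 6, 9)))
large = spatial_pyramid_pool(rng.random((8, 13, 17)))
# Both vectors have length 8 * (1 + 4 + 16) = 168 despite different map sizes.
print(small.shape, large.shape)
```

Because the number of bins is fixed while the bin size scales with the map, the same pooling can be applied to the whole feature map or to any sub-window of it, which is what lets region features be pooled from a single pass over the image.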
Simultaneous Detection and Segmentation
Cited by 46 (10 self)
We aim to detect all instances of a category in an image and, for each instance, mark the pixels that belong to it. We call this task Simultaneous Detection and Segmentation (SDS). Unlike classical bounding box detection, SDS requires a segmentation and not just a box. Unlike classical semantic segmentation, we require individual object instances. We build on recent work that uses convolutional neural networks to classify category-independent region proposals (R-CNN [16]), introducing a novel architecture tailored for SDS. We then use category-specific, top-down figure-ground predictions to refine our bottom-up proposals. We show a 7 point boost (16% relative) over our baselines on SDS, a 5 point boost (10% relative) over state-of-the-art on semantic segmentation, and state-of-the-art performance in object detection. Finally, we provide diagnostic tools that unpack performance and provide directions for future work.
Regionlets for generic object detection
, 2013
Cited by 45 (6 self)
Generic object detection is confronted by dealing with different degrees of variations in distinct object classes with tractable computations, which demands descriptive and flexible object representations that are also efficient to evaluate for many locations. In view of this, we propose to model an object class by a cascaded boosting classifier which integrates various types of features from competing local regions, named regionlets. A regionlet is a base feature extraction region defined proportionally to a detection window at an arbitrary resolution (i.e. size and aspect ratio). These regionlets are organized in small groups with stable relative positions to delineate fine-grained spatial layouts inside objects. Their features are aggregated to a one-dimensional feature within one group so as to tolerate deformations. Then we evaluate the object bounding box proposal in selective search from segmentation cues, limiting the evaluation locations to thousands. Our approach significantly outperforms the state-of-the-art on popular multi-class detection benchmark datasets with a single method, without any contexts. It achieves the detection mean average precision of 41.7% on the PASCAL VOC 2007 dataset and 39.7% on the VOC 2010 for 20 object categories. It achieves 14.7% mean average precision on the ImageNet dataset for 200 object categories, outperforming the latest deformable part-based model (DPM) by 4.7%.
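The idea of a region "defined proportionally to a detection window" can be sketched as a coordinate mapping: the region is stored in window-relative coordinates and placed into whatever window is being evaluated. The Python below is only an illustration under that reading of the abstract; the `Box` type, the `place_regionlet` name, and the [0, 1] coordinate convention are assumptions, not the authors' implementation, and feature extraction and aggregation are left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    left: float
    top: float
    right: float
    bottom: float


def place_regionlet(window: Box, rel: Box) -> Box:
    """Map a regionlet given in window-relative coordinates (each in [0, 1])
    to absolute image coordinates for one detection window."""
    width = window.right - window.left
    height = window.bottom - window.top
    return Box(
        window.left + rel.left * width,
        window.top + rel.top * height,
        window.left + rel.right * width,
        window.top + rel.bottom * height,
    )


# The same relative regionlet (lower-middle part of the window) placed in a
# tall window and in a wide window with a different size and aspect ratio.
rel = Box(0.25, 0.5, 0.75, 1.0)
print(place_regionlet(Box(0, 0, 100, 200), rel))
print(place_regionlet(Box(40, 10, 360, 90), rel))
```

Storing the region relative to the window is what lets one learned layout be reused across windows of any size and aspect ratio.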