Results 1 - 10
of
32
Spatial pyramid pooling in deep convolutional networks for visual recognition
- In ECCV
"... Abstract. Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled poo ..."
Abstract
-
Cited by 52 (5 self)
- Add to MetaCart
Abstract. Existing deep convolutional neural networks (CNNs) require a fixed-size (e.g. 224×224) input image. This requirement is “artificial” and may hurt the recognition accuracy for the images or sub-images of an arbitrary size/scale. In this work, we equip the networks with a more principled pooling strategy, “spatial pyramid pooling”, to elimi-nate the above requirement. The new network structure, called SPP-net, can generate a fixed-length representation regardless of image size/scale. By removing the fixed-size limitation, we can improve all CNN-based image classification methods in general. Our SPP-net achieves state-of-the-art accuracy on the datasets of ImageNet 2012, Pascal VOC 2007, and Caltech101. The power of SPP-net is more significant in object detection. Using SPP-net, we compute the feature maps from the entire image only once, and then pool features in arbitrary regions (sub-images) to generate fixed-length representations for training the detectors. This method avoids repeatedly computing the convolutional features. In processing test im-ages, our method computes convolutional features 30-170 × faster than the recent leading method R-CNN (and 24-64 × faster overall), while achieving better or comparable accuracy on Pascal VOC 2007.1 1
CNN: Single-label to multi-label
- CoRR
"... Abstract—Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the complex underlying object layouts and insufficient multi-label training ..."
Abstract
-
Cited by 6 (0 self)
- Add to MetaCart
(Show Context)
Abstract—Convolutional Neural Network (CNN) has demonstrated promising performance in single-label image classification tasks. However, how CNN best copes with multi-label images still remains an open problem, mainly due to the complex underlying object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), where an arbitrary number of object segment hypotheses are taken as the inputs, then a shared CNN is connected with each hypothesis, and finally the CNN output results from different hypotheses are aggregated with max pooling to produce the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) no explicit hypothesis label is required; 4) the shared CNN may be well pre-trained with a large-scale single-label image dataset, e.g. ImageNet; and 5) it may naturally output multi-label prediction results. Experimental results on Pascal VOC2007 and VOC2012 multi-label image datasets well demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-arts. In particular, the mAP reaches 84.2 % by HCP only and 90.3 % after the fusion with our complementary result in [47] based on hand-crafted features on the VOC2012 dataset, which significantly outperforms the state-of-the-arts with a large margin of more than 7%.
Encoding high dimensional local features by sparse coding based fisher vectors
- in Proc. Adv. Neural Inf. Process. Syst., 2014
"... Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for im-age classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) to characterize the generation process of l ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Deriving from the gradient vector of a generative model of local features, Fisher vector coding (FVC) has been identified as an effective coding method for im-age classification. Most, if not all, FVC implementations employ the Gaussian mixture model (GMM) to characterize the generation process of local features. This choice has shown to be sufficient for traditional low dimensional local fea-tures, e.g., SIFT; and typically, good performance can be achieved with only a few hundred Gaussian distributions. However, the same number of Gaussians is insuf-ficient to model the feature space spanned by higher dimensional local features, which have become popular recently. In order to improve the modeling capacity for high dimensional features, it turns out to be inefficient and computationally impractical to simply increase the number of Gaussians. In this paper, we propose a model in which each local feature is drawn from a Gaussian distribution whose mean vector is sampled from a subspace. With cer-tain approximation, this model can be converted to a sparse coding procedure and the learning/inference problems can be readily solved by standard sparse coding methods. By calculating the gradient vector of the proposed model, we derive a new fisher vector encoding strategy, termed Sparse Coding based Fisher Vec-tor Coding (SCFVC). Moreover, we adopt the recently developed Deep Convo-lutional Neural Network (CNN) descriptor as a high dimensional local feature and implement image classification with the proposed SCFVC. Our experimen-tal evaluations demonstrate that our method not only significantly outperforms the traditional GMM based Fisher vector encoding but also achieves the state-of-the-art performance in generic object recognition, indoor scene, and fine-grained image classification problems. 1
Discrete Graph Hashing
"... Hashing has emerged as a popular technique for fast nearest neighbor search in gi-gantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the perfor-mance of most unsupervised learning based hashing meth ..."
Abstract
-
Cited by 5 (2 self)
- Add to MetaCart
(Show Context)
Hashing has emerged as a popular technique for fast nearest neighbor search in gi-gantic databases. In particular, learning based hashing has received considerable attention due to its appealing storage and search efficiency. However, the perfor-mance of most unsupervised learning based hashing methods deteriorates rapidly as the hash code length increases. We argue that the degraded performance is due to inferior optimization procedures used to achieve discrete binary codes. This paper presents a graph-based unsupervised hashing model to preserve the neigh-borhood structure of massive data in a discrete code space. We cast the graph hashing problem into a discrete optimization framework which directly learns the binary codes. A tractable alternating maximization algorithm is then proposed to explicitly deal with the discrete constraints, yielding high-quality codes to well capture the local neighborhoods. Extensive experiments performed on four large datasets with up to one million samples show that our discrete optimization based graph hashing method obtains superior search accuracy over state-of-the-art un-supervised hashing methods, especially for longer codes. 1
Convolutional network features for scene recognition
- In ACM MM
, 2014
"... ABSTRACT Convolutional neural networks have recently been used to obtain record-breaking results in many vision benchmarks. In addition, the intermediate layer activations of a trained network when exposed to new data sources have been shown to perform very well as generic image features, even when ..."
Abstract
-
Cited by 4 (0 self)
- Add to MetaCart
(Show Context)
ABSTRACT Convolutional neural networks have recently been used to obtain record-breaking results in many vision benchmarks. In addition, the intermediate layer activations of a trained network when exposed to new data sources have been shown to perform very well as generic image features, even when there are substantial differences between the original training data of the network and the new domain. In this paper, we focus on scene recognition and show that convolutional networks trained on mostly object recognition data can successfully be used for feature extraction in this task as well. We train a total of four networks with different training data and architectures, and show that the proposed method combining multiple scales and multiple features obtains state-ofthe-art performance on four standard scene datasets.
A discriminative cnn video representation for event detection
- In CVPR
, 2015
"... In this paper, we propose a discriminative video rep-resentation for event detection over a large scale video dataset when only limited hardware resources are avail-able. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where on ..."
Abstract
-
Cited by 3 (1 self)
- Add to MetaCart
(Show Context)
In this paper, we propose a discriminative video rep-resentation for event detection over a large scale video dataset when only limited hardware resources are avail-able. The focus of this paper is to effectively leverage deep Convolutional Neural Networks (CNNs) to advance event detection, where only frame level static descriptors can be extracted by the existing CNN toolkit. This paper makes two contributions to the inference of CNN video representa-tion. First, while average pooling and max pooling have long been the standard approaches to aggregating frame level static features, we show that performance can be sig-nificantly improved by taking advantage of an appropriate encoding method. Second, we propose using a set of latent concept descriptors as the frame descriptor, which enriches visual information while keeping it computationally afford-able. The integration of the two contributions results in a new state-of-the-art performance in event detection over the largest video datasets. Compared to improved Dense Trajectories, which has been recognized as the best video representation for event detection, our new representation improves the Mean Average Precision (mAP) from 27.6% to 36.8 % for the TRECVID MEDTest 14 dataset and from 34.0 % to 44.6 % for the TRECVID MEDTest 13 dataset. 1. Introduction and Related
Deepid-net: Deformable deep convolutional neural networks for object detection
- In CVPR
, 2015
"... In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the def ..."
Abstract
-
Cited by 3 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we propose deformable deep convolutional neural networks for generic object detection. This new deep learning object detection framework has innovations in multiple aspects. In the proposed new deep architecture, a new deformation constrained pooling (def-pooling) layer models the deformation of object parts with geometric con-straint and penalty. A new pre-training strategy is proposed to learn feature representations more suitable for the object detection task and with good generalization capability. By changing the net structures, training strategies, adding and removing some key components in the detection pipeline, a set of models with large diversity are obtained, which significantly improves the effectiveness of model averag-ing. The proposed approach improves the mean averaged precision obtained by RCNN [14], which was the state-of-the-art, from 31 % to 50.3 % on the ILSVRC2014 detection test set. It also outperforms the winner of ILSVRC2014, GoogLeNet, by 6.1%. Detailed component-wise analysis is also provided through extensive experimental evaluation, which provide a global view for people to understand the deep learning object detection pipeline. 1.
Harvesting discriminative meta objects with deep cnn features for scene classification.
- In The IEEE International Conference on Computer Vision (ICCV),
, 2015
"... Abstract Recent work on scene classification still makes use of generic CNN features in a rudimentary manner. In this paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract Recent work on scene classification still makes use of generic CNN features in a rudimentary manner. In this paper, we present a novel pipeline built upon deep CNN features to harvest discriminative visual objects and parts for scene classification. We first use a region proposal technique to generate a set of high-quality patches potentially containing objects, and apply a pre-trained CNN to extract generic deep features from these patches. Then we perform both unsupervised and weakly supervised learning to screen these patches and discover discriminative ones representing category-specific objects and parts. We further apply discriminative clustering enhanced with local CNN finetuning to aggregate similar objects and parts into groups, called meta objects. A scene image representation is constructed by pooling the feature response maps of all the learned meta objects at multiple spatial scales. We have confirmed that the scene image representation obtained using this new pipeline is capable of delivering state-of-theart performance on two popular scene benchmark datasets, MIT Indoor 67
Deep filter banks for texture recognition and segmentation
- In CVPR
, 2015
"... Abstract Research in texture recognition often concentrates on the problem of material recognition in uncluttered conditions, an assumption rarely met by applications. In this work we conduct a first study of material and describable texture attributes recognition in clutter, using a new dataset de ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
(Show Context)
Abstract Research in texture recognition often concentrates on the problem of material recognition in uncluttered conditions, an assumption rarely met by applications. In this work we conduct a first study of material and describable texture attributes recognition in clutter, using a new dataset derived from the OpenSurface texture repository. Motivated by the challenge posed by this problem, we propose a new texture descriptor, FV-CNN, obtained by Fisher Vector pooling of a Convolutional Neural Network (CNN) filter bank. FV-CNN substantially improves the state-of-the-art in texture, material and scene recognition. Our approach achieves 79.8% accuracy on Flickr material dataset and 81% accuracy on MIT indoor scenes, providing absolute gains of more than 10% over existing approaches. FV-CNN easily transfers across domains without requiring feature adaptation as for methods that build on the fully-connected layers of CNNs. Furthermore, FV-CNN can seamlessly incorporate multiscale information and describe regions of arbitrary shapes and sizes. Our approach is particularly suited at localizing "stuff" categories and obtains state-of-the-art results on MSRC segmentation dataset, as well as promising results on recognizing materials and surface attributes in clutter on the OpenSurfaces dataset.
Rank Preserving Hashing for Rapid Image Search
"... In recent years, hashing techniques are becoming overwhelmingly popular for their high efficiency in handling large-scale computer vision applications. It has been shown that hashing techniques which leverage supervised information can significantly enhance perfor-mance, and thus greatly benefit vis ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
(Show Context)
In recent years, hashing techniques are becoming overwhelmingly popular for their high efficiency in handling large-scale computer vision applications. It has been shown that hashing techniques which leverage supervised information can significantly enhance perfor-mance, and thus greatly benefit visual search tasks. Typically, a modern hashing method uses a set of hash functions to compress data samples into compact binary codes. However, few methods have developed hash functions to optimize the precision at the top of a ranking list based upon Hamming distances. In this paper, we propose a novel supervised hashing approach, namely Rank Preserving Hashing (RPH), to explicitly optimize the precision of Hamming distance ranking towards preserving the supervised rank information. The core idea is to train disciplined hash functions in which the mistakes at the top of a Hamming-distance ranking list are penalized more than those at the bottom. To find such hash functions, we relax the original discrete optimization objective to a continuous surrogate, and then design an online learning algorithm to efficiently optimize the surrogate objective. Empirical studies based upon two benchmark image datasets demonstrate that the pro-posed hashing approach achieves superior image search accuracy over the state-of-the-art approaches.