Results 1 - 10
of
266
Semantic Texton Forests for Image Categorization and Segmentation
"... We propose semantic texton forests, efficient and powerful new low-level features. These are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors. They are extremely fast to both train and tes ..."
Abstract
-
Cited by 304 (13 self)
- Add to MetaCart
We propose semantic texton forests, efficient and powerful new low-level features. These are ensembles of decision trees that act directly on image pixels, and therefore do not need the expensive computation of filter-bank responses or local descriptors. They are extremely fast to both train and test, especially compared with k-means clustering and nearest-neighbor assignment of feature descriptors. The nodes in the trees provide (i) an implicit hierarchical clustering into semantic textons, and (ii) an explicit local classification estimate. Our second contribution, the bag of semantic textons, combines a histogram of semantic textons over an image region with a region prior category distribution. The bag of semantic textons is computed over the whole image for categorization, and over local rectangular regions for segmentation. Including both histogram and region prior allows our segmentation algorithm to exploit both textural and semantic context. Our third contribution is an image-level prior for segmentation that emphasizes those categories that the automatic categorization believes to be present. We evaluate on two datasets including the very challenging VOC 2007 segmentation dataset. Our results significantly advance the state-of-the-art in segmentation accuracy, and furthermore, our use of efficient decision forests gives at least a five-fold increase in execution speed.
Action Recognition by Dense Trajectories
, 2011
"... Feature trajectories have shown to be efficient for rep-resenting videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as quantity of these trajecto-ries is often not sufficient. Inspired by the recent success of dense ..."
Abstract
-
Cited by 293 (16 self)
- Add to MetaCart
Feature trajectories have shown to be efficient for rep-resenting videos. Typically, they are extracted using the KLT tracker or matching SIFT descriptors between frames. However, the quality as well as quantity of these trajecto-ries is often not sufficient. Inspired by the recent success of dense sampling in image classification, we propose an approach to describe videos by dense trajectories. We sam-ple dense points from each frame and track them based on displacement information from a dense optical flow field. Given a state-of-the-art optical flow algorithm, our trajec-tories are robust to fast irregular motions as well as shot boundaries. Additionally, dense trajectories cover the mo-tion information in videos well. We, also, investigate how to design descriptors to encode the trajectory information. We introduce a novel descriptor based on motion boundary histograms, which is robust to camera motion. This descriptor consistently outperforms other state-of-the-art descriptors, in particular in uncon-trolled realistic videos. We evaluate our video description in the context of action classification with a bag-of-features approach. Experimental results show a significant improve-ment over the state of the art on four datasets of varying difficulty, i.e. KTH, YouTube, Hollywood2 and UCF sports.
An Analysis of Single-Layer Networks in Unsupervised Feature Learning
"... A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several ..."
Abstract
-
Cited by 223 (19 self)
- Add to MetaCart
(Show Context)
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several very simple factors, such as the number of hidden nodes in the model, may be as important to achieving high performance as the choice of learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs and K-means clustering, Gaussian mixtures) to NORB and CIFAR datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (“stride”) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are as critical to achieving high performance as the choice of algorithm itself—so critical, in fact, that when these parameters are pushed to their limits, we are able to achieve state-of-theart performance on both CIFAR and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy implement. Despite the simplicity of our system, we achieve performance beyond all previously published results on the CIFAR-10 and NORB datasets (79.6 % and 97.0 % accuracy respectively). 1
Context-aware saliency detection
- in [IEEE Conf. on Computer Vision and Pattern Recognition
, 2010
"... We propose a new type of saliency – context-aware saliency – which aims at detecting the image regions that represent the scene. This definition differs from previous definitions whose goal is to either identify fixation points or detect the dominant object. In accordance with our saliency definitio ..."
Abstract
-
Cited by 166 (7 self)
- Add to MetaCart
We propose a new type of saliency – context-aware saliency – which aims at detecting the image regions that represent the scene. This definition differs from previous definitions whose goal is to either identify fixation points or detect the dominant object. In accordance with our saliency definition, we present a detection algorithm which is based on four principles observed in the psychological literature. The benefits of the proposed approach are evaluated in two applications where the context of the dominant objects is just as essential as the objects themselves. In image retargeting we demonstrate that using our saliency prevents distortions in the important regions. In summarization we show that our saliency helps to produce compact, appealing, and informative summaries. 1.
F.: Fast discriminative visual codebooks using randomized clustering forests
- In: Advances in Neural Information Processing Systems
, 2006
"... Some of the most effective recent methods for content-based image classifica-tion work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating his-tograms of the resulting “visual word ” codes over the image, ..."
Abstract
-
Cited by 156 (4 self)
- Add to MetaCart
(Show Context)
Some of the most effective recent methods for content-based image classifica-tion work by extracting dense or sparse local image descriptors, quantizing them according to a coding rule such as k-means vector quantization, accumulating his-tograms of the resulting “visual word ” codes over the image, and classifying these with a conventional classifier such as an SVM. Large numbers of descriptors and large codebooks are needed for good results and this becomes slow using k-means. We introduce Extremely Randomized Clustering Forests – ensembles of randomly created clustering trees – and show that these provide more accurate results, much faster training and testing and good resistance to background clutter in several state-of-the-art image classification tasks. 1
Class-specific hough forests for object detection
- In Proceedings IEEE Conference Computer Vision and Pattern Recognition
, 2009
"... We present a method for the detection of instances of an object class, such as cars or pedestrians, in natural images. Similarly to some previous works, this is accomplished via generalized Hough transform, where the detections of individual object parts cast probabilistic votes for possible locatio ..."
Abstract
-
Cited by 151 (18 self)
- Add to MetaCart
(Show Context)
We present a method for the detection of instances of an object class, such as cars or pedestrians, in natural images. Similarly to some previous works, this is accomplished via generalized Hough transform, where the detections of individual object parts cast probabilistic votes for possible locations of the centroid of the whole object; the detection hypotheses then correspond to the maxima of the Hough image that accumulates the votes from all parts. However, whereas the previous methods detect object parts using generative codebooks of part appearances, we take a more discriminative approach to object part detection. Towards this end, we train a class-specific Hough forest, which is a random forest that directly maps the image patch appearance to the probabilistic vote about the possible location of the object centroid. We demonstrate that Hough forests improve the results of the Hough-transform object detection significantly and achieve state-of-the-art performance for several classes and datasets. 1.
Visual Word Ambiguity
- ACCEPTED IN IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
"... This paper studies automatic image classification by modeling soft-assignment in the popular codebook model. The codebook model describes an image as a bag of discrete visual words selected from a vocabulary, where the frequency distributions of visual words in an image allow classification. One inh ..."
Abstract
-
Cited by 140 (11 self)
- Add to MetaCart
This paper studies automatic image classification by modeling soft-assignment in the popular codebook model. The codebook model describes an image as a bag of discrete visual words selected from a vocabulary, where the frequency distributions of visual words in an image allow classification. One inherent component of the codebook model is the assignment of discrete visual words to continuous image features. Despite the clear mismatch of this hard assignment with the nature of continuous features, the approach has been applied successfully for some years. In this paper we investigate four types of soft-assignment of visual words to image features. We demonstrate that explicitly modeling visual word assignment ambiguity improves classification performance compared to the hard-assignment of the traditional codebook model. The traditional codebook model is compared against our method for five well-known datasets: 15 natural scenes, Caltech-101, Caltech-256, and Pascal VOC 2007/2008. We demonstrate that large codebook vocabulary sizes completely deteriorate the performance of the traditional model, whereas the proposed model performs consistently. Moreover, we show that our method profits in high-dimensional feature spaces and reaps higher benefits when increasing the number of image categories.
Image superresolution as sparse representation of raw image patches
, 2008
"... This paper addresses the problem of generating a superresolution (SR) image from a single low-resolution input image. We approach this problem from the perspective of compressed sensing. The low-resolution image is viewed as downsampled version of a high-resolution image, whose patches are assumed t ..."
Abstract
-
Cited by 135 (19 self)
- Add to MetaCart
(Show Context)
This paper addresses the problem of generating a superresolution (SR) image from a single low-resolution input image. We approach this problem from the perspective of compressed sensing. The low-resolution image is viewed as downsampled version of a high-resolution image, whose patches are assumed to have a sparse representation with respect to an over-complete dictionary of prototype signalatoms. The principle of compressed sensing ensures that under mild conditions, the sparse representation can be correctly recovered from the downsampled signal. We will demonstrate the effectiveness of sparsity as a prior for regularizing the otherwise ill-posed super-resolution problem. We further show that a small set of randomly chosen raw patches from training images of similar statistical nature to the input image generally serve as a good dictionary, in the sense that the computed representation is sparse and the recovered high-resolution image is competitive or even superior in quality to images produced by other SR methods.
Kernel codebooks for scene categorization
- ECCV 2008, PART III. LNCS
, 2008
"... This paper introduces a method for scene categorization by modeling ambiguity in the popular codebook approach. The codebook approach describes an image as a bag of discrete visual codewords, where the frequency distributions of these words are used for image categorization. There are two drawbacks ..."
Abstract
-
Cited by 127 (4 self)
- Add to MetaCart
(Show Context)
This paper introduces a method for scene categorization by modeling ambiguity in the popular codebook approach. The codebook approach describes an image as a bag of discrete visual codewords, where the frequency distributions of these words are used for image categorization. There are two drawbacks to the traditional codebook model: codeword uncertainty and codeword plausibility. Both of these drawbacks stem from the hard assignment of visual features to a single codeword. We show that allowing a degree of ambiguity in assigning codewords improves categorization performance for three state-of-the-art datasets.