CiteSeerX
On feature combination for multiclass object classification

by Peter Gehler, Sebastian Nowozin
Venue: ICCV
Results 1 - 10 of 259

VLFeat -- An open and portable library of computer vision algorithms

by Andrea Vedaldi, et al., 2010

Cited by 526 (10 self)
Abstract not found

Citation Context

...s that the PHOW features, efficiently computed using VLFeat, are state-of-the-art image descriptors (better classification results can be obtained by combining a variety of different descriptors [5]). 5. DESIGN Granularity. VLFeat avoids implementing specific applications (e.g. there is no face detector); instead it focuses on implementing algorithms (feature detectors, clustering, etc.) that c...

Learning mid-level features for recognition

by Y-Lan Boureau, Francis Bach, Yann LeCun, Jean Ponce, 2010

Cited by 228 (13 self)
Many successful models for scene or object recognition transform low-level descriptors (such as Gabor filter responses, or SIFT descriptors) into richer representations of intermediate complexity. This process can often be broken down into two steps: (1) a coding step, which performs a pointwise transformation of the descriptors into a representation better adapted to the task, and (2) a pooling step, which summarizes the coded features over larger neighborhoods. Several combinations of coding and pooling schemes have been proposed in the literature. The goal of this paper is threefold. We seek to establish the relative importance of each step of mid-level feature extraction through a comprehensive cross evaluation of several types of coding modules (hard and soft vector quantization, sparse coding) and pooling schemes (by taking the average, or the maximum), which obtains state-of-the-art performance or better on several recognition benchmarks. We show how to improve the best performing coding scheme by learning a supervised discriminative dictionary for sparse coding. We provide theoretical and empirical insight into the remarkable performance of max pooling. By teasing apart components shared by modern mid-level feature extractors, our approach aims to facilitate the design of better recognition architectures.
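The coding and pooling steps described in this abstract can be sketched concretely. The minimal example below is an illustrative stand-in, not the paper's implementation: it uses hard vector quantization for the coding step and compares average and max pooling; the descriptors and dictionary are random placeholders.

```python
import numpy as np

def hard_vq_code(descriptors, dictionary):
    """Coding step: map each local descriptor to a one-hot vector over
    its nearest dictionary atom (hard vector quantization)."""
    # pairwise squared distances, shape (n_descriptors, n_atoms)
    d2 = ((descriptors[:, None, :] - dictionary[None, :, :]) ** 2).sum(-1)
    codes = np.zeros((descriptors.shape[0], dictionary.shape[0]))
    codes[np.arange(descriptors.shape[0]), d2.argmin(axis=1)] = 1.0
    return codes

def pool(codes, how):
    """Pooling step: summarize all codes in a neighborhood into one vector."""
    return codes.max(axis=0) if how == "max" else codes.mean(axis=0)

rng = np.random.default_rng(0)
descs = rng.normal(size=(100, 128))   # stand-ins for 100 SIFT-like descriptors
atoms = rng.normal(size=(16, 128))    # a random 16-atom dictionary
codes = hard_vq_code(descs, atoms)
feat_max = pool(codes, "max")         # max pooling
feat_avg = pool(codes, "avg")         # average pooling (a normalized histogram)
```

With hard quantization, average pooling recovers the classical bag-of-words histogram, while max pooling records only whether each atom fired anywhere in the region.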

Citation Context

...riptors on the same dataset are shown in Table 2. Note that better performance has been reported with multiple descriptor types (e.g., methods using multiple kernel learning have achieved 77.7% ± 0.3 [7] and 78.0% ± 0.3 [28, 2] on Caltech-101 with 30 training examples), or subcategory learning (83% on Caltech-101 [26]). The coding and pooling module combinations used in [27, 31] are included in our co...

Efficient Object Category Recognition Using Classemes

by Lorenzo Torresani, Martin Szummer, Andrew Fitzgibbon
"... Abstract. We introduce a new descriptor for images which allows the construction of efficient and compact classifiers with good accuracy on object category recognition. The descriptor is the output of a large number of weakly trained object category classifiers on the image. The trained categories a ..."
Abstract - Cited by 122 (9 self) - Add to MetaCart
Abstract. We introduce a new descriptor for images which allows the construction of efficient and compact classifiers with good accuracy on object category recognition. The descriptor is the output of a large number of weakly trained object category classifiers on the image. The trained categories are selected from an ontology of visual concepts, but the intention is not to encode an explicit decomposition of the scene. Rather, we accept that existing object category classifiers often encode not the category per se but ancillary image characteristics; and that these ancillary characteristics can combine to represent visual classes unrelated to the constituent categories' semantic meanings. The advantage of this descriptor is that it allows object-category queries to be made against image databases using efficient classifiers (efficient at test time) such as linear support vector machines, and allows these queries to be for novel categories. Even when the representation is reduced to 200 bytes per image, classification accuracy on object category recognition is comparable with the state of the art (36% versus 42%), but at orders of magnitude lower computational cost.
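The core idea of this descriptor, using the outputs of many category classifiers as the image representation itself, can be sketched as follows. The "weak classifiers" here are random linear scorers rather than trained category models (a labeled assumption), and the final sign binarization illustrates how such a representation can be compressed to a few bytes per image.

```python
import numpy as np

rng = np.random.default_rng(1)
n_categories = 50   # number of weakly trained category classifiers (hypothetical)
dim = 300           # dimension of the underlying low-level image features

# Stand-ins for pre-trained weak classifiers: one linear scorer per category.
W = rng.normal(size=(n_categories, dim))

def category_score_descriptor(image_features):
    """The descriptor is the vector of all category-classifier outputs on
    the image; the categories' semantics are never interpreted directly."""
    return W @ image_features

x = rng.normal(size=dim)              # stand-in features for one image
d = category_score_descriptor(x)      # 50-D descriptor of classifier scores
bits = d > 0                          # sign-binarized: ~n_categories/8 bytes
```

A fast linear classifier (e.g. a linear SVM) trained on `d` or `bits` can then answer queries for novel categories without re-running any expensive feature pipeline.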

Citation Context

...am of visual words is stored, d is the minimum of the number of words detected per image and the vocabulary size. For a GIST descriptor [19], d is of the order of 1000. For multiple-kernel techniques [6], d might be of the order of 20,000. For the system in this paper, d can be as low as 1500, while still leveraging all the descriptors used in the multiple-kernel technique. Note that although we sha...

What does classifying more than 10,000 image categories tell us?

by Jia Deng, Alexander C. Berg, Kai Li, Li Fei-Fei
"... Image classification is a critical task for both humans and computers. One of the challenges lies in the large scale of the semantic space. In particular, humans can recognize tens of thousands of object classes and scenes. No computer vision algorithm today has been tested at this scale. This pape ..."
Abstract - Cited by 118 (11 self) - Add to MetaCart
Image classification is a critical task for both humans and computers. One of the challenges lies in the large scale of the semantic space. In particular, humans can recognize tens of thousands of object classes and scenes. No computer vision algorithm today has been tested at this scale. This paper presents a study of large scale categorization including a series of challenging experiments on classification with more than 10,000 image classes. We find that a) computational issues become crucial in algorithm design; b) conventional wisdom from a couple of hundred image categories on relative performance of different classifiers does not necessarily hold when the number of categories increases; c) there is a surprisingly strong relationship between the structure of WordNet (developed for studying language) and the difficulty of visual categorization; d) classification can be improved by exploiting the semantic hierarchy. Toward the future goal of developing automatic vision algorithms to recognize tens of thousands or even millions of image categories, we make a series of observations and arguments about dataset scale, category density, and image hierarchy.

Citation Context

...BoW or histogram of oriented gradient (HOG) [18, 4] features. In the current state-of-the-art, multiple descriptors and kernels are combined using either ad hoc or multiple kernel learning approaches [19, 5, 20, 21]. Work in machine learning supports using winner-takes-all between 1-vs-all classifiers for the final multi-class classification deci...

Struck: Structured Output Tracking with Kernels

by Sam Hare, Amir Saffari, Philip H. S. Torr
"... Adaptive tracking-by-detection methods are widely used in computer vision for tracking arbitrary objects. Current approaches treat the tracking problem as a classification task and use online learning techniques to update the object model. However, for these updates to happen one needs to convert th ..."
Abstract - Cited by 111 (4 self) - Add to MetaCart
Adaptive tracking-by-detection methods are widely used in computer vision for tracking arbitrary objects. Current approaches treat the tracking problem as a classification task and use online learning techniques to update the object model. However, for these updates to happen one needs to convert the estimated object position into a set of labelled training examples, and it is not clear how best to perform this intermediate step. Furthermore, the objective for the classifier (label prediction) is not explicitly coupled to the objective for the tracker (accurate estimation of object position). In this paper, we present a framework for adaptive visual object tracking based on structured output prediction. By explicitly allowing the output space to express the needs of the tracker, we are able to avoid the need for an intermediate classification step. Our method uses a kernelized structured output support vector machine (SVM), which is learned online to provide adaptive tracking. To allow for real-time application, we introduce a budgeting mechanism which prevents the unbounded growth in the number of support vectors which would otherwise occur during tracking. Experimentally, we show that our algorithm is able to outperform state-of-the-art trackers on various benchmark videos. Additionally, we show that we can easily incorporate additional features and kernels into our framework, which results in increased performance.

Citation Context

...rs in a square image. Such an approach can be considered a basic form of multiple kernel learning (MKL), and indeed it has been shown [9] that, in terms of performance, full MKL (in which the relative weighting of the different kernels is learned from training data) does not provide a great deal of improvement over this simple approach. ...
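The "simple approach" this snippet contrasts with full MKL is combining kernels without learning their weights, i.e. a uniform kernel average. A minimal sketch, with illustrative stand-in data:

```python
import numpy as np

def average_kernel(kernels):
    """Uniform average of precomputed kernel matrices. Full MKL would
    instead learn a per-kernel weight from training data."""
    return sum(kernels) / len(kernels)

rng = np.random.default_rng(2)
X1 = rng.normal(size=(20, 5))   # feature set 1 for 20 images (stand-in)
X2 = rng.normal(size=(20, 8))   # feature set 2 for the same images
K1 = X1 @ X1.T                  # linear kernel on feature set 1
# RBF kernel on feature set 2
K2 = np.exp(-0.1 * ((X2[:, None] - X2[None]) ** 2).sum(-1))
K = average_kernel([K1, K2])    # combined kernel for any kernel classifier
```

Since a sum of positive semidefinite matrices is positive semidefinite, the averaged `K` remains a valid kernel and can be passed to any kernel classifier that accepts a precomputed Gram matrix.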

Object, scene and actions: combining multiple features for human action recognition

by Nazli Ikizler-Cinbis, Stan Sclaroff - In ECCV, 2010

Cited by 74 (1 self)
Abstract. In many cases, human actions can be identified not only by the singular observation of the human body in motion, but also properties of the surrounding scene and the related objects. In this paper, we look into this problem and propose an approach for human action recognition that integrates multiple feature channels from several entities such as objects, scenes and people. We formulate the problem in a multiple instance learning (MIL) framework, based on multiple feature channels. By using a discriminative approach, we join multiple feature channels embedded in the MIL space. Our experiments over the large YouTube dataset show that scene and object information can be used to complement person features for human action recognition.

Citation Context

...ond method, which employs a joint formulation for learning the global weights for feature channels. This global weighting is analogous to learning the kernel weights in multiple kernel learning (MKL) [11]. In MKL, the task is to select informative kernels, whereas here we try to select informative feature channels. We formulate the optimization as follows: ...

Kernel Descriptors for Visual Recognition

by Liefeng Bo, Xiaofeng Ren, Dieter Fox
"... The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show th ..."
Abstract - Cited by 69 (13 self) - Add to MetaCart
The design of low-level image features is critical for computer vision algorithms. Orientation histograms, such as those in SIFT [16] and HOG [3], are the most successful and popular features for visual object and scene recognition. We highlight the kernel view of orientation histograms, and show that they are equivalent to a certain type of match kernels over image patches. This novel view allows us to design a family of kernel descriptors which provide a unified and principled framework to turn pixel attributes (gradient, color, local binary pattern, etc.) into compact patch-level features. In particular, we introduce three types of match kernels to measure similarities between image patches, and construct compact low-dimensional kernel descriptors from these match kernels using kernel principal component analysis (KPCA) [23]. Kernel descriptors are easy to design and can turn any type of pixel attribute into patch-level features. They outperform carefully tuned and sophisticated features including SIFT and deep belief networks. We report superior performance on standard image classification benchmarks: Scene-15, Caltech-101, CIFAR10 and CIFAR10-ImageNet.
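The KPCA step that turns match kernels into compact descriptors can be sketched under simplifying assumptions: the Gaussian kernel below is a stand-in for the paper's match kernels, and the random "patch attributes" are placeholders. The sketch centers the kernel matrix and keeps the leading kernel principal components as the low-dimensional descriptor.

```python
import numpy as np

def kpca_embed(K, n_components):
    """Kernel PCA: center the kernel matrix, eigendecompose it, and keep
    the leading components as compact low-dimensional descriptors."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    Kc = H @ K @ H
    vals, vecs = np.linalg.eigh(Kc)              # ascending eigenvalues
    idx = np.argsort(vals)[::-1][:n_components]  # keep the largest ones
    return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

rng = np.random.default_rng(3)
patches = rng.normal(size=(60, 40))              # stand-in patch attributes
# a Gaussian kernel stand-in for a match kernel between all pairs of patches
K = np.exp(-0.05 * ((patches[:, None] - patches[None]) ** 2).sum(-1))
Z = kpca_embed(K, n_components=10)               # 60 patches -> 10-D each
```

Each row of `Z` is then a compact descriptor whose inner products approximate the (centered) kernel similarities between patches.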

Citation Context

...rns. Match kernels defined over various pixel attributes provide a unified way to generate a rich, diverse visual feature set, which has been shown to be very successful in boosting recognition accuracy [6]. As validated by our own experiments, gradient, color and shape match kernels are strong in their own right and complement one another. Their combination turns out to be always (much) better than the ...

Visual classification with multi-task joint sparse representation

by Xiao-tong Yuan, Xiaobai Liu, Shuicheng Yan - In CVPR, 2010

Cited by 66 (1 self)
Abstract — We address the problem of visual classification with multiple features and/or multiple instances. Motivated by the recent success of multitask joint covariate selection, we formulate this problem as a multitask joint sparse representation model to combine the strength of multiple features and/or instances for recognition. A joint sparsity-inducing norm is utilized to enforce class-level joint sparsity patterns among the multiple representation vectors. The proposed model can be efficiently optimized by a proximal gradient method. Furthermore, we extend our method to the setup where features are described in kernel matrices. We then investigate two applications of our method to visual classification: 1) fusing multiple kernel features for object categorization and 2) robust face recognition in video with an ensemble of query images. Extensive experiments on challenging real-world data sets demonstrate that the proposed method is competitive with the state-of-the-art methods in respective applications. Index Terms — Feature fusion, multitask learning, sparse representation, visual classification.
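Class-level joint sparsity of the kind described here is commonly enforced with the l2,1 norm, whose proximal operator makes a proximal-gradient method straightforward. The sketch below is a generic illustration on synthetic data, not the paper's MTJSRC solver: it solves a small multi-task least-squares problem with an l2,1 penalty via ISTA, so that the tasks share a common set of active rows.

```python
import numpy as np

def prox_l21(W, t):
    """Proximal operator of the l2,1 norm: shrink each row of W toward
    zero in l2 norm, producing row-wise (joint) sparsity."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W * np.maximum(1.0 - t / np.maximum(norms, 1e-12), 0.0)

# Tiny proximal-gradient (ISTA) loop for
#   min_W  sum_k 0.5 * ||y_k - X_k W[:, k]||^2  +  lam * ||W||_{2,1}
rng = np.random.default_rng(4)
n_tasks, n, p = 3, 30, 15
Xs = [rng.normal(size=(n, p)) for _ in range(n_tasks)]
W_true = np.zeros((p, n_tasks))
W_true[:4] = rng.normal(size=(4, n_tasks))       # 4 shared active rows
ys = [Xs[k] @ W_true[:, k] for k in range(n_tasks)]

W, step, lam = np.zeros((p, n_tasks)), 0.005, 0.1
for _ in range(500):
    G = np.column_stack([Xs[k].T @ (Xs[k] @ W[:, k] - ys[k])
                         for k in range(n_tasks)])
    W = prox_l21(W - step * G, step * lam)
```

Because the penalty acts on whole rows of `W`, a feature is either used by all tasks or dropped by all of them, which is the "class-level joint sparsity pattern" the abstract refers to.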

Citation Context

...ors for object categorization. Our experimental results on some benchmark data sets show that MTJSRC is competitive with several state-of-the-art multiple kernel learning methods for object recognition [25]–[27]. 2) Face recognition for dynamic videos. In this application, we assume that each video sequence contains only one single subject. Given an input video, we first employ a face detector [28] to l...

Ask the locals: multi-way local pooling for image recognition

by Y-Lan Boureau, Nicolas Le Roux, Francis Bach, Jean Ponce, Yann LeCun - In ICCV, 2011

Cited by 63 (3 self)
Abstract not found

Citation Context

...formance has been reported with color images (e.g., 78.5% ± 0.4 with a saliency-based approach [22]), multiple descriptor types (e.g., methods using multiple kernel learning have achieved 77.7% ± 0.3 [12], 78.0% ± 0.3 [2, 37] and 84.3% [40] on Caltech-101 with 30 training examples), or subcategory learning (83% on Caltech-101 [35]). On the Scenes benchmark, preclustering does improve results for small...

Exploiting weakly-labeled Web images to improve object classification: a domain adaptation approach

by Alessandro Bergamo, Lorenzo Torresani
"... Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of different object c ..."
Abstract - Cited by 61 (0 self) - Add to MetaCart
Most current image categorization methods require large collections of manually annotated training examples to learn accurate visual recognition models. The time-consuming human labeling effort effectively limits these approaches to recognition problems involving a small number of different object classes. In order to address this shortcoming, in recent years several authors have proposed to learn object classifiers from weakly-labeled Internet images, such as photos retrieved by keyword-based image search engines. While this strategy eliminates the need for human supervision, the recognition accuracies of these methods are considerably lower than those obtained with fully-supervised approaches, because of the noisy nature of the labels associated with Web data. In this paper we investigate and compare methods that learn image classifiers by combining very few manually annotated examples (e.g., 1-10 images per class) and a large number of weakly-labeled Web photos retrieved using keyword-based image search. We cast this as a domain adaptation problem: given a few strongly-labeled examples in a target domain (the manually annotated examples) and many source domain examples (the weakly-labeled Web photos), learn classifiers yielding small generalization error on the target domain. Our experiments demonstrate that, for the same number of strongly-labeled examples, our domain adaptation approach produces significant recognition rate improvements over the best published results (e.g., 65% better when using 5 labeled training examples per class) and that our classifiers are one order of magnitude faster to learn and to evaluate than the best competing method, despite our use of large weakly-labeled data sets.

Citation Context

...efficient learning and test evaluation. The current best published results on Caltech-256 were obtained by a kernel combination classifier using 39 different feature kernels, one for each feature type [13]. However, since both training as well as testing are computationally very expensive with this classifier, this model is unsuitable for our needs. Instead, in this work we use as image representation t...


Developed at and hosted by The College of Information Sciences and Technology

© 2007-2019 The Pennsylvania State University