Results 1 - 10 of 42
Bundling features for large scale partial-duplicate web image search
, 2009
Abstract - Cited by 81 (3 self)
In state-of-the-art image retrieval systems, an image is represented by a bag of visual words obtained by quantizing high-dimensional local image descriptors, and scalable schemes inspired by text retrieval are then applied for large scale image indexing and retrieval. Bag-of-words representations, however: 1) reduce the discriminative power of image features due to feature quantization; and 2) ignore geometric relationships among visual words. Exploiting such geometric constraints, by estimating a 2D affine transformation between a query image and each candidate image, has been shown to greatly improve retrieval precision but at high computational cost. In this paper we present a novel scheme where image features are bundled into local groups. Each group of bundled features becomes much more discriminative than a single feature, and within each group simple and robust geometric constraints can be efficiently enforced. Experiments in web image search, with a database of more than one million images, show that our scheme achieves a 49% improvement in average precision over the baseline bag-of-words approach. Retrieval performance is comparable to existing full geometric verification approaches while being much less computationally expensive. When combined with full geometric verification we achieve a 77% precision improvement over the baseline bag-of-words approach, and a 24% improvement over full geometric verification alone.
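The bag-of-visual-words baseline that this and the following papers build on can be sketched in a few lines: quantize each local descriptor to its nearest codeword and histogram the counts. This is a minimal illustrative sketch, not the paper's system; the toy codebook and descriptors are hypothetical.

```python
# Minimal bag-of-visual-words sketch (illustrative toy data, not from the paper).

def nearest_word(desc, codebook):
    """Index of the codeword closest to a descriptor (squared Euclidean distance)."""
    best, best_d = 0, float("inf")
    for i, word in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(desc, word))
        if d < best_d:
            best, best_d = i, d
    return best

def bag_of_words(descriptors, codebook):
    """Histogram of visual-word counts for one image."""
    hist = [0] * len(codebook)
    for desc in descriptors:
        hist[nearest_word(desc, codebook)] += 1
    return hist

# Toy 2-D "descriptors" and a 3-word codebook.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
descs = [(0.1, 0.1), (0.9, 0.1), (0.05, 0.95), (0.2, 0.0)]
print(bag_of_words(descs, codebook))  # [2, 1, 1]
```

The quantization step is exactly where the paper notes discriminative power is lost: distinct descriptors that fall in the same Voronoi cell become indistinguishable.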
Descriptive visual words and visual phrases for image applications
- Proc. ACM Multimedia
, 2009
Abstract - Cited by 37 (10 self)
The Bag-of-visual Words (BoW) image representation has been applied to various problems in the fields of multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to the words in texts. However, extensive experiments show that the commonly used visual words are not as expressive as text words, which is undesirable because it hinders their effectiveness in various applications. In this paper, Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) are proposed as the visual correspondences to text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a novel descriptive visual element set can be composed from the visual words and their combinations, which …
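As a rough illustration of the "frequently co-occurring visual word pairs" idea behind visual phrases, one can mine word pairs that appear together in many images. This sketch simplifies to per-image co-occurrence (the paper's DVPs also use spatial proximity and descriptiveness criteria); all names and data here are hypothetical.

```python
from collections import Counter
from itertools import combinations

def mine_phrases(images, min_support=2):
    """Visual 'phrases': word pairs co-occurring in at least min_support images.
    Each image is given as the set of visual-word ids it contains."""
    pair_counts = Counter()
    for words in images:
        for pair in combinations(sorted(set(words)), 2):
            pair_counts[pair] += 1
    return {p for p, c in pair_counts.items() if c >= min_support}

# Four toy images, each a set of visual-word ids.
images = [{1, 2, 3}, {1, 2, 5}, {2, 3, 5}, {1, 2}]
print(sorted(mine_phrases(images)))  # [(1, 2), (2, 3), (2, 5)]
```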
Modeling spatial layout with Fisher vectors for image categorization
- In: ICCV
, 2011
Abstract - Cited by 33 (11 self)
We introduce an extension of bag-of-words image representations to encode spatial layout. Using the Fisher kernel framework we derive a representation that encodes the spatial mean and the variance of image regions associated with visual words. We extend this representation by using a Gaussian mixture model to encode spatial layout, and show that this model is related to a soft-assign version of the spatial pyramid representation. We also combine our representation of spatial layout with the use of Fisher kernels to encode the appearance of local features. Through an extensive experimental evaluation, we show that our representation yields state-of-the-art image categorization results, while being more compact than spatial pyramid representations. In particular, using Fisher kernels to encode both appearance and spatial layout results in an image representation that is computationally efficient, compact, and yields excellent performance while using linear classifiers.
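The per-word spatial statistics at the core of this representation, before any Fisher-kernel encoding, amount to the mean and variance of the positions assigned to each visual word. A minimal sketch, with hypothetical toy data:

```python
def spatial_stats(assignments):
    """Per-word spatial mean and variance of (x, y) feature positions.
    assignments: list of (word_id, x, y) tuples."""
    by_word = {}
    for w, x, y in assignments:
        by_word.setdefault(w, []).append((x, y))
    stats = {}
    for w, pts in by_word.items():
        n = len(pts)
        mx = sum(p[0] for p in pts) / n
        my = sum(p[1] for p in pts) / n
        vx = sum((p[0] - mx) ** 2 for p in pts) / n
        vy = sum((p[1] - my) ** 2 for p in pts) / n
        stats[w] = (mx, my, vx, vy)
    return stats

# Toy features: (visual word id, x, y) with coordinates normalized to [0, 1].
feats = [(0, 0.2, 0.2), (0, 0.4, 0.2), (1, 0.8, 0.9)]
print(spatial_stats(feats))
```

Note this keeps only a fixed-size (mean, variance) summary per word, which is why the resulting representation is more compact than a spatial pyramid.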
Spatial pyramid co-occurrence for image classification
- In: ICCV
, 2011
Abstract - Cited by 24 (0 self)
We describe a novel image representation termed spatial pyramid co-occurrence which characterizes both the photometric and geometric aspects of an image. Specifically, the co-occurrences of visual words are computed with respect to spatial predicates over a hierarchical spatial partitioning of an image. The representation captures both the absolute and relative spatial arrangement of the words and, through the choice and combination of the predicates, can characterize a variety of spatial relationships. Our representation is motivated by the analysis of overhead imagery such as from satellites or aircraft. This imagery generally does not have an absolute reference frame and thus the relative spatial arrangement of the image elements often becomes the key discriminating feature. We validate this hypothesis using a challenging ground truth image dataset of 21 land-use classes manually extracted from high-resolution aerial imagery. Our approach is shown to result in higher classification rates than a non-spatial bag-of-visual-words approach as well as a popular approach for characterizing the absolute spatial arrangement of visual words, the spatial pyramid representation of Lazebnik et al. [7]. While our primary objective is analyzing overhead imagery, we demonstrate that our approach achieves state-of-the-art performance on the Graz-01 object class dataset and performs competitively on the 15 Scene dataset.
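A minimal sketch of counting word co-occurrences over a hierarchical spatial partitioning. Only the "inside the same cell" predicate on a two-level pyramid is shown; the paper's predicate set is richer, and all names and data here are hypothetical.

```python
def pyramid_cooccurrence(features, levels=2):
    """Count visual-word pair co-occurrences inside each cell of a spatial pyramid.
    features: list of (word_id, x, y) with x, y in [0, 1).
    Returns {(level, cell, word_a, word_b): count} with word_a <= word_b."""
    counts = {}
    for level in range(levels):
        grid = 2 ** level  # 1x1 at level 0, 2x2 at level 1, ...
        cells = {}
        for w, x, y in features:
            cell = (int(x * grid), int(y * grid))
            cells.setdefault(cell, []).append(w)
        for cell, words in cells.items():
            for i in range(len(words)):
                for j in range(i + 1, len(words)):
                    a, b = sorted((words[i], words[j]))
                    key = (level, cell, a, b)
                    counts[key] = counts.get(key, 0) + 1
    return counts

# Words 1 and 2 are close together; word 3 is in the opposite corner.
feats = [(1, 0.1, 0.1), (2, 0.2, 0.2), (3, 0.9, 0.9)]
c = pyramid_cooccurrence(feats)
```

At level 0 every pair co-occurs (one global cell), while at level 1 only the nearby pair (1, 2) shares a cell, which is how the pyramid separates relative spatial arrangement from mere joint presence.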
Efficient kernels for identifying unbounded-order spatial features
- In CVPR
, 2009
Abstract - Cited by 17 (4 self)
Higher order spatial features, such as doublets or triplets, have been used to incorporate spatial information into the bag-of-local-features model. Due to computational limits, researchers have only been using features up to the 3rd order, i.e., triplets, since the number of features increases exponentially with the order. We propose an algorithm for identifying high-order spatial features efficiently. The algorithm directly evaluates the inner product of the feature vectors from two images to be compared, identifying all high-order features automatically. The algorithm hence serves as a kernel for any kernel-based learning algorithm. It is based on the idea that if a high-order spatial feature co-occurs in both images, the occurrence of the feature in one image is a translation of the occurrence of the same feature in the other image. This enables us to compute the kernel in time linear in the number of local features in an image (the same as the bag-of-local-features approach), regardless of the order. Therefore, our algorithm does not limit the upper bound of the order as in previous work. Experimental results on the object categorization task show that high-order features can be computed efficiently and provide a significant improvement in object categorization performance.
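The translation idea can be sketched as follows: matched (word, position) pairs across two images are binned by their translation offset, and each bin of n matches implicitly contains 2^n - 1 co-occurring spatial features of every order up to n. This naive version pairs features quadratically (the paper achieves linear time); names and data are hypothetical.

```python
from collections import Counter

def spatial_match_kernel(feats_a, feats_b):
    """Naive sketch of a translation-based match kernel.
    feats: list of (word_id, x, y) on an integer grid."""
    offsets = Counter()
    for wa, xa, ya in feats_a:
        for wb, xb, yb in feats_b:
            if wa == wb:
                offsets[(xb - xa, yb - ya)] += 1
    # A bin with n matches supports 2**n - 1 non-empty subsets, i.e.
    # co-occurring spatial features of every order from 1 up to n.
    return sum(2 ** n - 1 for n in offsets.values())

# Words 1 and 2 appear in both images with the same relative layout,
# shifted by (2, 3): the pair {1, 2} is a shared 2nd-order feature.
a = [(1, 0, 0), (2, 1, 0)]
b = [(1, 2, 3), (2, 3, 3)]
print(spatial_match_kernel(a, b))  # 3: subsets {1}, {2}, {1, 2} at offset (2, 3)
```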
Mining discriminative co-occurrence patterns for visual recognition
- In CVPR: Conf. on Computer Vision and Pattern Recognition
, 2011
Abstract - Cited by 16 (2 self)
The co-occurrence pattern, a combination of binary or local features, is more discriminative than individual features and has shown its advantages in object, scene, and action recognition. We discuss two types of co-occurrence patterns that are complementary to each other: the conjunction (AND) and disjunction (OR) of binary features. A necessary condition for identifying discriminative co-occurrence patterns is first provided. We then propose a novel data mining method to efficiently discover the optimal co-occurrence pattern with minimum empirical error, despite the noisy training dataset. This mining procedure for AND and OR patterns is readily integrated into boosting, which improves the generalization ability over conventional boosting decision trees and boosting decision stumps. Our versatile experiments on object, scene, and action categorization validate the advantages of the discovered discriminative co-occurrence patterns.
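For intuition, the "minimum empirical error AND pattern" objective can be stated as a brute-force search over feature conjunctions; the paper's contribution is mining this efficiently rather than enumerating, so treat this as a definition-by-example with hypothetical names and data.

```python
from itertools import combinations

def best_and_pattern(X, y, max_order=2):
    """Brute-force sketch: the AND-combination of binary features with minimum
    empirical error as a classifier (predict 1 iff all selected features fire).
    X: list of binary feature vectors; y: labels in {0, 1}."""
    n_feats = len(X[0])
    best, best_err = None, float("inf")
    for order in range(1, max_order + 1):
        for idx in combinations(range(n_feats), order):
            preds = [1 if all(x[i] for i in idx) else 0 for x in X]
            err = sum(p != t for p, t in zip(preds, y)) / len(y)
            if err < best_err:
                best, best_err = idx, err
    return best, best_err

# No single feature separates these labels, but the AND of features 0 and 1 does.
X = [[1, 1, 0], [1, 0, 1], [0, 1, 1], [1, 1, 1]]
y = [1, 0, 0, 1]
print(best_and_pattern(X, y))  # ((0, 1), 0.0)
```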
Unsupervised Learning of Hierarchical Spatial Structures In Images
Abstract - Cited by 14 (2 self)
The visual world demonstrates organized spatial patterns: among objects or regions in a scene, object-parts in an object, and low-level features in object-parts. These classes of spatial structures are inherently hierarchical in nature. Although seemingly quite different, these spatial patterns are simply manifestations of different levels in a hierarchy. In this work, we present a unified approach to unsupervised learning of hierarchical spatial structures from a collection of images. Ours is a hierarchical rule-based model capturing spatial patterns, where each rule is represented by a star-graph. We propose an unsupervised EM-style algorithm to learn our model from a collection of images. We show that the inference problem of determining the set of learnt rules instantiated in an image is equivalent to finding the minimum-cost Steiner tree in a directed acyclic graph. We evaluate our approach on a diverse set of datasets of object categories, natural outdoor scenes, and images from complex street scenes with multiple objects.
Spatial pooling of heterogeneous features for image applications
- In: ACM Multimedia
, 2012
Abstract - Cited by 9 (6 self)
… important role for image representation in many multimedia applications. Despite the advantages of this model, there are also notable drawbacks, including the poor semantic expressiveness of local descriptors and the lack of robust structures upon single visual words. To overcome these problems, various techniques have been proposed, such as multiple descriptors, spatial context modeling, and interest region detection. Though they have been proven to improve the BoF model to some extent, there is still no coherent scheme to integrate each individual module. To address the problems above, we propose a novel framework. Our model differs from the …
A Multi-Sample, Multi-Tree Approach to Bag-of-Words Image Representation for Image Retrieval
Abstract - Cited by 7 (0 self)
State-of-the-art content-based image retrieval systems have been significantly advanced by the introduction of SIFT features and the bag-of-words image representation. Converting an image into a bag of words, however, involves three non-trivial steps: feature detection, feature description, and feature quantization. At each of these steps, a significant amount of information is lost, and the resulting visual words are often not discriminative enough for large scale image retrieval applications. In this paper, we propose a novel multi-sample, multi-tree approach to computing the visual word codebook. By encoding more information from the original image feature, our approach generates a much more discriminative visual word codebook that is also efficient in terms of both computation and space consumption, without losing the original repeatability of the visual features. We evaluate our approach using both a ground-truth dataset and a real-world large scale image database. Our results show that a significant improvement in both precision and recall can be achieved by using the codebook derived from our approach.
Visual phraselet: Refining spatial constraints for large scale image search
- IEEE Signal Processing Letters
Abstract - Cited by 7 (6 self)
The Bag-of-Words (BoW) model is prone to a deficiency of spatial constraints among visual words. State-of-the-art methods encode spatial information via visual phrases. However, these methods discard the spatial context among the visual phrases themselves. To address this problem, this letter introduces a novel visual concept, the Visual Phraselet, as a similarity measurement between images. A visual phraselet is a spatially consistent group of visual phrases. In a simple yet effective manner, the visual phraselet filters out false visual phrase matches, and is much more discriminative than both the visual word and the visual phrase. To boost the discovery of visual phraselets, we apply a soft quantization scheme. Our method is evaluated through extensive experiments on three benchmark datasets (Oxford 5K, Paris 6K and Flickr 1M). We report significant improvements as large as 54.6% over the baseline approach, thus validating the concept of the visual phraselet. Index Terms—Image search, spatial constraint, visual phrase, visual phraselet.