Results 1 - 10 of 101
LDAHash: Improved matching with smaller descriptors, 2010
"... SIFT-like local feature descriptors are ubiquitously employed in such computer vision applications as content-based retrieval, video analysis, copy detection, object recognition, photo-tourism and 3D reconstruction. Feature descriptors can be designed to be invariant to certain classes of photometri ..."
Abstract
-
Cited by 80 (10 self)
- Add to MetaCart
(Show Context)
SIFT-like local feature descriptors are ubiquitously employed in such computer vision applications as content-based retrieval, video analysis, copy detection, object recognition, photo-tourism and 3D reconstruction. Feature descriptors can be designed to be invariant to certain classes of photometric and geometric transformations, in particular, affine and intensity scale transformations. However, real transformations that an image can undergo can only be approximately modeled in this way, and thus most descriptors are only approximately invariant in practice. Second, descriptors are usually high-dimensional (e.g., SIFT is represented as a 128-dimensional vector). In large-scale retrieval and matching problems, this can pose challenges in storing and retrieving descriptor data. We map the descriptor vectors into the Hamming space, in which the Hamming metric is used to compare the resulting representations. This way, we reduce the size of the descriptors by representing them as short binary strings and learn descriptor invariance from examples. We show extensive experimental validation, demonstrating the advantage of the proposed approach.
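The core mechanism is easy to sketch: project each descriptor with a linear map, threshold to bits, and compare codes with the Hamming metric. In the minimal sketch below the projection and thresholds are random placeholders, not the discriminatively learned ones the paper proposes; all names and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
D, B = 128, 64                       # SIFT dimension, number of hash bits
P = rng.standard_normal((B, D))      # placeholder for the learned projection
t = np.zeros(B)                      # placeholder for the learned thresholds

def hash_descriptor(x):
    """Map a D-dimensional descriptor to a B-bit binary code."""
    return (P @ x + t > 0).astype(np.uint8)

def hamming(a, b):
    """Hamming distance between two binary codes."""
    return int(np.count_nonzero(a != b))

x, y = rng.standard_normal(D), rng.standard_normal(D)
print(hamming(hash_descriptor(x), hash_descriptor(y)))
```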
Multimodal feature fusion for robust event detection in web videos. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012
"... Abstract Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large scale analysis of web vid ..."
Abstract
-
Cited by 37 (0 self)
- Add to MetaCart
Combining multiple low-level visual features is a proven and effective strategy for a range of computer vision tasks. However, limited attention has been paid to combining such features with information from other modalities, such as audio and videotext, for large-scale analysis of web videos. In our work, we rigorously analyze and combine a large set of low-level features that capture appearance, color, motion, audio and audio-visual co-occurrence patterns in videos. We also evaluate the utility of high-level (i.e., semantic) visual information obtained from detecting scene, object, and action concepts. Further, we exploit multimodal information by analyzing available spoken and videotext content using state-of-the-art automatic speech recognition (ASR) and videotext recognition systems. We combine these diverse features using a two-step strategy employing multiple kernel learning (MKL) and late score-level fusion methods. Based on the TRECVID MED 2011 evaluations for detecting 10 events in a large benchmark set of ∼45,000 videos, our system showed the best performance among the 19 international teams.
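The late score-level fusion stage can be sketched as a weighted sum of per-modality detection scores. The modality names, scores and weights below are illustrative assumptions; the paper learns the combination rather than fixing weights by hand.

```python
def fuse_scores(modality_scores, weights):
    """Weighted late fusion of per-modality event-detection scores."""
    assert set(modality_scores) == set(weights)
    return sum(weights[m] * modality_scores[m] for m in modality_scores)

# Hypothetical scores for one video and one event; weights are illustrative.
scores = {"visual": 0.72, "audio": 0.41, "asr": 0.88, "videotext": 0.15}
weights = {"visual": 0.4, "audio": 0.2, "asr": 0.3, "videotext": 0.1}
print(fuse_scores(scores, weights))  # 0.649
```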
Unified real-time tracking and recognition with rotation-invariant fast features. In: Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR’10), 2010
"... We present a method that unifies tracking and video content recognition with applications to Mobile Augmented Reality (MAR). We introduce the Radial Gradient Transform (RGT) and an approximate RGT, yielding the Rotation-Invariant, Fast Feature (RIFF) descriptor. We demonstrate that RIFF is fast enou ..."
Abstract
-
Cited by 33 (11 self)
- Add to MetaCart
(Show Context)
We present a method that unifies tracking and video content recognition with applications to Mobile Augmented Reality (MAR). We introduce the Radial Gradient Transform (RGT) and an approximate RGT, yielding the Rotation-Invariant, Fast Feature (RIFF) descriptor. We demonstrate that RIFF is fast enough for real-time tracking, while robust enough for large-scale retrieval tasks. At 26× the speed, our tracking scheme obtains a more accurate global affine motion model than the Kanade-Lucas-Tomasi (KLT) tracker. The same descriptors can achieve 94% retrieval accuracy from a database of 10^4 images.
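The rotation-invariance idea behind the RGT can be sketched as expressing each pixel's gradient in a local radial/tangential frame relative to the patch center, so that rotating the patch permutes pixel positions but leaves the gradient coordinates unchanged. This illustrates the exact transform only, not the approximate RGT or the full RIFF pipeline.

```python
import numpy as np

def radial_gradients(patch):
    """Gradients of a square patch in radial/tangential coordinates."""
    gy, gx = np.gradient(patch.astype(float))
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    theta = np.arctan2(ys - (h - 1) / 2.0, xs - (w - 1) / 2.0)
    # Radial and tangential unit vectors at every pixel.
    rad = np.stack([np.cos(theta), np.sin(theta)], axis=-1)
    tan = np.stack([-np.sin(theta), np.cos(theta)], axis=-1)
    g = np.stack([gx, gy], axis=-1)
    return (g * rad).sum(-1), (g * tan).sum(-1)

g_r, g_t = radial_gradients(np.random.default_rng(1).random((32, 32)))
```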
Mobile Visual Search. In: IEEE Signal Processing Magazine, Special Issue on Mobile Media Search
"... MOBILE phones have evolved into powerful image and video processing devices, equipped with highresolution cameras, color displays, and hardware-accelerated graphics. They are increasingly also equipped with GPS, and connected to broadband wireless networks. All this enables a new class of applicatio ..."
Abstract
-
Cited by 26 (7 self)
- Add to MetaCart
Mobile phones have evolved into powerful image and video processing devices, equipped with high-resolution cameras, color displays, and hardware-accelerated graphics. They are increasingly also equipped with GPS, and connected to broadband wireless networks. All this enables a new class of applications that use the camera phone to initiate search queries about objects in visual proximity to the user (Fig. 1). Such applications can be used, e.g., for identifying products, comparison shopping, finding information about movies, CDs, real estate, print media or artworks. First deployments of such systems include Google Goggles [1], Nokia Point and Find [2], Kooaba [3], Ricoh iCandy [4], [5], [6] and Amazon Snaptell [7]. Mobile image retrieval applications pose a unique set of challenges. What part of the processing should be performed …
Mobile product search with bag of hash bits. In: Demo session of ACM MM, 2011
"... Rapidly growing applications on smartphones have provided an excellent platform for mobile visual search. Most of previous visual search systems adopt the framework of ”Bag of Words”, in which words indicate quantized codes of visual features. In this work, we propose a novel visual search system ba ..."
Abstract
-
Cited by 20 (2 self)
- Add to MetaCart
(Show Context)
Rapidly growing applications on smartphones have provided an excellent platform for mobile visual search. Most previous visual search systems adopt the framework of "Bag of Words", in which words indicate quantized codes of visual features. In this work, we propose a novel visual search system based on "Bag of Hash Bits" (BoHB), in which each local feature is encoded to a very small number of hash bits, instead of quantized to visual words, and the whole image is represented as a bag of hash bits. The proposed BoHB method offers unique benefits in solving the challenges associated with mobile visual search, e.g., low transmission cost and cheap memory and computation on the mobile side. Moreover, our BoHB method leverages the distinct properties of hash bits, such as multi-table indexing, multiple bucket probing, bit reuse, and Hamming-distance-based ranking, to achieve efficient search over gigantic visual databases. The proposed method significantly outperforms state-of-the-art mobile visual search methods like CHoG, and other (conventional desktop) visual search approaches like bag of words via vocabulary tree, or product quantization. The proposed BoHB approach is easy to implement on mobile devices, and general in the sense that it can be applied to different types of local features, hashing algorithms and image databases. We also incorporate a boundary feature in the reranking step to describe object shapes, complementing the local features that are usually used to characterize local details. The boundary feature can further filter out noisy results and improve search performance, especially at the coarse category level. Extensive experiments over large-scale data sets of up to 400k product images demonstrate the effectiveness of our approach.
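The multi-table indexing and bucket probing can be sketched with random hyperplane hashing: each local feature gets a short bit code per table, and a query feature collects candidates from the matching bucket of every table. Table count, bit length and the hashing scheme below are illustrative assumptions, not the paper's exact configuration; multiple-bucket probing would additionally visit buckets within a small Hamming radius.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
D, BITS, TABLES = 128, 16, 4                     # illustrative sizes
planes = [rng.standard_normal((BITS, D)) for _ in range(TABLES)]
tables = [defaultdict(list) for _ in range(TABLES)]

def bits(x, P):
    """Short hash code of one local feature under hyperplanes P."""
    return tuple((P @ x > 0).astype(np.uint8))

def index_feature(feat_id, x):
    """Insert a database feature into every hash table."""
    for P, table in zip(planes, tables):
        table[bits(x, P)].append(feat_id)

def probe(x):
    """Collect candidate ids from the matching bucket of each table."""
    cands = set()
    for P, table in zip(planes, tables):
        cands.update(table[bits(x, P)])
    return cands

for i in range(100):                             # index toy database features
    index_feature(i, rng.standard_normal(D))
print(len(probe(rng.standard_normal(D))))
```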
Compact video description for copy detection with precise temporal alignment
"... Abstract. This paper introduces a very compact yet discriminative video description, which allows example-based search in a large number of frames corresponding to thousands of hours of video. Our description extracts one descriptor per indexed video frame by aggregating a set of local descriptors. ..."
Abstract
-
Cited by 14 (3 self)
- Add to MetaCart
(Show Context)
This paper introduces a very compact yet discriminative video description, which allows example-based search in a large number of frames corresponding to thousands of hours of video. Our description extracts one descriptor per indexed video frame by aggregating a set of local descriptors. These frame descriptors are encoded using a time-aware hierarchical indexing structure. A modified temporal Hough voting scheme is used to rank the retrieved database videos and estimate segments in them that match the query. If we use a dense temporal description of the videos, matched video segments are localized with excellent precision. Experimental results on the TRECVID 2008 copy detection task and a set of 38,000 videos from YouTube show that our method offers an excellent trade-off between search accuracy, efficiency and memory usage.
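Temporal Hough voting can be sketched as follows: each frame-level match between a query frame and a database frame votes for their time offset, and a peak in the vote histogram gives the alignment. The matching stage itself (aggregated frame descriptors, indexing) is elided; the data below are made up for illustration.

```python
from collections import Counter

def hough_align(frame_matches):
    """frame_matches: (query_time, db_time, score) tuples from frame search."""
    votes = Counter()
    for tq, tdb, score in frame_matches:
        votes[tdb - tq] += score          # vote for the temporal offset
    offset, strength = votes.most_common(1)[0]
    return offset, strength

print(hough_align([(0, 10, 1.0), (1, 11, 0.8), (2, 13, 0.2)]))  # (10, 1.8)
```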
Location Discriminative Vocabulary Coding for Mobile Landmark Search. In: Int. J. Comput. Vis., 2011
"... With the popularization of mobile devices, recent years have witnessed an emerging potential for mobile landmark search. In this scenario, the user experience heavily depends on the efficiency of query transmission over a wireless link. As sending a query photo is time consuming, recent works have ..."
Abstract
-
Cited by 13 (2 self)
- Add to MetaCart
With the popularization of mobile devices, recent years have witnessed an emerging potential for mobile landmark search. In this scenario, the user experience heavily depends on the efficiency of query transmission over a wireless link. As sending a query photo is time consuming, recent works have proposed to extract compact visual descriptors directly on the mobile end towards low-bit-rate transmission. Typically, these descriptors are extracted based solely on the visual content of a query, and the location cues from the mobile end are rarely exploited. In this paper, we present a Location Discriminative Vocabulary Coding (LDVC) scheme, which achieves extremely low-bit-rate query transmission, discriminative landmark description, as well as scalable descriptor delivery in a unified framework. Our first contribution is a compact and location-discriminative visual landmark descriptor, which is learnt offline in two steps: First, we adopt spectral clustering to segment a city map into distinct geographical regions, where both visual and geographical similarities are fused to optimize the partition of city-scale geo-tagged photos. Second, we propose to learn LDVC in each region with two schemes: (1) a Ranking Sensitive …
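The region-partitioning step, spectral clustering over fused visual and geographic similarities, can be sketched with an off-the-shelf implementation. The fusion weight, cluster count and the scikit-learn usage below are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def partition_city(vis_sim, geo_sim, n_regions=8, alpha=0.5):
    """Cluster geo-tagged photos into regions from fused affinities.

    vis_sim, geo_sim: (n, n) symmetric affinity matrices in [0, 1].
    """
    W = alpha * vis_sim + (1.0 - alpha) * geo_sim   # assumed linear fusion
    sc = SpectralClustering(n_clusters=n_regions, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(W)

rng = np.random.default_rng(0)
A = rng.random((40, 40))
labels = partition_city((A + A.T) / 2, np.ones((40, 40)), n_regions=4)
```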
Quantized embeddings of scale-invariant image features for mobile augmented reality. In: Proc. IEEE International Workshop on Multimedia Signal Processing (MMSP), 2012
"... Randomized embeddings of scale-invariant image features are proposed for retrieval of object-specific meta data in an augmented reality application. The method extracts scale invariant features from a query image, computes a small number of quantized random projections of these features, and sends ..."
Abstract
-
Cited by 11 (8 self)
- Add to MetaCart
Randomized embeddings of scale-invariant image features are proposed for retrieval of object-specific metadata in an augmented reality application. The method extracts scale-invariant features from a query image, computes a small number of quantized random projections of these features, and sends them to a database server. The server performs a nearest neighbor search in the space of the random projections and returns metadata corresponding to the query image. Prior work has shown that binary embeddings of image features enable efficient image retrieval. This paper generalizes the prior art by characterizing the tradeoff between the number of random projections and the number of bits used to represent each projection. The theoretical results suggest a bit allocation scheme under a total bit rate constraint: it is often advisable to spend bits on a small number of finely quantized random measurements rather than on a large number of coarsely quantized random measurements. This theoretical result is corroborated via an experimental study of the above tradeoff using the ZuBuD database. The proposed scheme achieves a retrieval accuracy of up to 94% while requiring the mobile device to transmit only 2.5 kB to the database server, a significant improvement over the 1-bit quantization schemes reported in prior art.
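The bit-allocation tradeoff can be made concrete with a small sketch: under a total budget of m·b bits, project to m Gaussian measurements and uniformly quantize each to b bits. The clipping range and quantizer below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def quantized_embedding(x, m, b, rng, lim=3.0):
    """m Gaussian random projections, each uniformly quantized to b bits."""
    A = rng.standard_normal((m, x.size)) / np.sqrt(m)
    y = np.clip(A @ x, -lim, lim)                 # assumed clipping range
    levels = 2 ** b
    q = np.floor((y + lim) / (2 * lim) * levels)  # uniform quantizer
    return q.clip(0, levels - 1).astype(int)

rng = np.random.default_rng(0)
x = rng.standard_normal(128)
# Same 64-bit budget spent two ways: many coarse vs. few fine measurements.
coarse = quantized_embedding(x, m=64, b=1, rng=rng)
fine = quantized_embedding(x, m=16, b=4, rng=rng)
```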
The Stanford Mobile Visual Search Data Set
"... We survey popular data sets used in computer vision literature and point out their limitations for mobile visual search applications. To overcome many of the limitations, we propose the Stanford Mobile Visual Search data set. The data set contains camera-phone images of products, CDs, books, outdoor ..."
Abstract
-
Cited by 11 (2 self)
- Add to MetaCart
(Show Context)
We survey popular data sets used in computer vision literature and point out their limitations for mobile visual search applications. To overcome many of the limitations, we propose the Stanford Mobile Visual Search data set. The data set contains camera-phone images of products, CDs, books, outdoor landmarks, business cards, text documents, museum paintings and video clips. The data set has several key characteristics lacking in existing data sets: rigid objects, widely varying lighting conditions, perspective distortion, foreground and background clutter, realistic ground-truth reference data, and query data collected from heterogeneous low and high-end camera phones. We hope that the data set will help push research forward in the field of mobile visual search.
Efficient Discriminative Projections for Compact Binary Descriptors
"... Abstract. Binary descriptors of image patches are increasingly popular given that they require less storage and enable faster processing. This, however, comes at a price of lower recognition performances. To boost these performances, we project the image patches to a more discriminative subspace, an ..."
Abstract
-
Cited by 10 (1 self)
- Add to MetaCart
(Show Context)
Binary descriptors of image patches are increasingly popular given that they require less storage and enable faster processing. This, however, comes at the price of lower recognition performance. To boost performance, we project the image patches to a more discriminative subspace, and threshold their coordinates to build our binary descriptor. However, applying complex projections to the patches is slow, which negates some of the advantages of binary descriptors. Hence, our key idea is to learn the discriminative projections so that they can be decomposed into a small number of simple filters for which the responses can be computed fast. We show that with as few as 32 bits per descriptor we outperform the state-of-the-art binary descriptors in terms of both accuracy and efficiency.
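The construction can be sketched as filtering a patch with a few simple rectangle-sum filters (computable in constant time via integral images) standing in for the learned decomposed projections, then thresholding the responses to bits. The random filters and the threshold below are placeholders, not the learned ones.

```python
import numpy as np

rng = np.random.default_rng(0)
PATCH, BITS = 32, 32

def random_rect(rng):
    y0, y1 = sorted(rng.integers(0, PATCH, 2))
    x0, x1 = sorted(rng.integers(0, PATCH, 2))
    return y0, y1, x0, x1

rects = [random_rect(rng) for _ in range(BITS)]  # placeholder simple filters
signs = rng.choice([-1.0, 1.0], BITS)

def describe(patch):
    """Threshold rectangle-sum filter responses into a 32-bit descriptor."""
    resp = np.array([s * patch[y0:y1 + 1, x0:x1 + 1].sum()
                     for s, (y0, y1, x0, x1) in zip(signs, rects)])
    return (resp > resp.mean()).astype(np.uint8)  # assumed threshold

print(describe(rng.random((PATCH, PATCH))))
```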