Discriminative object class models of appearance and shape by correlatons (2006)

by S. Savarese, J. Winn, A. Criminisi
Venue: In CVPR, 2006
Results 1 - 10 of 78

Unsupervised learning of human action categories using spatial-temporal words

by Juan Carlos Niebles, Hongcheng Wang, Li Fei-Fei - In Proc. BMVC, 2006
Abstract - Cited by 494 (8 self)
Imagine a video taken on a sunny beach: can a computer automatically tell what is happening in the scene? Can it identify different human activities in the video, such as water surfing, people walking and lying on the beach? Automatically classifying or localizing different actions in video sequences is very useful for a variety of tasks, such as video surveillance, object-level video summarization, video indexing, digital library organization, etc. However, it remains a challenging task for computers to achieve robust action recognition due to cluttered backgrounds, camera motion, occlusion, and geometric and photometric variances of objects. For example, in a live video of a skating competition, the skater moves rapidly across the rink, and the camera also moves to follow the skater. With a moving camera, non-stationary background, and a moving target, few vision algorithms could identify, categorize and ...
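The bag-of-spatial-temporal-words representation this paper builds on can be sketched in a few lines. This is a minimal illustration only (hypothetical names, hard nearest-codeword assignment); the codebook is assumed to have been learned beforehand, e.g. by k-means over training cuboid descriptors:

```python
import numpy as np

def bag_of_video_words(descriptors, codebook):
    """Quantize space-time interest-point descriptors against a codebook
    and build a normalized 'video word' histogram. A toy sketch of the
    bag-of-video-words representation; real systems would learn the
    codebook from training data and may use soft assignment."""
    # distance from every descriptor to every codeword
    d = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    assign = d.argmin(axis=1)                      # nearest codeword index
    hist = np.bincount(assign, minlength=len(codebook)).astype(float)
    return hist / hist.sum()
```

The resulting histogram is what an unsupervised topic model (as in this paper) or a discriminative classifier would then consume.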

Auto-context and its Application to High-level Vision Tasks

by Zhuowen Tu, Xiang Bai - In Proc. CVPR
Abstract - Cited by 156 (6 self)
The notion of using context information for solving high-level vision and medical image segmentation problems has been increasingly realized in the field. However, how to learn an effective and efficient context model, together with an image appearance model, remains mostly unknown. The current literature using Markov Random Fields (MRFs) and Conditional Random Fields (CRFs) often involves specific algorithm design, in which the modeling and computing stages are studied in isolation. In this paper, we propose the auto-context algorithm. Given a set of training images and their corresponding label maps, we first learn a classifier on local image patches. The discriminative probability (or classification confidence) maps created by the learned classifier are then used as context information, in addition to the original image patches, to train a new classifier. The algorithm then iterates until convergence. Auto-context integrates low-level and context information by fusing a large number of low-level appearance features with context and implicit shape information. The resulting discriminative algorithm is general and easy to implement. Under nearly the same parameter settings in training, we apply the algorithm to three challenging vision applications: foreground/background segregation, human body configuration estimation, and scene region labeling. Moreover, context also plays a very important role in medical/brain images where the anatomical structures are mostly constrained to relatively fixed positions. With only some slight changes resulting from using 3D instead of 2D features, the auto-context algorithm applied to brain MRI image segmentation is shown to outperform state-of-the-art algorithms specifically designed for this domain. Furthermore, the scope of the proposed algorithm goes beyond image analysis and it has the potential to be used for a wide variety of problems in multi-variate labeling.

Citation Context

...learning the parameters and computing for the solutions. From the point of view of using context information, there has been a lot of recent work proposed in object recognition and scene understanding [8, 19, 16, 22, 18, 28, 7, 20]. A pioneering work was proposed by Belongie et al. [2] which uses shape context in shape matching. In this paper, we make an effort to address some of the questions mentioned above by proposing an auto-context algorithm ...
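The auto-context loop the abstract describes — train a classifier, feed its probability maps back in as extra context features, iterate — can be sketched as follows. This is a minimal illustration on a 1-D "image" with a made-up nearest-mean classifier standing in for the paper's boosting classifiers; all names are hypothetical:

```python
import numpy as np

class NearestMeanClassifier:
    """Tiny stand-in classifier (for illustration only): soft class
    scores from distance to per-class feature means."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        d = np.linalg.norm(X[:, None, :] - self.means_[None, :, :], axis=2)
        w = np.exp(-d)
        return w / w.sum(axis=1, keepdims=True)

def auto_context_train(X, y, n_rounds=3):
    """Auto-context sketch: each round appends the previous round's
    probability maps (each pixel's own plus its left/right neighbours'
    class probabilities, on a 1-D image) as extra context features."""
    feats = X
    models = []
    for _ in range(n_rounds):
        clf = NearestMeanClassifier().fit(feats, y)
        models.append(clf)
        proba = clf.predict_proba(feats)
        # context: probability maps of left and right neighbours
        left = np.vstack([proba[:1], proba[:-1]])
        right = np.vstack([proba[1:], proba[-1:]])
        feats = np.hstack([X, proba, left, right])
    return models, feats
```

The key structural point survives the simplification: the classifier of round *t*+1 sees both the raw appearance features and the discriminative probability maps produced at round *t*.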

Learning human action via information maximization

by Jingen Liu, Mubarak Shah - In CVPR, 2008
Abstract - Cited by 152 (13 self)
In this paper, we present a novel approach for automatically learning a compact and yet discriminative appearance-based human action model. A video sequence is represented by a bag of spatiotemporal features called video-words, obtained by quantizing the extracted 3D interest points (cuboids) from the videos. Our proposed approach is able to automatically discover the optimal number of video-word clusters by utilizing Maximization of Mutual Information (MMI). Unlike the k-means algorithm, which is typically used to cluster spatiotemporal cuboids into video words based on their appearance similarity, MMI clustering further groups the video-words which are highly correlated to some group of actions. To capture the structural information of the learnt optimal video-word clusters, we explore the correlation of the compact video-word clusters. We use the modified correlogram, which is not only translation and rotation invariant, but also somewhat scale invariant. We extensively test our proposed approach on two publicly available challenging datasets: the KTH dataset and the IXMAS multi-view dataset. To the best of our knowledge, we are the first to try a bag-of-video-words approach on the multi-view dataset. We have obtained very impressive results on both datasets.

Citation Context

...belong to the same body. Fig. 1 shows some examples of the spatial distribution of the cuboids. In our work, we apply the correlogram, which has been successfully applied for image and scene classification [7, 17]. The modified correlogram is able to somewhat cope with the translation, rotation and scale problems. Besides, we also explore the spatiotemporal pyramid approach in order to capture both spatial and ...

Shape and appearance context modeling

by Xiaogang Wang, Gianfranco Doretto, Thomas Sebastian, Jens Rittscher, Peter Tu - In Proc. ICCV, 2007
Abstract - Cited by 93 (12 self)
In this work we develop appearance models for computing the similarity between image regions containing deformable objects of a given class in realtime. We introduce the concept of shape and appearance context. The main idea is to model the spatial distribution of the appearance relative to each of the object parts. Estimating the model entails computing occurrence matrices. We introduce a generalization of the integral image and integral histogram frameworks, and prove that it can be used to dramatically speed up occurrence computation. We demonstrate the ability of this framework to recognize an individual walking across a network of cameras. Finally, we show that the proposed approach outperforms several other methods.

Citation Context

...approach performs poorly [6], mainly due to the failure of capturing higher-order information, such as the relative spatial distribution of the appearance labels. Some approaches address exactly this issue [11, 27, 21, 19], but they mainly focus on inter-category discrimination, as opposed to recognizing specific objects. We refer to those as multi-layer approaches. In this work we propose a multi-layer appearance model ...
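The integral-histogram machinery this paper generalizes can be illustrated directly: precompute cumulative per-label counts so that the label histogram of any rectangle costs four lookups. A minimal sketch (the paper's actual generalization to occurrence matrices is more involved):

```python
import numpy as np

def integral_histogram(labels, n_bins):
    """Integral histogram over a 2-D label map: ih[y, x, b] counts
    occurrences of label b in labels[:y, :x], so any rectangle's
    histogram can be read off with four lookups."""
    H, W = labels.shape
    onehot = np.zeros((H, W, n_bins))
    onehot[np.arange(H)[:, None], np.arange(W)[None, :], labels] = 1
    ih = onehot.cumsum(axis=0).cumsum(axis=1)
    # pad a leading row/column of zeros so prefix sums start at 0
    return np.pad(ih, ((1, 0), (1, 0), (0, 0)))

def rect_hist(ih, y0, x0, y1, x1):
    """Histogram of labels in rows y0:y1, cols x0:x1 via 4 lookups."""
    return ih[y1, x1] - ih[y0, x1] - ih[y1, x0] + ih[y0, x0]
```

With this table precomputed once per image, per-region occurrence counts become O(1), which is the source of the speed-up the abstract claims for occurrence computation.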

Nonparametric Scene Parsing via Label Transfer

by Ce Liu, Jenny Yuen, Antonio Torralba , 2011
Abstract - Cited by 66 (3 self)
While there has been a lot of recent work on object recognition and image understanding, the focus has been on carefully establishing mathematical models for images, scenes, and objects. In this paper, we propose a novel, nonparametric approach for object recognition and scene parsing using a new technology we name label transfer. For an input image, our system first retrieves its nearest neighbors from a large database containing fully annotated images. Then, the system establishes dense correspondences between the input image and each of the nearest neighbors using the dense SIFT flow algorithm [28], which aligns two images based on local image structures. Finally, based on the dense scene correspondences obtained from SIFT flow, our system warps the existing annotations and integrates multiple cues in a Markov random field framework to segment and recognize the query image. Promising experimental results have been achieved by our nonparametric scene parsing system on challenging databases. Compared to existing object recognition approaches that require training classifiers or appearance models for each object category, our system is easy to implement, has few parameters, and embeds contextual information naturally in the retrieval/alignment procedure.

Citation Context

...images in either a sparse [2], [16], [19] manner, by selecting the top key points containing the highest response from the feature descriptor, or densely, by observing feature statistics across the image [40], [51]. Sparse key point representations are often matched among pairs of images. Since the generic problem of matching two sets of key points is NP-hard, approximation algorithms have been developed ...

Proximity distribution kernels for geometric context in category recognition

by Haibin Ling, Stefano Soatto - In IEEE International Conference on Computer Vision, 2007
Abstract - Cited by 55 (3 self)
We propose using the proximity distribution of vector-quantized ...

Citation Context

...sequentially. Compared to [20], our approach is much simpler in that no object model is required. The most related work is the correlogram, first proposed in [15] and later extended by Savarese et al. [32]. In [32], a correlogram is used to measure the distribution of distances between all pairs of visual labels and is then applied to category classification tasks (in combination with the label distribution) ...

Integrated feature selection and higher-order spatial feature extraction for object categorization

by David Liu, Gang Hua, Paul Viola, Tsuhan Chen - In CVPR, 2008
Abstract - Cited by 42 (6 self)
In computer vision, the bag-of-visual words image representation has been shown to yield good results. Recent work has shown that modeling the spatial relationship between visual words further improves performance. Previous work extracts higher-order spatial features exhaustively. However, these spatial features are expensive to compute. We propose a novel method that simultaneously performs feature selection and feature extraction. Higher-order spatial features are progressively extracted based on selected lower-order ones, thereby avoiding exhaustive computation. The method can be based on any additive feature selection algorithm such as boosting. Experimental results show that the method is computationally much more efficient than previous approaches, without sacrificing accuracy.

Citation Context

... is to extract 2nd-order features based on previously selected 1st-order features and to progressively add them into the feature pool. ... descriptors image representation [3] and its recent extensions [15], [10], [18]. Local feature descriptors are image statistics extracted from pixel neighborhoods or patches. Recent work of [15], [10], [18] focused on modeling the spatial relationship between pixels or patches ...
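The progressive-extraction idea — form 2nd-order (pair) features only from visual words that survived 1st-order selection, instead of enumerating all pairs — can be sketched as follows. All names and the distance threshold are hypothetical; the paper drives the selection with boosting rather than a fixed word set:

```python
import numpy as np
from itertools import combinations

def second_order_features(word_positions, selected_words, radius=2.0):
    """Count spatially close pairs, but only among visual words already
    selected at 1st order, avoiding the exhaustive all-pairs sweep.
    word_positions maps word id -> list of (x, y) occurrences."""
    feats = {}
    for a, b in combinations(sorted(selected_words), 2):
        count = 0
        for pa in word_positions.get(a, []):
            for pb in word_positions.get(b, []):
                if np.linalg.norm(np.asarray(pa) - np.asarray(pb)) <= radius:
                    count += 1
        feats[(a, b)] = count
    return feats
```

If only s of the V words are selected, the pair sweep shrinks from O(V²) candidate features to O(s²), which is the computational saving the abstract points to.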

Descriptive visual words and visual phrases for image applications

by Shiliang Zhang, Qi Tian, Gang Hua, Qingming Huang, Shipeng Li - In Proc. ACM Multimedia, 2009
Abstract - Cited by 37 (10 self)
The Bag-of-visual Words (BoW) image representation has been applied to various problems in the fields of multimedia and computer vision. The basic idea is to represent images as visual documents composed of repeatable and distinctive visual elements, which are comparable to the words in texts. However, extensive experiments show that the commonly used visual words are not as expressive as text words, which is not desirable because it hinders their effectiveness in various applications. In this paper, Descriptive Visual Words (DVWs) and Descriptive Visual Phrases (DVPs) are proposed as the visual correspondences to text words and phrases, where visual phrases refer to frequently co-occurring visual word pairs. Since images are the carriers of visual objects and scenes, a novel descriptive visual element set can be composed of the visual words and their combinations, which ...

Citation Context

...Bag-of-visual Words (BoW) image representation has been utilized for many multimedia and vision problems, including video event detection [27, 30, 34], object recognition [11, 12, 15, 18, 20, 21, 26], image segmentation [28, 31], and large-scale image retrieval [17, 22, 29], etc. Representing an image as a visual document composed of repeatable and distinctive basic visual elements that are index...

Spatial-Temporal correlatons for unsupervised action classification

by Silvio Savarese, Andrey Delpozo, Juan Carlos Niebles, Li Fei-Fei
Abstract - Cited by 35 (2 self)
Spatial-temporal local motion features have shown promising results in complex human action classification. Most previous works [6], [16], [21] treat these spatial-temporal features as a bag of video words, omitting any long-range, global information in either the spatial or temporal domain. Other ways of learning the temporal signature of motion tend to impose a fixed trajectory on the features or on parts of the human body returned by tracking algorithms. This leaves little flexibility for the algorithm to learn the optimal temporal pattern describing these motions. In this paper, we propose the usage of spatial-temporal correlograms to encode flexible long-range temporal information into the spatial-temporal motion features. This results in a much richer description of human actions. We then apply an unsupervised generative model to learn different classes of human actions from these ST-correlograms. The KTH dataset, one of the most challenging and popular human action datasets, is used for experimental evaluation. Our algorithm achieves the highest classification accuracy reported for this dataset under an unsupervised learning scheme.

Citation Context

...n in the temporal dimension. We introduce the idea of correlograms to capture the temporal co-occurrence patterns of the spatial-temporal features. This idea was initially proposed by Savarese et al. [19] for learning object classes in static images. The correlograms are used to capture the long-range spatial patterns beyond the local appearance of patches. We suggest that a similar idea can be extended ...
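The correlogram that both this paper and the cited correlaton work build on can be sketched concretely for the spatial case: for each distance, a normalized histogram of which word pairs co-occur at that distance over a 2-D map of visual-word labels. A minimal sketch (Chebyshev distance rings, hard word assignments; the papers' versions differ in binning and normalization details):

```python
import numpy as np

def correlogram(labels, n_words, radii=(1, 2, 3)):
    """Label correlogram over a 2-D map of visual-word indices: for each
    radius r, count how often word pair (i, j) co-occurs at Chebyshev
    distance exactly r, then normalise each centre word's row into a
    distribution over neighbouring words."""
    H, W = labels.shape
    out = np.zeros((len(radii), n_words, n_words))
    ys, xs = np.mgrid[0:H, 0:W]
    for k, r in enumerate(radii):
        for dy in range(-r, r + 1):
            for dx in range(-r, r + 1):
                if max(abs(dy), abs(dx)) != r:   # only the ring at distance r
                    continue
                y2, x2 = ys + dy, xs + dx
                ok = (y2 >= 0) & (y2 < H) & (x2 >= 0) & (x2 < W)
                np.add.at(out[k], (labels[ok], labels[y2[ok], x2[ok]]), 1)
        s = out[k].sum(axis=1, keepdims=True)
        out[k] = np.divide(out[k], s, out=np.zeros_like(out[k]), where=s > 0)
    return out
```

The spatial-temporal variant described above applies the same counting along the time axis of the video-word sequence instead of the image plane.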

Modeling spatial layout with Fisher vectors for image categorization

by Josip Krapac, Jakob Verbeek, Frédéric Jurie - In ICCV, 2011
Abstract - Cited by 33 (11 self)
We introduce an extension of bag-of-words image representations to encode spatial layout. Using the Fisher kernel framework we derive a representation that encodes the spatial mean and the variance of image regions associated with visual words. We extend this representation by using a Gaussian mixture model to encode spatial layout, and show that this model is related to a soft-assign version of the spatial pyramid representation. We also combine our representation of spatial layout with the use of Fisher kernels to encode the appearance of local features. Through an extensive experimental evaluation, we show that our representation yields state-of-the-art image categorization results, while being more compact than spatial pyramid representations. In particular, using Fisher kernels to encode both appearance and spatial layout results in an image representation that is computationally efficient, compact, and yields excellent performance while using linear classifiers.

Citation Context

...features, and using absolute positions. Considering pairs of spatially close image regions is probably the most intuitive way to incorporate spatial information. Visual word “bigrams” are considered in [23], by forming a bag-of-words representation over spatially neighboring image regions. Others have proposed a more efficient feature selection method based on boosting, which progressively mines higher-...
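The first-order spatial statistics this representation is built from — the mean and variance of the positions of regions assigned to each visual word — can be computed directly. A deliberately simplified sketch (hard assignments and raw moments; the paper's model adds a spatial GMM and Fisher-kernel gradients on top of this):

```python
import numpy as np

def spatial_word_stats(positions, words, n_words):
    """Per-visual-word spatial mean and variance of the (x, y) positions
    of the image regions assigned to that word. Words with no occurrences
    keep a zero mean and unit variance as a neutral default."""
    mean = np.zeros((n_words, 2))
    var = np.ones((n_words, 2))
    for w in range(n_words):
        p = positions[words == w]
        if len(p):
            mean[w] = p.mean(axis=0)
            var[w] = p.var(axis=0)
    return mean, var
```

Encoding layout as per-word moments rather than fixed grid cells is what lets the representation stay much more compact than a spatial pyramid.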

