Results 1 - 10
of
16
Recognizing Realistic Actions from Videos “in the Wild”
"... In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild. ” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous ..."
Abstract
-
Cited by 47 (8 self)
- Add to MetaCart
In this paper, we present a systematic framework for recognizing realistic actions from videos “in the wild. ” Such unconstrained videos are abundant in personal collections as well as on the web. Recognizing action from such videos has not been addressed extensively, primarily due to the tremendous variations that result from camera motion, background clutter, changes in object appearance, and scale, etc. The main challenge is how to extract reliable and informative features from the unconstrained videos. We extract both motion and static features from the videos. Since the raw features of both types are dense yet noisy, we propose strategies to prune these features. We use motion statistics to acquire stable motion features and clean static features. Furthermore, PageRank is used to mine the most informative static features. In order to further construct compact yet discriminative visual vocabularies, a divisive information-theoretic algorithm is employed to group semantically related features. Finally, AdaBoost is chosen to integrate all the heterogeneous yet complementary features for recognition. We have tested the framework on the KTH dataset and our own dataset consisting of 11 categories of actions collected from YouTube and personal videos, and have obtained impressive results for action recognition and action localization. 1.
Class Segmentation and Object Localization with Superpixel Neighborhoods
"... We propose a method to identify and localize object classes in images. Instead of operating at the pixel level, we advocate the use of superpixels as the basic unit of a class segmentation or pixel localization scheme. To this end, we construct a classifier on the histogram of local features found i ..."
Abstract
-
Cited by 26 (3 self)
- Add to MetaCart
We propose a method to identify and localize object classes in images. Instead of operating at the pixel level, we advocate the use of superpixels as the basic unit of a class segmentation or pixel localization scheme. To this end, we construct a classifier on the histogram of local features found in each superpixel. We regularize this classifier by aggregating histograms in the neighborhood of each superpixel and then refine our results further by using the classifier in a conditional random field operating on the superpixel graph. Our proposed method exceeds the previously published state-of-the-art on two challenging datasets: Graz-02 and the PASCAL VOC 2007 Segmentation
Learning Semantic Visual Vocabularies Using Diffusion Distance
"... In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of pointwise mutual information (PMI). In this mi ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
In this paper, we propose a novel approach for learning generic visual vocabulary. We use diffusion maps to automatically learn a semantic visual vocabulary from abundant quantized midlevel features. Each midlevel feature is represented by the vector of pointwise mutual information (PMI). In this midlevel feature space, we believe the features produced by similar sources must lie on a certain manifold. To capture the intrinsic geometric relations between features, we measure their dissimilarity using diffusion distance. The underlying idea is to embed the midlevel features into a semantic lower-dimensional space. Our goal is to construct a compact yet discriminative semantic visual vocabulary. Although the conventional approach using k-means is good for vocabulary construction, its performance is sensitive to the size of the visual vocabulary. In addition, the learnt visual words are not semantically meaningful since the clustering criterion is based on appearance similarity only. Our proposed approach can effectively overcome these problems by capturing the semantic and geometric relations of the feature space using diffusion maps. Unlike some of the supervised vocabulary construction approaches, and the unsupervised methods such as pLSA and LDA, diffusion maps can capture the local intrinsic geometric relations between the midlevel feature points on the manifold. We have tested our approach on the KTH action dataset, our own YouTube action dataset and the fifteen scene dataset, and have obtained very promising results.
Portmanteau Vocabularies for Multi-Cue Image Representation
"... We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is mo ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
We describe a novel technique for feature combination in the bag-of-words model of image classification. Our approach builds discriminative compound words from primitive cues learned independently from training images. Our main observation is that modeling joint-cue distributions independently is more statistically robust for typical classification problems than attempting to empirically estimate the dependent, joint-cue distribution directly. We use Information theoretic vocabulary compression to find discriminative combinations of cues and the resulting vocabulary of portmanteau 1 words is compact, has the cue binding property, and supports individual weighting of cues in the final image representation. State-of-theart results on both the Oxford Flower-102 and Caltech-UCSD Bird-200 datasets demonstrate the effectiveness of our technique compared to other, significantly more complex approaches to multi-cue image representation. 1
Improving Bag-of-Features Action Recognition with Non-local Cues
"... Local space-time features have recently shown promising results within Bag-of-Features (BoF) approach to action recognition in video. Pure local features and descriptors, however, provide only limited discriminative power implying ambiguity among features and sub-optimal classification performance. ..."
Abstract
-
Cited by 2 (1 self)
- Add to MetaCart
Local space-time features have recently shown promising results within Bag-of-Features (BoF) approach to action recognition in video. Pure local features and descriptors, however, provide only limited discriminative power implying ambiguity among features and sub-optimal classification performance. In this work, we propose to disambiguate local space-time features and to improve action recognition by integrating additional nonlocal cues with BoF representation. For this purpose, we decompose video into region classes and augment local features with corresponding region-class labels. In particular, we investigate unsupervised and supervised video segmentation using (i) motion-based foreground segmentation, (ii) person detection, (iii) static action detection and (iv) object detection. While such segmentation methods might be imperfect, they provide complementary region-level information to local features. We demonstrate how this information can be integrated with BoF representations in a kernel-combination framework. We evaluate our method on the recent and challenging Hollywood-2 action dataset and demonstrate significant improvements.
Categorization in Natural Time-Varying Image Sequences
"... Approaches to single image categorization do not easily generalize to natural time-varying image sequences. In natural environments, object categories tend to have few features that help to distinguish between each other and the surrounding environment. To better discriminate between categories and ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Approaches to single image categorization do not easily generalize to natural time-varying image sequences. In natural environments, object categories tend to have few features that help to distinguish between each other and the surrounding environment. To better discriminate between categories and the surrounding environment, we propose a multi-view categorization approach that exploits the statistics of image sequences rather than single images. The approach is unbiased towards redundant views – that is, it does not matter how many times an object appears from the same viewpoint. At the same time, the approach does not penalize for missing views, so that we do not have to capture an object at all viewpoints to successfully categorize the object. We first present a data set for studying natural environment monitoring: an image sequence of birds at a feeder station. After manual localization, a baseline bag of features approach was found to perform significantly worse on the proposed data set compared to the standard Caltech 101 data set. We find that our approach increases the categorization accuracy from 48 % to 58 % on average when compared to an equivalent single view categorization method. Finally, we show how the same metric proposed for the supervised categorization can be used to transform, in an unsupervised manner, an image sequence into a manageable set of categories. 1.
Efficient Object Pixel-Level Categorization using Bag of Features
"... Abstract. In this paper we present a pixel-level object categorization method suitable to be applied under real-time constraints. Since pixels are categorized using a bag of features scheme, the major bottleneck of such an approach would be the feature pooling in local histograms of visual words. Th ..."
Abstract
- Add to MetaCart
Abstract. In this paper we present a pixel-level object categorization method suitable to be applied under real-time constraints. Since pixels are categorized using a bag of features scheme, the major bottleneck of such an approach would be the feature pooling in local histograms of visual words. Therefore, we propose to bypass this time-consuming step and directly obtain the score of a linear Support Vector Machine classifier. This is achieved by creating an integral image of the components of the SVM which can readily obtain the classification score for any image sub-window with only 10 additions and 2 products, regardless of its size. Besides, we evaluated the performance of two efficient feature quantization methods: the Hierarchical K-Means and the Extremely Randomized Forest. All experiments have been done in the Graz02 database, showing comparable, or even better results to related work with a lower computational cost. 1
Human Action Recognition using Pyramid Vocabulary Tree
"... Abstract. The bag-of-visual-words (BOVW) approaches are widely used in human action recognition. Usually, large vocabulary size of the BOVW is more discriminative for inter-class action classification while small one is more robust to noise and thus tolerant to the intra-class invariance. In this pa ..."
Abstract
- Add to MetaCart
Abstract. The bag-of-visual-words (BOVW) approaches are widely used in human action recognition. Usually, large vocabulary size of the BOVW is more discriminative for inter-class action classification while small one is more robust to noise and thus tolerant to the intra-class invariance. In this pape, we propose a pyramid vocabulary tree to model local spatio-temporal features, which can characterize the inter-class difference and also allow intra-class variance. Moreover, since BOVW is geometrically unconstrained, we further consider the spatio-temporal information of local features and propose a sparse spatio-temporal pyramid matching kernel (termed as SST-PMK) to compute the similarity measures between video sequences. SST-PMK satisfies the Mercer’s condition and therefore is readily integrated into SVM to perform action recognition. Experimental results on the Weizmann datasets show that both the pyramid vocabulary tree and the SST-PMK lead to a significant improvement in human action recognition. Keywords: Action recognition, Bag-of-visual-words (BOVW), Pyramid matching kernel (PMK)
Fast and Robust Object Segmentation with the Integral Linear Classifier
"... We propose an efficient method, built on the popular Bag of Features approach, that obtains robust multiclass pixellevel object segmentation of an image in less than 500ms, with results comparable or better than most state of the art methods. We introduce the Integral Linear Classifier (ILC), that c ..."
Abstract
- Add to MetaCart
We propose an efficient method, built on the popular Bag of Features approach, that obtains robust multiclass pixellevel object segmentation of an image in less than 500ms, with results comparable or better than most state of the art methods. We introduce the Integral Linear Classifier (ILC), that can readily obtain the classification score for any image sub-window with only 6 additions and 1 product by fusing the accumulation and classification steps in a single operation. In order to design a method as efficient as possible, our building blocks are carefully selected from the quickest in the state of the art. More precisely, we evaluate the performance of three popular local descriptors, that can be very efficiently computed using integral images, and two fast quantization methods: the Hierarchical K-Means, and the Extremely Randomized Forest. Finally, we explore the utility of adding spatial bins to the Bag of Features histograms and that of cascade classifiers to improve the obtained segmentation. Our method is compared to the state of the art in the difficult Graz-02 and PASCAL 2007 Segmentation Challenge datasets. 1.
PoseEstimationfor CategorySpecific Multiview Object Localization ∗
"... Weproposeanapproachtoovercomethetwomainchallenges of 3D multiview object detection and localization: Thevariationof objectfeaturesduetochangesinthe viewpoint and the variation in the size and aspect ratio of the object. Our approach proceeds in three steps. Given an initial bounding box of fixed siz ..."
Abstract
- Add to MetaCart
Weproposeanapproachtoovercomethetwomainchallenges of 3D multiview object detection and localization: Thevariationof objectfeaturesduetochangesinthe viewpoint and the variation in the size and aspect ratio of the object. Our approach proceeds in three steps. Given an initial bounding box of fixed size, we first refine its aspect ratio and size. We can then predict the viewing angle, under the hypothesis that the boundingbox actually contains an object instance. Finally, a classifier tuned to this particular viewpoint checks the existence of an instance. As a result, we can find the object instances and estimate their poses, without having to search over all window sizes and potentialorientations. We train and evaluate our method on a new object databasespecificallytailoredfor thistask, containingrealworld objects imaged over a wide range of smoothly varying viewpoints and significant lighting changes. We show thatthesuccessive estimationsof theboundingboxandthe viewpointleadtobetterlocalizationresults. 1.

