Results 1 - 10 of 101
Object, scene and actions: combining multiple features for human action recognition
- In ECCV, 2010
"... Abstract. In many cases, human actions can be identified not only by the singular observation of the human body in motion, but also properties of the surrounding scene and the related objects. In this paper, we look into this problem and propose an approach for human action recognition that integrat ..."
Abstract
-
Cited by 74 (1 self)
- Add to MetaCart
(Show Context)
In many cases, human actions can be identified not only by the singular observation of the human body in motion, but also by properties of the surrounding scene and the related objects. In this paper, we look into this problem and propose an approach for human action recognition that integrates multiple feature channels from several entities such as objects, scenes and people. We formulate the problem in a multiple instance learning (MIL) framework, based on multiple feature channels. Using a discriminative approach, we combine the multiple feature channels embedded in the MIL space. Our experiments on the large YouTube dataset show that scene and object information can complement person features for human action recognition.
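The abstract does not give implementation details, but the general shape of a multi-channel MIL pipeline can be sketched. Below is a minimal, hypothetical illustration (not the authors' method): each bag is embedded per feature channel via its maximum similarity to a set of prototype instances (in the spirit of MILES-style embeddings), and the per-channel embeddings are concatenated before training a standard discriminative classifier. All function names, the RBF similarity, and the prototype-based embedding are assumptions for illustration.

```python
# Hypothetical sketch of multi-channel MIL embedding: embed each bag
# per channel by max similarity to prototypes, concatenate channels,
# then train a linear classifier on the bag-level vectors.
import numpy as np
from sklearn.svm import LinearSVC

def embed_bag(instances, prototypes, gamma=0.5):
    """instances: (n, d) array for one bag; prototypes: (m, d).
    Feature j = max over instances of RBF similarity to prototype j."""
    d2 = ((instances[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2).max(axis=0)          # shape (m,)

def embed_multichannel(bag_channels, proto_channels):
    """bag_channels: one instance array per channel (e.g., object,
    scene, person features); embeddings are simply concatenated."""
    return np.concatenate([embed_bag(b, p)
                           for b, p in zip(bag_channels, proto_channels)])

# Toy usage with two channels of different dimensionality.
rng = np.random.default_rng(0)
protos = [rng.normal(size=(8, 16)), rng.normal(size=(8, 32))]
X = [embed_multichannel([rng.normal(size=(5, 16)),
                         rng.normal(size=(5, 32))], protos)
     for _ in range(40)]
y = rng.integers(0, 2, size=40)                     # toy action labels
clf = LinearSVC().fit(np.stack(X), y)
```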
Multi-instance learning by treating instances as non-i.i.d. samples
- In Proceedings of the 26th International Conference on Machine Learning, 2009
"... Previous studies on multi-instance learning typically treated instances in the bags as independently and identically distributed. The instances in a bag, however, are rarely independent in real tasks, and a better performance can be expected if the instances are treated in an non-i.i.d. way that exp ..."
Abstract
-
Cited by 43 (5 self)
- Add to MetaCart
(Show Context)
Previous studies on multi-instance learning typically treated the instances in the bags as independently and identically distributed. The instances in a bag, however, are rarely independent in real tasks, and better performance can be expected if the instances are treated in a non-i.i.d. way that exploits the relations among them. In this paper, we propose two simple yet effective methods. In the first method, we explicitly map every bag to an undirected graph and design a graph kernel for distinguishing the positive and negative bags. In the second method, we implicitly construct graphs by deriving affinity matrices and propose an efficient graph kernel that considers the clique information. The effectiveness of the proposed methods is validated by experiments.
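As a hedged sketch of the second method's flavor (an affinity matrix per bag plus a graph kernel that down-weights instances sitting in dense cliques), the snippet below follows the commonly cited miGraph form; the distance threshold and RBF width are assumptions, not values from the paper.

```python
# Sketch of an affinity-based MIL graph kernel (miGraph-style):
# instances within a bag are linked when close, and each instance is
# weighted by the inverse of its neighbour count, so redundant cliques
# do not dominate the bag-to-bag kernel value.
import numpy as np

def rbf(X, Y, gamma=0.5):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def instance_weights(bag, delta=1.0):
    """Affinity graph inside one bag: edge iff distance < delta.
    Weight = 1 / (number of neighbours, including self)."""
    d = np.sqrt(((bag[:, None, :] - bag[None, :, :]) ** 2).sum(-1))
    adj = (d < delta).astype(float)
    return 1.0 / adj.sum(axis=1)

def graph_kernel(bag_i, bag_j, delta=1.0, gamma=0.5):
    wi = instance_weights(bag_i, delta)
    wj = instance_weights(bag_j, delta)
    K = rbf(bag_i, bag_j, gamma)
    return (wi[:, None] * wj[None, :] * K).sum() / (wi.sum() * wj.sum())

# The resulting bag-level kernel matrix can be fed to any kernel
# classifier, e.g. sklearn.svm.SVC(kernel="precomputed").
```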
Action Recognition from One Example
2009
"... We present a novel action recognition method based on space-time locally adaptive regression kernels and the matrix cosine similarity measure. The proposed method uses a single example of an action to find similar matches. It does not require prior knowledge about actions; foreground/background segm ..."
Abstract
-
Cited by 29 (1 self)
- Add to MetaCart
We present a novel action recognition method based on space-time locally adaptive regression kernels and the matrix cosine similarity measure. The proposed method uses a single example of an action to find similar matches. It does not require prior knowledge about actions, foreground/background segmentation, or any motion estimation or tracking. Our method is based on the computation of novel space-time descriptors from a query video, which measure the likeness of a voxel to its surroundings. Salient features are extracted from these descriptors and compared against analogous features from the target video. This comparison is done using a matrix generalization of the cosine similarity measure. The algorithm yields a scalar resemblance volume, with each voxel indicating the likelihood of similarity between the query video and all cubes in the target video. Using nonparametric significance tests and non-maxima suppression, we detect the presence and location of actions similar to the query video. High performance is demonstrated on challenging sets of action data containing fast motions and varied contexts, even when multiple complex actions occur simultaneously within the field of view. Further experiments on the Weizmann and KTH datasets demonstrate state-of-the-art performance in action categorization, despite the use of only a single example.
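The matrix cosine similarity the abstract refers to has a standard form: the Frobenius inner product of two feature matrices, normalized by their Frobenius norms. A minimal sketch of just that measure follows; the space-time descriptor computation and the sliding search over the target video are omitted.

```python
# Matrix cosine similarity: the matrix generalization of the cosine
# measure, i.e. a Frobenius inner product with Frobenius normalization.
import numpy as np

def matrix_cosine_similarity(A, B):
    """MCS(A, B) = <A, B>_F / (||A||_F * ||B||_F), in [-1, 1].
    A and B must have the same shape."""
    num = np.trace(A.T @ B)                  # Frobenius inner product
    den = np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")
    return num / den

# Scanning a query's feature matrix against every space-time cube of a
# target video yields the scalar "resemblance volume" described above.
```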
Deep fragment embeddings for bidirectional image sentence mapping
- arXiv:1406.5679, 2014
"... We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike pre-vious models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of im ..."
Abstract
-
Cited by 29 (2 self)
- Add to MetaCart
(Show Context)
We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
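As a hedged illustration of the kind of fragment-level scoring and max-margin training the abstract describes: score an image-sentence pair by aligning each sentence fragment to its best-matching image fragment, and require the true pair to beat mismatched pairs by a margin. The paper's actual objective (with latent fragment alignments) is richer; the dot-product scoring and loss below are a simplification with assumed names.

```python
# Simplified fragment-alignment scoring and ranking hinge loss, in the
# spirit of (but not identical to) a structured max-margin objective.
import numpy as np

def pair_score(img_frags, sent_frags):
    """img_frags: (n_i, d) embedded image objects; sent_frags: (n_s, d)
    embedded dependency relations.  Each sentence fragment picks its
    best image fragment; the pair score is the mean of those maxima."""
    sims = sent_frags @ img_frags.T            # (n_s, n_i) similarities
    return sims.max(axis=1).mean()

def ranking_loss(score_pos, scores_neg, margin=1.0):
    """Hinge: the true pair must beat every mismatched pair by margin."""
    return sum(max(0.0, margin - score_pos + s) for s in scores_neg)
```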
Discriminative Segment Annotation in Weakly Labeled Video
"... This paper tackles the problem of segment annotation in complex Internet videos. Given a weakly labeled video, we automatically generate spatiotemporal masks for each of the concepts with which it is labeled. This is a particularly relevant problem in the video domain, as large numbers of Internet v ..."
Abstract
-
Cited by 24 (3 self)
- Add to MetaCart
(Show Context)
This paper tackles the problem of segment annotation in complex Internet videos. Given a weakly labeled video, we automatically generate spatiotemporal masks for each of the concepts with which it is labeled. This is a particularly relevant problem in the video domain, as large numbers of Internet videos are now available, tagged with the visual concepts that they contain. Given such weakly labeled videos, we focus on the problem of spatiotemporal segment classification. We propose a straightforward algorithm, CRANE, that utilizes large amounts of weakly labeled video to rank spatiotemporal segments by the likelihood that they correspond to a given visual concept. We make publicly available segment-level annotations for a subset of the Prest et al. dataset [20] and show convincing results. We also show state-of-the-art results on Hartmann et al.'s more difficult, large-scale object segmentation dataset [11].
[Figure 1. Output of our system: given a weakly tagged video (e.g., “dog”), unsupervised spatiotemporal segmentation is performed first; the method then identifies segments that correspond to the label to generate a semantic segmentation.]
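The abstract does not spell out CRANE's scoring rule, so the following is only a hedged sketch of the general idea behind such negative-evidence ranking: segments from videos not tagged with the concept serve as negatives, and a segment from a tagged video ranks higher the farther its features lie from those negatives. The nearest-neighbor scoring here is an assumption, not the paper's exact algorithm.

```python
# Hypothetical negative-evidence ranking of weakly labeled segments:
# rank segments from concept-tagged videos by their distance to the
# nearest segment from untagged (negative) videos.
import numpy as np
from scipy.spatial import cKDTree

def rank_segments(pos_feats, neg_feats):
    """pos_feats: (n_p, d) segments from concept-tagged videos.
    neg_feats: (n_n, d) segments from videos without the tag.
    Returns indices of positive segments, most concept-like first."""
    tree = cKDTree(neg_feats)
    dist_to_neg, _ = tree.query(pos_feats, k=1)
    return np.argsort(-dist_to_neg)   # far from negatives = likely concept
```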
MILIS: Multiple Instance Learning with Instance Selection
2010
"... Multiple-instance learning (MIL) is a paradigm in supervised learning that deals with the classification of collections of instances called bags. Each bag contains a number of instances from which features are extracted. The complexity of MIL is largely dependent on the number of instances in the tr ..."
Abstract
-
Cited by 20 (6 self)
- Add to MetaCart
Multiple-instance learning (MIL) is a paradigm in supervised learning that deals with the classification of collections of instances called bags. Each bag contains a number of instances from which features are extracted. The complexity of MIL is largely dependent on the number of instances in the training data set. Since we are usually confronted with a large instance space even for moderately sized real-world data sets, it is important to design efficient instance selection techniques to speed up the training process without compromising performance. In this paper, we address the issue of instance selection in MIL. We propose MILIS, a novel MIL algorithm based on adaptive instance selection. We do this in an alternating optimisation framework by intertwining the steps of instance selection and classifier learning in an iterative manner which is guaranteed to converge. Initial instance selection is achieved by a simple yet effective kernel density estimator on the negative instances. Experimental results demonstrate the utility and efficiency of the proposed approach as compared to the state of the art.
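The abstract names the initialization concretely: a kernel density estimator fit on negative instances. A minimal sketch of that step follows; the bandwidth and the argmin selection rule are assumptions, and the paper's alternating refinement loop is omitted.

```python
# Sketch of MILIS-style initial instance selection: fit a KDE on all
# instances from negative bags, then pick from each positive bag the
# instance that is least like the negatives (lowest negative density),
# as the bag's initial positive prototype.
import numpy as np
from sklearn.neighbors import KernelDensity

def select_prototypes(pos_bags, neg_bags, bandwidth=1.0):
    neg = np.vstack(neg_bags)                     # all negative instances
    kde = KernelDensity(bandwidth=bandwidth).fit(neg)
    prototypes = []
    for bag in pos_bags:                          # bag: (n_i, d) array
        log_dens = kde.score_samples(bag)         # log p_neg(x) per instance
        prototypes.append(bag[np.argmin(log_dens)])
    return np.stack(prototypes)
```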
MIForests: Multiple-instance learning with randomized trees.
- In ECCV, 2010
"... Abstract. Multiple-instance learning (MIL) allows for training classifiers from ambiguously labeled data. In computer vision, this learning paradigm has been recently used in many applications such as object classification, detection and tracking. This paper presents a novel multiple-instance learn ..."
Abstract
-
Cited by 19 (3 self)
- Add to MetaCart
(Show Context)
Multiple-instance learning (MIL) allows for training classifiers from ambiguously labeled data. In computer vision, this learning paradigm has recently been used in many applications such as object classification, detection and tracking. This paper presents a novel multiple-instance learning algorithm for randomized trees called MIForests. Randomized trees are fast, inherently parallel and multi-class, and are thus increasingly popular in computer vision. MIForests combine the advantages of these classifiers with the flexibility of multiple-instance learning. In order to leverage the randomized trees for MIL, we define the hidden class labels inside target bags as random variables. These random variables are optimized by training random forests and using a fast iterative homotopy method for solving the non-convex optimization problem. Additionally, most previously proposed MIL approaches operate in batch or off-line mode and thus assume access to the entire training set. This limits their applicability in scenarios where the data arrives sequentially and in dynamic environments. We show that MIForests are not limited to off-line problems and present an on-line extension of our approach. In the experiments, we evaluate MIForests on standard visual MIL benchmark datasets, where we achieve state-of-the-art results while being faster than previous approaches and being able to inherently solve multi-class problems. The on-line version of MIForests is evaluated on visual object tracking, where we outperform the state-of-the-art method based on boosting.
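To make the structure concrete, here is a deliberately simplified sketch: hidden instance labels in positive bags are alternately re-estimated from forest posteriors and used to retrain the forest, under the MIL constraint that each positive bag keeps at least one positive instance. The paper optimizes this with a deterministic-annealing homotopy; the plain hard-EM loop below is an illustrative stand-in, not the authors' algorithm.

```python
# Simplified MIForests-style loop: alternate (i) training a random
# forest on current instance labels with (ii) re-labeling instances in
# positive bags from forest posteriors, forcing at least one positive
# instance per positive bag.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def miforest_sketch(pos_bags, neg_bags, iters=5):
    X_neg = np.vstack(neg_bags)
    y_hidden = [np.ones(len(b), dtype=int) for b in pos_bags]  # init all pos
    rf = None
    for _ in range(iters):
        X = np.vstack([X_neg] + pos_bags)
        y = np.concatenate([np.zeros(len(X_neg), dtype=int)] + y_hidden)
        rf = RandomForestClassifier(n_estimators=100).fit(X, y)
        for i, bag in enumerate(pos_bags):
            p = rf.predict_proba(bag)[:, 1]       # P(positive) per instance
            y_new = (p > 0.5).astype(int)
            if y_new.sum() == 0:                  # bag constraint: keep the
                y_new[np.argmax(p)] = 1           # most confident instance
            y_hidden[i] = y_new
    return rf
```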
Web image mining towards universal age estimator
- In Proceedings of the Seventeenth ACM International Conference on Multimedia, 2009
"... In this paper, we present an automatic web image mining system towards building a universal human age estimator based on facial information, which is applicable to all ethnic groups and various image qualities. First, a large (∼391k) yet noisy human aging im-age dataset is crawled from the photo sha ..."
Abstract
-
Cited by 15 (3 self)
- Add to MetaCart
(Show Context)
In this paper, we present an automatic web image mining system for building a universal human age estimator based on facial information, which is applicable to all ethnic groups and various image qualities. First, a large (∼391k) yet noisy human aging image dataset is crawled from the photo sharing website Flickr and the Google image search engine based on a set of human age related text queries. Then, within each image, several human face detectors of different implementations are used for robust face detection, and all the detected faces with multiple responses are considered as the multiple instances of a bag (image). An outlier removal step with Principal Component Analysis further refines the image set to about 220k faces, and then a robust multi-instance regressor learning algorithm is proposed to learn the kernel-regression-based human age estimator under scenarios with possibly noisy bags. The proposed system has the following characteristics: 1) no manual human age labeling process is required, and the age information is automatically obtained from the age-related queries; 2) the derived human age estimator is universal owing to the diversity and richness of Internet images and thus has good generalization capability; and 3) the age estimator learning process is robust to the noise existing in both the Internet images and the corresponding age labels. This automatically derived human age estimator is extensively evaluated on three popular benchmark human aging databases, and, without taking any images from these benchmark databases as training samples, age estimation accuracies comparable to state-of-the-art results are achieved.
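The abstract mentions the PCA outlier-removal step without details. One common reading of such a step (an assumption here, not the paper's specified rule) is to keep only faces whose reconstruction error under a low-dimensional principal subspace is small, i.e. faces that look like typical faces in the collection:

```python
# Hypothetical PCA-based outlier removal: project face features onto a
# low-dimensional principal subspace and drop the samples with the
# largest reconstruction error (likely non-faces or corrupted crops).
import numpy as np
from sklearn.decomposition import PCA

def remove_outliers(face_feats, n_components=50, keep_frac=0.9):
    pca = PCA(n_components=n_components).fit(face_feats)
    recon = pca.inverse_transform(pca.transform(face_feats))
    err = ((face_feats - recon) ** 2).sum(axis=1)      # per-face error
    thresh = np.quantile(err, keep_frac)               # drop worst 10%
    return face_feats[err <= thresh]
```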
Weakly Supervised Learning of Object Segmentations from Web-Scale Video
"... Abstract. We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog” ..."
Abstract
-
Cited by 14 (0 self)
- Add to MetaCart
(Show Context)
We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as “dog”, without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graphcuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
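As a hedged sketch of the graph-cut refinement step the abstract mentions: classifier scores supply unary ("seed") capacities toward source/sink terminals, segment similarity supplies pairwise smoothness capacities, and a min-cut separates object from background segments. The generic min-cut solver (networkx) and the specific capacity choices below are assumptions, not the paper's exact energy.

```python
# Hypothetical graph-cut refinement over spatiotemporal segments:
# unary capacities from the segment classifier, pairwise capacities
# from segment similarity, object/background split via min-cut.
import networkx as nx

def refine_with_graphcut(scores, edges, lam=1.0):
    """scores: {seg_id: P(object)} from the segment classifier.
    edges: list of (seg_a, seg_b, similarity) for adjacent segments."""
    G = nx.Graph()
    for s, p in scores.items():
        G.add_edge("SRC", s, capacity=p)          # object seed strength
        G.add_edge(s, "SINK", capacity=1.0 - p)   # background strength
    for a, b, sim in edges:
        G.add_edge(a, b, capacity=lam * sim)      # smoothness term
    _, (src_side, _) = nx.minimum_cut(G, "SRC", "SINK")
    return src_side - {"SRC"}                     # object segments
```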