Results 1-10 of 56
B.: Learning realistic human actions from movies
- In: CVPR (2008)
Cited by 143 (16 self)
The aim of this paper is to address recognition of natural human actions in diverse and realistic video settings. This challenging but important subject has mostly been ignored in the past due to several problems, one of which is the lack of realistic and annotated video datasets. Our first contribution is to address this limitation and to investigate the use of movie scripts for automatic annotation of human actions in videos. We evaluate alternative methods for action retrieval from scripts and show benefits of a text-based classifier. Using the retrieved action samples for visual learning, we next turn to the problem of action classification in video. We present a new method for video classification that builds upon and extends several recent ideas including local space-time features, space-time pyramids and multichannel non-linear SVMs. The method is shown to improve state-of-the-art results on the standard KTH action dataset by achieving 91.8% accuracy. Given the inherent problem of noisy labels in automatic annotation, we particularly investigate and show high tolerance of our method to annotation errors in the training set. We finally apply the method to learning and classifying challenging action classes in movies and show promising results.
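The abstract above mentions multichannel non-linear SVMs without giving the kernel. As a rough illustration only, the sketch below combines per-channel histogram distances into a single kernel value using a chi-square distance and an exponential. The choice of chi-square, the function names, and the per-channel scale values are assumptions made for this example, not details taken from the abstract.

```python
import math

def chi2_distance(h1, h2):
    """Chi-square distance between two histograms of equal length."""
    total = 0.0
    for a, b in zip(h1, h2):
        if a + b > 0:  # skip empty bins to avoid division by zero
            total += (a - b) ** 2 / (a + b)
    return 0.5 * total

def multichannel_kernel(x, y, scales):
    """Kernel value for two samples described by several histogram channels.

    x, y:   dict mapping channel name -> histogram (list of floats)
    scales: dict mapping channel name -> positive normaliser (hypothetical values)
    """
    combined = sum(chi2_distance(x[c], y[c]) / scales[c] for c in scales)
    return math.exp(-combined)

# Toy samples with two hypothetical channels.
sample_a = {"appearance": [1.0, 0.0], "motion": [0.5, 0.5]}
sample_b = {"appearance": [0.0, 1.0], "motion": [0.5, 0.5]}
scales = {"appearance": 1.0, "motion": 1.0}

k_same = multichannel_kernel(sample_a, sample_a, scales)
k_diff = multichannel_kernel(sample_a, sample_b, scales)
```

A kernel matrix built this way could be passed to any SVM implementation that accepts precomputed kernels; identical samples score 1 and the value decays as the channel histograms diverge.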
Attribute and Simile Classifiers for Face Verification
- In IEEE International Conference on Computer Vision (ICCV), 2009
Cited by 57 (7 self)
We present two novel methods for face verification. Our first method – “attribute” classifiers – uses binary classifiers trained to recognize the presence or absence of describable aspects of visual appearance (e.g., gender, race, and age). Our second method – “simile” classifiers – removes the manual labeling required for attribute classification and instead learns the similarity of faces, or regions of faces, to specific reference people. Neither method requires costly, often brittle, alignment between image pairs; yet, both methods produce compact visual descriptions, and work on real-world images. Furthermore, both the attribute and simile classifiers improve on the current state-of-the-art for the LFW data set, reducing the error rates compared to the current best by 23.92% and 26.34%, respectively, and 31.68% when combined. For further testing across pose, illumination, and expression, we introduce a new data set – termed PubFig – of real-world images of public figures (celebrities and politicians) acquired from the internet. This data set is both larger (60,000 images) and deeper (300 images per individual) than existing data sets of its kind. Finally, we present an evaluation of human performance.
Is that you? Metric learning approaches for face identification
- In ICCV, 2009
Cited by 24 (4 self)
Face identification is the problem of determining whether two face images depict the same person or not. This is difficult due to variations in scale, pose, lighting, background, expression, hairstyle, and glasses. In this paper we present two methods for learning robust distance measures: (a) a logistic discriminant approach which learns the metric from a set of labelled image pairs (LDML) and (b) a nearest neighbour approach which computes the probability for two images to belong to the same class (MkNN). We evaluate our approaches on the Labeled Faces in the Wild data set, a large and very challenging data set of faces from Yahoo! News. The evaluation protocol for this data set defines a restricted setting, where a fixed set of positive and negative image pairs is given, as well as an unrestricted one, where faces are labelled by their identity. We are the first to present results for the unrestricted setting, and show that our methods benefit from this richer training data, much more so than the current state-of-the-art method. Our results of 79.3% and 87.5% correct for the restricted and unrestricted setting respectively, significantly improve over the current state-of-the-art result of 78.5%. Confidence scores obtained for face identification can be used for many applications, e.g. clustering or recognition from a single training example. We show that our learned metrics also improve performance for these tasks.
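To make the idea of a logistic discriminant over a learned metric more concrete, here is a minimal sketch of how a pairwise "same person" probability might be computed from a Mahalanobis-style distance. The parameterisation (a sigmoid of a bias minus the squared distance), the function names, and all numeric values are assumptions for illustration; the abstract does not spell out the model, and no training step is shown.

```python
import math

def mahalanobis_sq(x, y, M):
    """Squared distance (x - y)^T M (x - y), with M given as nested lists."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return sum(d[i] * M[i][j] * d[j] for i in range(n) for j in range(n))

def same_person_prob(x, y, M, bias):
    """Logistic model: a small distance gives a probability near 1."""
    return 1.0 / (1.0 + math.exp(-(bias - mahalanobis_sq(x, y, M))))

# Toy example with an identity metric and a hypothetical bias.
M = [[1.0, 0.0], [0.0, 1.0]]
p_close = same_person_prob([0.0, 0.0], [0.1, 0.0], M, bias=1.0)
p_far = same_person_prob([0.0, 0.0], [3.0, 0.0], M, bias=1.0)
```

In a learned setting, M and the bias would be fitted to labelled image pairs, for example by maximising the log-likelihood of the pair labels; here they are fixed by hand to keep the example short.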
Movie/Script: Alignment and Parsing of Video and Text Transcription
Cited by 14 (3 self)
Movies and TV are a rich source of diverse and complex video of people, objects, actions and locales “in the wild”. Harvesting automatically labeled sequences of actions from video would enable creation of large-scale and highly varied datasets. To enable such collection, we focus on the task of recovering scene structure in movies and TV series for object tracking and action retrieval. We present a weakly supervised algorithm that uses the screenplay and closed captions to parse a movie into a hierarchy of shots and scenes. Scene boundaries in the movie are aligned with screenplay scene labels and shots are reordered into a sequence of long continuous tracks or threads which allow for more accurate tracking of people, actions and objects. Scene segmentation, alignment, and shot threading are formulated as inference in a unified generative model, and a novel hierarchical dynamic programming algorithm that can handle alignment and jump-limited reorderings in linear time is presented. We present quantitative and qualitative results on movie alignment and parsing, and use the recovered structure to improve character naming and retrieval of common actions in several episodes of popular TV series.
Learning sign language by watching TV (using weakly aligned subtitles)
- In Computer Vision and Pattern Recognition, 2009
Cited by 14 (2 self)
The goal of this work is to automatically learn a large number of British Sign Language (BSL) signs from TV broadcasts. We achieve this by using the supervisory information available from subtitles broadcast simultaneously with the signing. This supervision is both weak and noisy: it is weak due to the correspondence problem since temporal distance between sign and subtitle is unknown and signing does not follow the text order; it is noisy because subtitles can be signed in different ways, and because the occurrence of a subtitle word does not imply the presence of the corresponding sign. The contributions are: (i) we propose a distance function to match signing sequences which includes the trajectory of both hands, the hand shape and orientation, and properly models the case of hands touching; (ii) we show that by optimizing a scoring function based on multiple instance learning, we are able to extract the sign of interest from hours of signing footage, despite the very weak and noisy supervision. The method is automatic given the English target word of the sign to be learnt. Results are presented for 210 words including nouns, verbs and adjectives.
Learning from Ambiguously Labeled Images
Cited by 10 (3 self)
In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in the scene, but not when and where they are on the screen. We formulate the learning problem in this setting as partially-supervised multiclass classification where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. Motivated by the analysis, we propose a general convex learning formulation based on minimization of a surrogate loss appropriate for the ambiguous label setting. We apply our framework to identifying faces culled from web news sources and to naming characters in TV series and movies. We experiment on a very large dataset consisting of 100 hours of video, and in particular achieve 6% error for character naming on 16 episodes of LOST.
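The abstract refers to a convex surrogate loss for ambiguously labeled instances without stating its form. The sketch below shows one plausible loss of that kind: the average score over the candidate labels is pushed up, and the score of every non-candidate label is pushed down. The hinge function, the function names, and the example names and scores are assumptions for illustration, and this may differ from the formulation used in the paper.

```python
def hinge(u):
    """Convex hinge penalty: zero once the margin u reaches 1."""
    return max(0.0, 1.0 - u)

def ambiguous_label_loss(scores, candidates):
    """Surrogate loss for one instance with a set of candidate labels.

    scores:     dict mapping label -> real-valued classifier score
    candidates: non-empty set of labels, one of which is assumed correct
    """
    cand = [scores[k] for k in candidates]
    rest = [s for k, s in scores.items() if k not in candidates]
    # Reward a high mean score on the candidate set ...
    loss = hinge(sum(cand) / len(cand))
    # ... and penalise positive scores on labels outside it.
    loss += sum(hinge(-s) for s in rest)
    return loss

# Hypothetical scores for one face and two possible captions.
scores = {"jack": 2.0, "kate": 0.0, "locke": -2.0}
loss_consistent = ambiguous_label_loss(scores, {"jack", "kate"})
loss_conflicting = ambiguous_label_loss(scores, {"locke"})
```

Because each term is a convex function of the scores, the total stays convex when the scores are linear in the model parameters, which is what makes this style of loss convenient to minimise.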
Automatic Face Naming with Caption-based Supervision
- In IEEE Conf. on Computer Vision Pattern Recognition (CVPR), 2008
Cited by 9 (3 self)
We consider two scenarios of naming people in databases of news photos with captions: (i) finding faces of a single person, and (ii) assigning names to all faces. We combine an initial text-based step, that restricts the name assigned to a face to the set of names appearing in the caption, with a second step that analyzes visual features of faces. By searching for groups of highly similar faces that can be associated with a name, the results of purely text-based search can be greatly ameliorated. We improve a recent graph-based approach, in which nodes correspond to faces and edges connect highly similar faces. We introduce constraints when optimizing the objective function, and propose improvements in the low-level methods used to construct the graphs. Furthermore, we generalize the graph-based approach to face naming in the full data set. In this multi-person naming case the optimization quickly becomes computationally demanding, and we present an important speed-up using graph-flows to compute the optimal name assignments in documents. Generative models have previously been proposed to solve the multi-person naming task. We compare the generative and graph-based methods in both scenarios, and find significantly better performance using the graph-based methods in both cases.
Contextual Identity Recognition in Personal Photo Albums
Cited by 9 (0 self)
We present an efficient probabilistic method for identity recognition in personal photo albums. Personal photos are usually taken under uncontrolled conditions – the captured faces exhibit significant variations in pose, expression and illumination that limit the success of traditional face recognition algorithms. We show how to improve recognition rates by incorporating additional cues present in personal photo collections, such as clothing appearance and information about when the photo was taken. This is done by constructing a Markov Random Field (MRF) that effectively combines all available contextual cues in a principled recognition framework. Performing inference in the MRF produces markedly improved recognition results in a challenging dataset consisting of the personal photo collections of multiple people. At the same time, the computational cost of our approach remains comparable to that of standard face recognition approaches.
Automatic Annotation of Human Actions in Video
Cited by 9 (2 self)
This paper addresses the problem of automatic temporal annotation of realistic human actions in video using minimal manual supervision. To this end we consider two associated problems: (a) weakly-supervised learning of action models from readily available annotations, and (b) temporal localization of human actions in test videos. To avoid the prohibitive cost of manual annotation for training, we use movie scripts as a means of weak supervision. Scripts, however, provide only implicit, noisy, and imprecise information about the type and location of actions in video. We address this problem with a kernel-based discriminative clustering algorithm that locates actions in the weakly-labeled training data. Using the obtained action samples, we train temporal action detectors and apply them to locate actions in the raw video data. Our experiments demonstrate that the proposed method for weakly-supervised learning of action models leads to significant improvement in action detection. We present detection results for three action classes in four feature-length movies with challenging and realistic video data.
Leveraging archival video for building face datasets
Cited by 8 (0 self)
We introduce a semi-supervised method for building large, labeled datasets of faces by leveraging archival video. Specifically, we have implemented a system for labeling 11 years' worth of archival footage from a television show. We have compiled a dataset of 611,770 faces, orders of magnitude larger than existing collections. It includes variation in appearance due to age, weight gain, changes in hairstyles, and other factors difficult to observe in smaller-scale collections. Face recognition in an uncontrolled setting can be difficult. We argue (and demonstrate) that there is much structure at varying timescales in the video data that makes recognition much easier. At local time scales, one can use motion and tracking to group face images together: we may not know the identity, but we know a single label applies to all faces in a track. At medium time scales (say, within a scene), one can use appearance features such as hair and clothing to group tracks across shot boundaries. However, at longer timescales (say, across episodes), one can no longer use clothing as a cue. This suggests that one needs to carefully encode representations of appearance, depending on the timescale at which one intends to match. We assemble our final dataset by classifying groups of tracks in a nearest-neighbors framework. We use a face library obtained by labeling track clusters in a reference episode. We show that this classification is significantly easier when exploiting the hierarchical structure naturally present in the video sequences. From a data-collection point of view, tracking is vital because it adds non-frontal poses to our face collection. This is important because we know of no other method for collecting images of non-frontal faces “in the wild”.

