Results 1 - 9 of 9
A Dataset for Movie Description
- In CVPR, 2015
Abstract
Cited by 3 (0 self)
Descriptive video service (DVS) provides linguistic descriptions of movies and allows visually impaired people to follow a movie along with their peers. Such descriptions are by design mainly visual and thus naturally form an interesting data source for computer vision and computational linguistics. In this work we propose a novel dataset which contains transcribed DVS, which is temporally aligned to full-length HD movies. In addition we also collected the aligned movie scripts which have been used in prior work and compare the two different sources of descriptions. In total the Movie Description dataset contains a parallel corpus of over 54,000 sentences and video snippets from 72 HD movies. We characterize the dataset by benchmarking different approaches for generating video descriptions. Comparing DVS to scripts, we find that DVS is far more visual and describes precisely what is shown rather than what should happen according to the scripts created prior to movie production.
Weakly-Supervised Alignment of Video With Text
- 2015
Abstract
Cited by 2 (0 self)
HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.
Advancing Human Pose and Gesture Recognition
- 2015
Abstract
Cited by 1 (1 self)
This thesis presents new methods in two closely related areas of computer vision: human pose estimation, and gesture recognition in videos. In human pose estimation, we show that random forests can be used to estimate human pose in monocular videos. To this end, we propose a co-segmentation algorithm for segmenting humans out of videos, and an evaluator that predicts whether the estimated poses are correct or not. We further extend this pose estimator to new domains (with a transfer learning approach), and enhance its predictions by predicting the joint positions sequentially (rather than independently) in an image, and using temporal information in the videos (rather than predicting the poses from a single frame). Finally, we go beyond random forests, and show that convolutional neural networks can be used to estimate human pose even more accurately and efficiently. We propose two new convolutional neural network architectures, and show how optical flow can be employed in convolutional nets to further improve the predictions. In gesture recognition, we explore the idea of using weak supervision to learn gestures. We show that we can learn sign language automatically from signed TV broadcasts with subtitles by letting algorithms ‘watch’ the TV broadcasts and ‘match’ the signs with the subtitles. We further show that if even a small amount of strong supervision is available (as there is for sign language, in the form of sign language video dictionaries), this strong supervision can be combined with weak supervision to learn even better models.
Temporally Coherent Interpretations for Long Videos Using Pattern Theory
Abstract
Graph-theoretical methods have successfully provided semantic and structural interpretations of images and videos. A recent paper introduced a pattern-theoretic approach that allows construction of flexible graphs for representing interactions of actors with objects, and inference is accomplished by an efficient annealing algorithm. Actions and objects are termed generators and their interactions are termed bonds; together they form high-probability configurations, or interpretations, of observed scenes. This work and other structural methods have generally been limited to analyzing short videos involving isolated actions. Here we provide an extension that uses additional temporal bonds across individual actions to enable semantic interpretations of longer videos. Longer temporal connections improve scene interpretations as they help discard (temporally) local solutions in favor of globally superior ones. Using this extension, we demonstrate improvements in understanding longer videos, compared to individual interpretations of non-overlapping time segments. We verified the success of our approach by generating interpretations for more than 700 video segments from the YouCook dataset, with intricate videos that exhibit cluttered background, scenarios of occlusion, viewpoint variations and changing conditions of illumination. Interpretations for long video segments were able to yield performance increases of about 70% and, in addition, proved to be more robust to different severe scenarios of classification errors.
FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem
Multi-observation Face Recognition in Videos based on Label Propagation
Abstract
In order to deal with the huge amount of content generated by social media, especially for indexing and retrieval purposes, the focus shifted from single object recognition to multi-observation object recognition. Of particular interest is the problem of face recognition (used as a primary cue for persons’ identity assessment), since it is highly required by popular social media search engines like Facebook and YouTube. Recently, several approaches for graph-based label propagation were proposed. However, the associated graphs were constructed in an ad-hoc manner (e.g., using the KNN graph) that cannot cope properly with the rapid and frequent changes in data appearance, a phenomenon intrinsically related with video sequences. In this paper, we propose a novel approach for efficient and adaptive graph construction, based on a two-phase scheme: (i) the first phase is used to adaptively find the neighbors of a sample and also to find the adequate weights for the minimization function of the second phase; (ii) in the second phase, the selected neighbors along with their corresponding weights are used to locally and collaboratively estimate the sparse affinity matrix weights. Experimental results performed on the Honda Video Database (HVDB) and a subset of video sequences extracted from the popular TV series ‘Friends’ show a distinct advantage of the proposed method over the existing standard graph construction methods.
Watch-n-Patch: Unsupervised Understanding of Actions and Relations
Abstract
We focus on modeling human activities comprising multiple actions in a completely unsupervised setting. Our model learns the high-level action co-occurrence and temporal relations between the actions in the activity video. We consider the video as a sequence of short-term action clips, called action-words, and an activity is represented by a set of action-topics indicating which actions are present in the video. Then we propose a new probabilistic model relating the action-words and the action-topics. It allows us to model long-range action relations that commonly exist in complex activities, which is challenging to capture in previous works. We apply our model to unsupervised action segmentation and recognition, and also to a novel application that detects forgotten actions, which we call action patching. For evaluation, we also contribute a new challenging RGB-D activity video dataset recorded by the new Kinect v2, which contains several human daily activities as compositions of multiple actions interacting with different objects. The extensive experiments show the effectiveness of our model.