Aligning Books and Movies: Towards Story-like Visual Explanations by Watching Movies and Reading Books
Citations: 2 (1 self)
Citations
2222 | BLEU: A Method for Automatic Evaluation of Machine Translation
- Papineni, Roukos, et al.
- 2002
Citation Context: ...ute similarities between each shot in the movie and each sentence in the book. For dialogs, we use several similarity measures each capturing a different level of semantic similarity. We compute BLEU [23] between each subtitle and book sentence to identify nearly identical matches. Similarly to [34], we use a tf-idf measure to find near duplicates but weighing down the influence of the less frequent w...
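The dialog similarities described in this excerpt are easy to sketch. Below is a minimal reconstruction, not the authors' implementation: it scores subtitle/book-sentence pairs with sentence-level BLEU (via NLTK) and tf-idf cosine similarity (via scikit-learn); the toy sentences, tokenization, and smoothing choice are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of BLEU + tf-idf similarity
# between subtitle lines and book sentences. Assumes nltk and scikit-learn.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

subtitles = ["Are you kidding me?", "I need to see you."]
book_sents = ["Are you kidding me?", "He walked out into the cold night."]

smooth = SmoothingFunction().method1  # avoids zero BLEU on short sentences

def bleu_sim(subtitle, sentence):
    # BLEU of the subtitle against the book sentence treated as the reference
    return sentence_bleu([sentence.lower().split()], subtitle.lower().split(),
                         smoothing_function=smooth)

# tf-idf vectors fit jointly over both sides, compared by cosine similarity
vectorizer = TfidfVectorizer().fit(subtitles + book_sents)
tfidf_sim = cosine_similarity(vectorizer.transform(subtitles),
                              vectorizer.transform(book_sents))

for i, sub in enumerate(subtitles):
    for j, sent in enumerate(book_sents):
        print(f"{sub!r} vs {sent!r}: BLEU={bleu_sim(sub, sent):.3f}, "
              f"tfidf={tfidf_sim[i, j]:.3f}")
```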
453 | Long short-term memory.
- Hochreiter, Schmidhuber
- 1997
Citation Context: ...etwork, inspired by the success of encoder-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing gradient problem, through the use of gates to control the flow of information. The LSTM unit explicit...
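To make the gating described in this excerpt concrete, here is a minimal NumPy sketch of one GRU step. It is illustrative only: the dimensions, initialization, and the exact (1 − z)/z convention are assumptions, not details taken from the cited papers.

```python
# Minimal NumPy sketch of a single GRU step; gates control how much of the
# previous state is kept versus overwritten by the candidate state.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, W, U, b):
    r = sigmoid(W["r"] @ x + U["r"] @ h_prev + b["r"])               # reset gate
    z = sigmoid(W["z"] @ x + U["z"] @ h_prev + b["z"])               # update gate
    h_tilde = np.tanh(W["h"] @ x + U["h"] @ (r * h_prev) + b["h"])   # candidate
    return (1 - z) * h_prev + z * h_tilde                            # gated update

d_in, d_h = 4, 3
rng = np.random.default_rng(0)
W = {k: 0.1 * rng.standard_normal((d_h, d_in)) for k in "rzh"}
U = {k: 0.1 * rng.standard_normal((d_h, d_h)) for k in "rzh"}
b = {k: np.zeros(d_h) for k in "rzh"}

h = np.zeros(d_h)
for x in rng.standard_normal((5, d_in)):   # run over a toy 5-step sequence
    h = gru_step(x, h, W, U, b)
print(h)
```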
371 | Efficient estimation of word representations in vector space
- Mikolov, Chen, et al.
Citation Context: ... from the sentence embedding model. For each query sentence on the left, we retrieve the 4 nearest neighbor sentences (by inner product) chosen from books the model has not seen before. the skip-gram [22] architecture for learning representations of words. In the word skip-gram model, a word w_i is chosen and must predict its surrounding context (e.g. w_{i+1} and w_{i−1} for a context window of size 1). Our ...
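The word skip-gram setup mentioned in this excerpt predicts surrounding words from a centre word. The toy sketch below only shows how (center, context) training pairs are enumerated for a given window size, which is the part the excerpt describes; the corpus and window size are made up, and a real system would feed these pairs to a learned model.

```python
# Enumerate (center, context) training pairs for the word skip-gram objective.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))  # center word predicts context word
    return pairs

print(skipgram_pairs("the quick brown fox".split(), window=1))
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick'), ...]
```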
142 | Every picture tells a story: Generating sentences from images.
- Farhadi, Hejrati, et al.
- 2010
Citation Context: ...Most effort in the domain of vision and language has been devoted to the problem of image captioning. Older work made use of fixed visual representations and translated them into textual descriptions [6, 16]. Recently, several approaches based on RNNs emerged, generating captions via a learned joint image-text embedding [13, 11, 36, 21]. These approaches have also been extended to generate descriptions o...
99 | Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers.
- Gupta, Davis
- 2008
Citation Context: ...le captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie Gone Girl, along with the subtitle, aligned with the book. We reason abo...
81 | Baby talk: Understanding and generating simple image descriptions
- Kulkarni, Premraj, et al.
- 2011
Citation Context: ...Most effort in the domain of vision and language has been devoted to the problem of image captioning. Older work made use of fixed visual representations and translated them into textual descriptions [6, 16]. Recently, several approaches based on RNNs emerged, generating captions via a learned joint image-text embedding [13, 11, 36, 21]. These approaches have also been extended to generate descriptions o...
76 | Sequence to Sequence Learning with Neural Networks
- Sutskever, Vinyals, et al.
- 2014
Citation Context: ...s not considered in this paper, which is explored in [14]. To construct an encoder, we use a recurrent neural network, inspired by the success of encoder-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing g...
59 | Neural machine translation by jointly learning to align and translate. The Computing Research Repository
- Bahdanau, Cho, et al.
- 2014
Citation Context: ...s not considered in this paper, which is explored in [14]. To construct an encoder, we use a recurrent neural network, inspired by the success of encoder-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing g...
57 | “Who are you?” Learning person specific classifiers from video
- Sivic, Everingham, et al.
- 2009
Citation Context: ...em which identified sets of visual and audio features to align movies and scripts without making use of the subtitles. Such alignment has been exploited to provide weak labels for person naming tasks [5, 30, 25]. Closest to our work is [34], which aligns plot synopses to shots in the TV series for story-based content retrieval. This work adopts a similarity function between sentences in plot synopses and sho...
53 | Recurrent continuous translation models.
- Kalchbrenner, Blunsom
- 2013
Citation Context: ...s not considered in this paper, which is explored in [14]. To construct an encoder, we use a recurrent neural network, inspired by the success of encoder-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing g...
47 | Movie/script: Alignment and parsing of video and text transcription
- Cour, Jordan, et al.
- 2008
Citation Context: ...beddings with soft attention in order to align the words to image regions. Early work on movie-to-text alignment includes dynamic time warping for aligning movies to scripts with the help of subtitles [5, 4]. Sankar et al. [28] further developed a system which identified sets of visual and audio features to align movies and scripts without making use of the subtitles. Such alignment has been exploited to...
47 | Deep Visual-Semantic Alignments for Generating Image Descriptions.
- Karpathy, Fei-Fei
- 2015
Citation Context: ...otten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure ...
46 | Going deeper with convolutions. arXiv preprint arXiv:1409.4842
- Szegedy, Liu, et al.
- 2014
Citation Context: ...his dataset has 94 movies, and 54,000 described clips. We represent each movie clip as a vector corresponding to mean-pooled features across each frame in the clip. We used the GoogLeNet architecture [32] as well as hybrid-CNN [38] for extracting frame features. For DVS, we pre-processed the descriptions by removing names and replacing these with a someone token. The LSTM architecture in this work is ...
46 | Learning deep features for scene recognition using places database.
- Zhou, Lapedriza, et al.
- 2014
Citation Context: ...and 54,000 described clips. We represent each movie clip as a vector corresponding to mean-pooled features across each frame in the clip. We used the GoogLeNet architecture [32] as well as hybrid-CNN [38] for extracting frame features. For DVS, we pre-processed the descriptions by removing names and replacing these with a someone token. The LSTM architecture in this work is implemented using the follo...
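The clip representation described in this excerpt is simple enough to sketch: per-frame CNN features are mean-pooled into one vector per clip. The feature extractor below is a stand-in (random features); the paper uses GoogLeNet [32] and hybrid-CNN [38], and the frame count and dimensionality here are illustrative assumptions.

```python
# Mean-pool per-frame CNN features into a single clip-level vector.
import numpy as np

def clip_feature(frame_features):
    """frame_features: (num_frames, feature_dim) array of per-frame CNN features."""
    return np.asarray(frame_features).mean(axis=0)

frames = np.random.rand(120, 1024)   # stand-in for e.g. 120 frames of 1024-d features
clip_vec = clip_feature(frames)
print(clip_vec.shape)                # (1024,)
```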
43 | Microsoft COCO: Common objects in context.
- Lin, Maire, et al.
- 2014
Citation Context: ...important for applications such as social robotics or assistive driving. Combining images or videos with language has gotten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning fro...
38 | Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In: EMNLP
- Cho, Merrienboer, et al.
- 2014
Citation Context: ...s not considered in this paper, which is explored in [14]. To construct an encoder, we use a recurrent neural network, inspired by the success of encoder-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing g...
32 | Show and tell: A neural image caption generator.
- Vinyals, Toshev, et al.
- 2015
Citation Context: ...otten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure ...
26 | Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980
- Kingma, Ba
- 2014
Citation Context: ...ces conditioned on the representation of the encoder: $\sum_t \log P(w^t_{i+1} \mid w^{<t}_{i+1}, h_i) + \sum_t \log P(w^t_{i-1} \mid w^{<t}_{i-1}, h_i)$ (10). The total objective is the above summed over all such training tuples. Adam algorithm [12] is used for optimization. 4.2. Visual-semantic embeddings of clips and DVS. The model above describes how to obtain a similarity score between two sentences, whose representations are learned from mil...
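The excerpt above names Adam [12] as the optimizer for the skip-thought objective. Purely to make that reference concrete, here is a minimal NumPy sketch of a single Adam update step; the hyperparameters are the commonly cited defaults and the quadratic toy loss is an illustration, not anything from the paper.

```python
# Single Adam update step on a toy quadratic loss.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias correction
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

target = np.array([1.0, -2.0, 0.5])
theta, m, v = np.zeros(3), np.zeros(3), np.zeros(3)
for t in range(1, 1001):
    grad = 2 * (theta - target)           # gradient of ||theta - target||^2
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)                              # moves toward the target
```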
26 | Unifying visual-semantic embeddings with multimodal neural language models. TACL
- Kiros, Salakhutdinov, et al.
- 2015
Citation Context: ...otten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure ...
26 | Efficient Structured Prediction with Latent Variables for General Graphical Models
- Schwing, Hazan, et al.
- 2012
Citation Context: ...to further speed up computation. Learning. Since ground-truth is only available for a sparse set of shots, we regard the states of unobserved nodes as hidden variables and learn the CRF weights with [29]. 5. Experimental Evaluation. We evaluate our model on our dataset of 11 movie/book pairs. We train the parameters in our model (CNN and CRF) on Gone Girl, and test our performance on the remaining 10 ...
20 | Explain images with multimodal recurrent neural networks. NIPS Deep Learning Workshop
- Mao, Xu, et al.
- 2014
Citation Context: ...otten significant attention in the past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure ...
15 | Show, attend and tell: Neural image caption generation with visual attention.
- Xu, Ba, et al.
- 2015
Citation Context: ...ouns and pronouns in a caption and visual objects using several visual and textual potentials. Lin et al. [17] do so for videos. In [11], the authors use RNN embeddings to find the correspondences. [37] combines neural embeddings with soft attention in order to align the words to image regions. Early work on movie-to-text alignment includes dynamic time warping for aligning movies to scripts with the...
14 | “Hello! My name is... Buffy” – Automatic naming of characters in TV video
- Everingham, Sivic, et al.
Citation Context: ...beddings with soft attention in order to align the words to image regions. Early work on movie-to-text alignment includes dynamic time warping for aligning movies to scripts with the help of subtitles [5, 4]. Sankar et al. [28] further developed a system which identified sets of visual and audio features to align movies and scripts without making use of the subtitles. Such alignment has been exploited to...
13 | What are you talking about? text-to-image coreference
- Kong, Lin, et al.
- 2014
Citation Context: ...he past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie G...
13 | Translating videos to natural language using deep recurrent neural networks.
- Venugopalan, Xu, et al.
- 2015
12 | Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555
- Chung, Gulcehre, et al.
- 2014
Citation Context: ...der-decoder models for neural machine translation [10, 2, 1, 31]. Two kinds of activation functions have recently gained traction: long short-term memory (LSTM) [9] and the gated recurrent unit (GRU) [3]. Both types of activation successfully solve the vanishing gradient problem, through the use of gates to control the flow of information. The LSTM unit explicitly employs a cell that acts as a carouse...
12 | Visual semantic search: Retrieving videos via complex textual queries
- Lin, Fidler, et al.
- 2014
Citation Context: ...s such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie Gone Girl, along with the subtitle, aligned with the book. We reason about the visual and dialog (text) alignment between the movie and a bo...
11 | A multi-world approach to question answering about real-world scenes based on uncertain input.
- Malinowski, Fritz
- 2014
Citation Context: ...ly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie Gone Girl, alon...
10 | A sentence is worth a thousand pixels
- Fidler, Sharma, et al.
- 2013
Citation Context: ...descriptions of short video clips [35]. In [24], the authors go beyond describing what is happening in an image and provide explanations about why something is happening. For text-to-image alignment, [15, 7] find correspondences between nouns and pronouns in a caption and visual objects using several visual and textual potentials. Lin et al. [17] do so for videos. In [11], the authors use RNN embedding...
10 | Video event understanding using natural language descriptions
- Ramanathan, Liang, et al.
- 2013
Citation Context: ...le captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie Gone Girl, along with the subtitle, aligned with the book. We reason abo...
8 | Skip-thought vectors.
- Kiros, Zhu, et al.
- 2015
6 | Linking people in videos with their names using coreference resolution
- Ramanathan, Joulin, et al.
- 2014
Citation Context: ...em which identified sets of visual and audio features to align movies and scripts without making use of the subtitles. Such alignment has been exploited to provide weak labels for person naming tasks [5, 30, 25]. Closest to our work is [34], which aligns plot synopses to shots in the TV series for story-based content retrieval. This work adopts a similarity function between sentences in plot synopses and sho...
3 | A dataset for movie description.
- Rohrbach, Rohrbach, et al.
- 2015
Citation Context: ...esence of characters in a scene to those in a chapter, as well as uses hand-crafted similarity measures between sentences in the subtitles and dialogs in the books, similarly to [34]. Rohrbach et al. [27] recently released the Movie Description dataset which contains clips from movies, each time-stamped with a sentence from DVS (Descriptive Video Service). The dataset contains clips from over 100 mo...
3 | Subtitle-free Movie to Script Alignment
- Sankar, Jawahar, et al.
- 2009
Citation Context: ...ention in order to align the words to image regions. Early work on movie-to-text alignment includes dynamic time warping for aligning movies to scripts with the help of subtitles [5, 4]. Sankar et al. [28] further developed a system which identified sets of visual and audio features to align movies and scripts without making use of the subtitles. Such alignment has been exploited to provide weak labels...
2 | Don’t just listen, use your imagination: Leveraging visual common sense for non-visual tasks. arXiv preprint arXiv:1502.06108
- Lin, Parikh
- 2015
Citation Context: ...ly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie Gone Girl, alon...
2 | Book2Movie: Aligning Video scenes with Book chapters
- Tapaswi, Bauml, et al.
- 2015
Citation Context: ...d might vary in the storyline from their movie release. Furthermore, we use learned neural embeddings to compute the similarities rather than hand-designed similarity functions. Parallel to our work, [33] aims to align scenes in movies to chapters in the book. However, their approach operates on a very coarse level (chapters), while ours does so on the sentence/paragraph level. Their dataset thus eval...
1 | Inferring the why in images. arXiv.org
- Pirsiavash, Vondrick, et al.
- 2014
Citation Context: ...s based on RNNs emerged, generating captions via a learned joint image-text embedding [13, 11, 36, 21]. These approaches have also been extended to generate descriptions of short video clips [35]. In [24], the authors go beyond describing what is happening in an image and provide explanations about why something is happening. For text-to-image alignment, [15, 7] find correspondences between nouns and ...
1 | Aligning Plot Synopses to Videos for Story-based Retrieval
- Tapaswi, Bauml, et al.
Citation Context: ...he past year, partly due to the creation of CoCo [18], Microsoft’s large-scale captioned image dataset. The field has tackled a diverse set of tasks such as captioning [13, 11, 36, 35, 21], alignment [11, 15, 34], Q&A [20, 19], visual model learning from textual descriptions [8, 26], and semantic visual search with natural multisentence queries [17]. Figure 1: Shot from the movie G...