Results 1 - 10 of 67
A convolutional neural network for modelling sentences.
- In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, 2014
"... Abstract The ability to accurately represent sentences is central to language understanding. We describe a convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of sentences. The network uses Dynamic k-Max Pooling, a global poolin ..."
Abstract
-
Cited by 59 (2 self)
The ability to accurately represent sentences is central to language understanding. We describe a convolutional architecture dubbed the Dynamic Convolutional Neural Network (DCNN) that we adopt for the semantic modelling of sentences. The network uses Dynamic k-Max Pooling, a global pooling operation over linear sequences. The network handles input sentences of varying length and induces a feature graph over the sentence that is capable of explicitly capturing short and long-range relations. The network does not rely on a parse tree and is easily applicable to any language. We test the DCNN in four experiments: small scale binary and multi-class sentiment prediction, six-way question classification and Twitter sentiment prediction by distant supervision. The network achieves excellent performance in the first three tasks and a greater than 25% error reduction in the last task with respect to the strongest baseline.
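The pooling operation named in this abstract can be illustrated concretely. Below is a minimal sketch of k-max pooling over a 1-D feature map in PyTorch; the dynamic schedule the DCNN uses to choose k per layer and sentence length is not reproduced, and the fixed k, tensor shapes, and function name are illustrative assumptions.

```python
# Minimal sketch of k-max pooling: keep the k largest activations along the
# time axis while preserving their left-to-right order, so the pooled
# sequence still reflects word order.
import torch

def k_max_pooling(x: torch.Tensor, k: int, dim: int = -1) -> torch.Tensor:
    """Keep the k largest activations along `dim`, in their original order."""
    top_idx = x.topk(k, dim=dim).indices      # positions of the k largest values
    top_idx = top_idx.sort(dim=dim).values    # restore left-to-right order
    return x.gather(dim, top_idx)

# Example: a batch of 2 sentences, 4 feature channels, 7 time steps.
feats = torch.randn(2, 4, 7)
pooled = k_max_pooling(feats, k=3)            # shape (2, 4, 3)
print(pooled.shape)
```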
Deep visual-semantic alignments for generating image descriptions, 2014
"... We present a model that generates natural language de-scriptions of images and their regions. Our approach lever-ages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between lan-guage and visual data. Our alignment model is based on a novel combinati ..."
Abstract
-
Cited by 47 (0 self)
We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state-of-the-art results in retrieval experiments on the Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
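As a rough illustration of the kind of structured alignment objective described above, here is a minimal sketch of a bidirectional max-margin ranking loss over a shared image-sentence embedding space, written in PyTorch. Whole-image and whole-sentence vectors stand in for the paper's region-level and word-level alignments, and the margin, batch size, and embedding dimension are assumptions for illustration.

```python
# Minimal sketch of a bidirectional max-margin ranking loss: matched
# image-sentence pairs should score higher (by a margin) than mismatched
# pairs in both retrieval directions.
import torch
import torch.nn.functional as F

def ranking_loss(img: torch.Tensor, txt: torch.Tensor, margin: float = 0.1) -> torch.Tensor:
    """img, txt: (batch, dim) embeddings where row i of each is a matched pair."""
    img = F.normalize(img, dim=1)
    txt = F.normalize(txt, dim=1)
    scores = img @ txt.t()                    # cosine similarities, (batch, batch)
    pos = scores.diag().unsqueeze(1)          # matched-pair scores
    # Hinge in both directions: wrong sentences for an image, wrong images for a sentence.
    cost_s = (margin + scores - pos).clamp(min=0)
    cost_i = (margin + scores - pos.t()).clamp(min=0)
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_s = cost_s.masked_fill(mask, 0)
    cost_i = cost_i.masked_fill(mask, 0)
    return cost_s.sum() + cost_i.sum()

loss = ranking_loss(torch.randn(8, 256), torch.randn(8, 256))
```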
A Fast and Accurate Dependency Parser using Neural Networks
"... Almost all current dependency parsers classify based on millions of sparse indi-cator features. Not only do these features generalize poorly, but the cost of feature computation restricts parsing speed signif-icantly. In this work, we propose a novel way of learning a neural network classifier for u ..."
Abstract
-
Cited by 47 (3 self)
Almost all current dependency parsers classify based on millions of sparse indicator features. Not only do these features generalize poorly, but the cost of feature computation restricts parsing speed significantly. In this work, we propose a novel way of learning a neural network classifier for use in a greedy, transition-based dependency parser. Because this classifier learns and uses just a small number of dense features, it can work very fast, while achieving about a 2% improvement in unlabeled and labeled attachment scores on both English and Chinese datasets. Concretely, our parser is able to parse more than 1000 sentences per second at 92.2% unlabeled attachment score on the English Penn Treebank.
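The abstract's central idea, replacing millions of sparse indicator features with a small set of dense embeddings fed to a neural classifier over parser transitions, can be sketched as follows in PyTorch. The feature count, embedding size, transition inventory, and cube activation are illustrative assumptions, and extracting feature ids from an actual stack/buffer configuration is omitted.

```python
# Minimal sketch of a dense-feature transition classifier: embeddings for a
# fixed set of features from the parser configuration are concatenated and
# scored by a small feed-forward network over possible transitions.
import torch
import torch.nn as nn

class TransitionClassifier(nn.Module):
    def __init__(self, vocab: int = 10000, n_feats: int = 48,
                 emb_dim: int = 50, hidden: int = 200, n_transitions: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)
        self.hidden = nn.Linear(n_feats * emb_dim, hidden)
        self.out = nn.Linear(hidden, n_transitions)

    def forward(self, feat_ids: torch.Tensor) -> torch.Tensor:
        # feat_ids: (batch, n_feats) integer ids of the selected features
        x = self.embed(feat_ids).flatten(1)   # (batch, n_feats * emb_dim)
        h = torch.pow(self.hidden(x), 3)      # cube nonlinearity (an assumption here)
        return self.out(h)                    # unnormalized transition scores

scores = TransitionClassifier()(torch.randint(0, 10000, (4, 48)))
```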
Show and tell: A neural image caption generator, 2014
"... Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep re-current architecture that combines recent advances in computer vision an ..."
Abstract
-
Cited by 32 (2 self)
Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In this paper, we present a generative model based on a deep recurrent architecture that combines recent advances in computer vision and machine translation and that can be used to generate natural sentences describing an image. The model is trained to maximize the likelihood of the target description sentence given the training image. Experiments on several datasets show the accuracy of the model and the fluency of the language it learns solely from image descriptions. Our model is often quite accurate, which we verify both qualitatively and quantitatively. For instance, while the current state-of-the-art BLEU score (the higher the better) on the Pascal dataset is 25, our approach yields 59, to be compared to human performance around 69. We also show BLEU score improvements on Flickr30k, from 55 to 66, and on SBU, from 19 to 27.
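A minimal sketch of the training setup the abstract describes, an image feature conditioning a recurrent language model trained to maximize the likelihood of the reference caption, might look like the following in PyTorch. The image feature here is a random vector standing in for a CNN output, and the vocabulary size, dimensions, and single-layer LSTM are assumptions for illustration, not the paper's exact configuration.

```python
# Minimal sketch of likelihood training for image captioning: the projected
# image feature is fed as the first step of an LSTM language model, which is
# trained with teacher forcing to predict the reference caption tokens.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab: int = 5000, img_dim: int = 2048, emb: int = 256, hid: int = 512):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, emb)
        self.embed = nn.Embedding(vocab, emb)
        self.lstm = nn.LSTM(emb, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, img_feat: torch.Tensor, caption: torch.Tensor) -> torch.Tensor:
        # Prepend the projected image feature as the first "token".
        img_tok = self.img_proj(img_feat).unsqueeze(1)   # (batch, 1, emb)
        words = self.embed(caption[:, :-1])              # teacher forcing
        h, _ = self.lstm(torch.cat([img_tok, words], dim=1))
        logits = self.out(h)                             # (batch, T, vocab)
        # Mean negative log-likelihood of the target caption tokens.
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), caption.reshape(-1))

loss = CaptionDecoder()(torch.randn(4, 2048), torch.randint(0, 5000, (4, 12)))
```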
Deep fragment embeddings for bidirectional image sentence mapping
- In arXiv:1406.5679, 2014
"... We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike pre-vious models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of im ..."
Abstract
-
Cited by 29 (2 self)
We introduce a model for bidirectional retrieval of images and sentences through a deep, multi-modal embedding of visual and natural language data. Unlike previous models that directly map images or sentences into a common embedding space, our model works on a finer level and embeds fragments of images (objects) and fragments of sentences (typed dependency tree relations) into a common space. We then introduce a structured max-margin objective that allows our model to explicitly associate these fragments across modalities. Extensive experimental evaluation shows that reasoning on both the global level of images and sentences and the finer level of their respective fragments improves performance on image-sentence retrieval tasks. Additionally, our model provides interpretable predictions for the image-sentence retrieval task since the inferred inter-modal alignment of fragments is explicit.
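To make the fragment-level idea concrete, here is a minimal sketch of scoring an image-sentence pair by matching each sentence fragment to its best-scoring image fragment and summing the matches. The fragment counts and dimensions are assumptions, and the paper's full structured max-margin training objective is not reproduced.

```python
# Minimal sketch of fragment-level image-sentence scoring: each sentence
# fragment (e.g. a dependency relation) is matched to its highest-scoring
# image fragment (e.g. a detected object), and the matches are summed.
import torch

def fragment_score(img_frags: torch.Tensor, sent_frags: torch.Tensor) -> torch.Tensor:
    """img_frags: (n_regions, dim); sent_frags: (n_relations, dim)."""
    sims = sent_frags @ img_frags.t()     # (n_relations, n_regions) inner products
    return sims.max(dim=1).values.sum()   # best image fragment per sentence fragment

score = fragment_score(torch.randn(19, 1000), torch.randn(6, 1000))
```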
Multimodal Neural Language Models
"... We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as gener ..."
Abstract
-
Cited by 29 (4 self)
We introduce two multimodal neural language models: models of natural language that can be conditioned on other modalities. An image-text multimodal neural language model can be used to retrieve images given complex sentence queries, retrieve phrase descriptions given image queries, as well as generate text conditioned on images. We show that in the case of image-text modelling we can jointly learn word representations and image features by training our models together with a convolutional network. Unlike many of the existing methods, our approach can generate sentence descriptions for images without the use of templates, structured prediction, and/or syntactic trees. While we focus on image-text modelling, our algorithms can be easily applied to other modalities such as audio.
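A minimal sketch of a language model conditioned on an image, the general idea described above, is given below in PyTorch. The fixed-size word context and the additive fusion of a projected image vector with the text context are illustrative assumptions rather than the paper's exact model, as are the dimensions and vocabulary size.

```python
# Minimal sketch of an image-conditioned next-word model: a fixed window of
# preceding words and a projected image feature are fused, and the fused
# vector produces logits over the next word.
import torch
import torch.nn as nn

class ImageConditionedLM(nn.Module):
    def __init__(self, vocab: int = 5000, emb: int = 256, img_dim: int = 2048, context: int = 3):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb)
        self.ctx_proj = nn.Linear(context * emb, emb)
        self.img_proj = nn.Linear(img_dim, emb)
        self.out = nn.Linear(emb, vocab)

    def forward(self, prev_words: torch.Tensor, img_feat: torch.Tensor) -> torch.Tensor:
        # prev_words: (batch, context) ids of the preceding words
        ctx = self.ctx_proj(self.embed(prev_words).flatten(1))
        r = torch.tanh(ctx + self.img_proj(img_feat))   # fuse text context and image
        return self.out(r)                              # logits over the next word

logits = ImageConditionedLM()(torch.randint(0, 5000, (4, 3)), torch.randn(4, 2048))
```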
Unifying visual-semantic embeddings with multimodal neural language models
"... Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embed-ding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effecti ..."
Abstract
-
Cited by 26 (4 self)
Inspired by recent advances in multimodal learning and machine translation, we introduce an encoder-decoder pipeline that learns (a): a multimodal joint embedding space with images and text and (b): a novel language model for decoding distributed representations from our space. Our pipeline effectively unifies joint image-text embedding models with multimodal neural language models. We introduce the structure-content neural language model that disentangles the structure of a sentence from its content, conditioned on representations produced by the encoder. The encoder allows one to rank images and sentences while the decoder can generate novel descriptions from scratch. Using an LSTM to encode sentences, we match the state-of-the-art performance on Flickr8K and Flickr30K without using object detections. We also set new best results when using the 19-layer Oxford convolutional network. Furthermore we show that with linear encoders, the learned embedding space captures multimodal regularities in terms of vector space arithmetic, e.g. image of a blue car - "blue" + "red" is near images of red cars. Sample captions generated for 800 images are made available for comparison.
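The multimodal vector arithmetic mentioned at the end of the abstract is easy to illustrate once a joint embedding space exists. The sketch below uses random vectors as stand-ins for embeddings from a trained joint image-text encoder; in real use the image gallery, the query image, and the word vectors would all come from the same learned space, and the names here are hypothetical.

```python
# Minimal sketch of multimodal vector arithmetic in a joint embedding space:
# subtract one word vector from an image embedding, add another, and retrieve
# the nearest image by cosine similarity.
import torch
import torch.nn.functional as F

def nearest_image(query: torch.Tensor, image_bank: torch.Tensor) -> int:
    """Return the index of the image embedding closest to `query`."""
    sims = F.normalize(image_bank, dim=1) @ F.normalize(query, dim=0)
    return int(sims.argmax())

dim = 300
image_bank = torch.randn(1000, dim)              # embeddings of a gallery of images
blue_car_img = torch.randn(dim)                  # embedding of a query image
blue, red = torch.randn(dim), torch.randn(dim)   # word embeddings in the same space
idx = nearest_image(blue_car_img - blue + red, image_bank)
```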
Explain images with multimodal recurrent neural networks, 2014
"... In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by samp ..."
Abstract
-
Cited by 20 (1 self)
In this paper, we present a multimodal Recurrent Neural Network (m-RNN) model for generating novel sentence descriptions to explain the content of images. It directly models the probability distribution of generating a word given previous words and the image. Image descriptions are generated by sampling from this distribution. The model consists of two sub-networks: a deep recurrent neural network for sentences and a deep convolutional network for images. These two sub-networks interact with each other in a multimodal layer to form the whole m-RNN model. The effectiveness of our model is validated on three benchmark datasets (IAPR TC-12 [8], Flickr 8K [28], and Flickr 30K [13]). Our model outperforms the state-of-the-art generative method. In addition, the m-RNN model can be applied to retrieval tasks for retrieving images or sentences, and achieves a significant performance improvement over the state-of-the-art methods which directly optimize the ranking objective function for retrieval.
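A minimal sketch of the multimodal layer described above, where the word embedding, the recurrent hidden state, and the image feature are combined to score the next word, could look like the following in PyTorch. The layer sizes and the additive tanh fusion are assumptions for illustration, not the exact m-RNN configuration, and the recurrent and convolutional sub-networks that would feed this layer are omitted.

```python
# Minimal sketch of a multimodal layer: per time step, project the word
# embedding, the recurrent state, and the image feature into a common layer
# whose output scores the next word.
import torch
import torch.nn as nn

class MultimodalLayer(nn.Module):
    def __init__(self, emb: int = 256, hid: int = 256, img_dim: int = 2048,
                 mm: int = 512, vocab: int = 5000):
        super().__init__()
        self.from_word = nn.Linear(emb, mm)
        self.from_rnn = nn.Linear(hid, mm)
        self.from_img = nn.Linear(img_dim, mm)
        self.out = nn.Linear(mm, vocab)

    def forward(self, word_emb, rnn_state, img_feat):
        m = torch.tanh(self.from_word(word_emb)
                       + self.from_rnn(rnn_state)
                       + self.from_img(img_feat))
        return self.out(m)                     # logits over the next word

logits = MultimodalLayer()(torch.randn(4, 256), torch.randn(4, 256), torch.randn(4, 2048))
```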
From Captions to Visual Concepts and Back, 2014
"... This paper presents a novel approach for automatically generating image descriptions: visual detectors and language models learn directly from a dataset of image captions. We use Multiple Instance Learning to train visual detectors for words that commonly occur in captions, including many different ..."
Abstract
-
Cited by 15 (1 self)
This paper presents a novel approach for automatically generating image descriptions: visual detectors and language models learn directly from a dataset of image captions. We use Multiple Instance Learning to train visual detectors for words that commonly occur in captions, including many different parts of speech such as nouns, verbs, and adjectives. The word detector outputs serve as conditional inputs to a maximum-entropy language model. The language model learns from a set of over 400,000 image descriptions to capture the statistics of word usage. We capture global semantics by re-ranking caption candidates using sentence-level features and a deep multimodal similarity model. When human judges compare the system captions to ones written by other people, the system captions have equal or better quality over 23% of the time.
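The multiple-instance view behind the word detectors can be sketched briefly: an image is treated as a bag of regions, and the image-level probability that a word applies is combined from per-region probabilities so that a word only has to fire on some region. The noisy-OR combination below is a common MIL choice assumed here for illustration; the region scores are random stand-ins for detector outputs, and the maximum-entropy language model and re-ranking stages are not reproduced.

```python
# Minimal sketch of a noisy-OR multiple-instance combination: the image-level
# probability that a word applies is high if any region makes it likely.
import torch

def noisy_or(region_probs: torch.Tensor) -> torch.Tensor:
    """region_probs: (n_regions,) per-region probabilities that a word applies."""
    return 1.0 - torch.prod(1.0 - region_probs)

region_probs = torch.sigmoid(torch.randn(12))   # stand-in detector scores for 12 regions
p_word_in_image = noisy_or(region_probs)
```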