Zero-shot learning through cross-modal transfer
In International Conference on Learning Representations (ICLR), 2013
"... This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate betw ..."
Cited by 34 (1 self)
Abstract
This work introduces a model that can recognize objects in images even if no training data is available for the object class. The only necessary knowledge about unseen visual categories comes from unsupervised text corpora. Unlike previous zero-shot learning models, which can only differentiate between unseen classes, our model can operate on a mixture of seen and unseen classes, simultaneously obtaining state-of-the-art performance on classes with thousands of training images and reasonable performance on unseen classes. This is achieved by seeing the distributions of words in texts as a semantic space for understanding what objects look like. Our deep learning model does not require any manually defined semantic or visual features for either words or images. Images are mapped to be close to semantic word vectors corresponding to their classes, and the resulting image embeddings can be used to distinguish whether an image is of a seen or unseen class. We then use novelty detection methods to differentiate unseen classes from seen classes. We demonstrate two novelty detection strategies; the first gives high accuracy on unseen classes, while the second is conservative in its prediction of novelty and keeps the seen classes' accuracy high.
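A minimal sketch of the decision rule this abstract describes, assuming image embeddings have already been mapped into the word-vector space; the function name, the distance-threshold novelty test, and the data layout are illustrative assumptions rather than the paper's actual model:

    import numpy as np

    def predict_zero_shot(image_embedding, word_vecs, seen_classes, threshold):
        """Label an image embedding that lives in the word-vector space.

        word_vecs: dict mapping class name -> word vector.
        seen_classes: set of class names that have training images.
        threshold: distance beyond which the image is treated as novel (unseen).
        """
        # Distance to the nearest seen-class word vector drives novelty detection.
        seen_dists = {c: np.linalg.norm(image_embedding - word_vecs[c])
                      for c in seen_classes}
        nearest_seen, d_seen = min(seen_dists.items(), key=lambda kv: kv[1])
        if d_seen <= threshold:
            return nearest_seen  # close enough to a seen class
        # Otherwise classify among the unseen classes by nearest word vector.
        unseen = [c for c in word_vecs if c not in seen_classes]
        return min(unseen, key=lambda c: np.linalg.norm(image_embedding - word_vecs[c]))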
Going Beyond Text: A Hybrid Image-Text Approach for Measuring Word Relatedness
"... Traditional approaches to semantic relatedness are often restricted to text-based methods, which typically disregard other multimodal knowledge sources. In this paper, we propose a novel image-based metric to estimate the relatedness of words, and demonstrate the promise of this method through compa ..."
Cited by 12 (0 self)
Abstract
Traditional approaches to semantic relatedness are often restricted to text-based methods, which typically disregard other multimodal knowledge sources. In this paper, we propose a novel image-based metric to estimate the relatedness of words, and demonstrate the promise of this method through comparative evaluations on three standard datasets. We also show that a hybrid image-text approach can lead to improvements in word relatedness, confirming the applicability of visual cues as a possible orthogonal information source.
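A hedged sketch of how a hybrid image-text relatedness score could be formed from the two modalities this abstract mentions; the linear weighting and the idea of averaging features over images retrieved for each word are assumptions for illustration, not the paper's exact metric:

    import numpy as np

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

    def hybrid_relatedness(w1, w2, text_vecs, image_vecs, alpha=0.5):
        """text_vecs / image_vecs: dicts mapping a word to its distributional
        vector and to an aggregate visual feature vector (e.g., averaged over
        images retrieved for that word). alpha balances the two modalities."""
        text_score = cosine(text_vecs[w1], text_vecs[w2])
        image_score = cosine(image_vecs[w1], image_vecs[w2])
        return alpha * text_score + (1 - alpha) * image_score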
Automatic caption generation for news images
IEEE Trans. Pattern Anal. Mach. Intell., 2013
"... This paper is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Examples include video and image retrieval as well as the development of tools that aid visually impaired individuals to access pictorial information. Our a ..."
Cited by 8 (0 self)
Abstract
This paper is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Examples include video and image retrieval as well as the development of tools that aid visually impaired individuals to access pictorial information. Our approach leverages the vast resource of pictures available on the web and the fact that many of them are captioned and colocated with thematically related documents. Our model learns to create captions from a database of news articles, the pictures embedded in them, and their captions, and consists of two stages. Content selection identifies what the image and accompanying article are about, whereas surface realization determines how to verbalize the chosen content. We approximate content selection with a probabilistic image annotation model that suggests keywords for an image. The model postulates that images and their textual descriptions are generated by a shared set of latent variables (topics) and is trained on a weakly labeled dataset (which treats the captions and associated news articles as image labels). Inspired by recent work in summarization, we propose extractive and abstractive surface realization models. Experimental results show that it is viable to generate captions that are pertinent to the specific content of an image and its associated article, while permitting creativity in the description. Indeed, the output of our abstractive model compares favorably to handwritten captions and is often superior to extractive methods.
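A minimal sketch of the extractive surface-realization idea: given keywords proposed for the image by the content-selection step, pick the article sentence that overlaps with them most. The sentence splitter and scoring function are illustrative simplifications, not the full models described in the abstract:

    import re

    def extractive_caption(article_text, keywords):
        """Return the article sentence with the highest keyword overlap."""
        sentences = re.split(r"(?<=[.!?])\s+", article_text.strip())
        keyword_set = {k.lower() for k in keywords}

        def score(sentence):
            tokens = set(re.findall(r"[a-z']+", sentence.lower()))
            return len(tokens & keyword_set)

        return max(sentences, key=score)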
Using Visual Information to Predict Lexical Preference
"... Most NLP systems make predictions based solely on linguistic (textual or spoken) input. We show how to use visual information to make better linguistic predictions. We focus on selectional preference; specifically, determining the plausible noun arguments for particular verb predicates. For each arg ..."
Cited by 7 (0 self)
Abstract
Most NLP systems make predictions based solely on linguistic (textual or spoken) input. We show how to use visual information to make better linguistic predictions. We focus on selectional preference; specifically, determining the plausible noun arguments for particular verb predicates. For each argument noun, we extract visual features from corresponding images on the web. For each verb predicate, we train a classifier to select the visual features that are indicative of its preferred arguments. We show that for certain verbs, using visual information can significantly improve performance over a baseline. For the successful cases, visual information is useful even in the presence of co-occurrence information derived from web-scale text. We assess a variety of training configurations, which vary over classes of visual features, methods of image acquisition, and numbers of images.
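A hedged sketch of the per-verb setup this abstract describes: train a classifier that separates plausible from implausible noun arguments using visual features of images retrieved for each noun. The choice of logistic regression, the feature layout, and the function names are assumptions, not the paper's pipeline:

    from sklearn.linear_model import LogisticRegression

    def train_verb_preference_model(noun_features, plausible_labels):
        """noun_features: (n_nouns, n_visual_features) array, one row per noun
        (e.g., features averaged over its web images).
        plausible_labels: 1 if the noun is an attested argument of the verb, else 0."""
        model = LogisticRegression(max_iter=1000)
        model.fit(noun_features, plausible_labels)
        return model

    def rank_arguments(model, candidate_features):
        """Return plausibility scores for candidate nouns under this verb."""
        return model.predict_proba(candidate_features)[:, 1]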
Measuring the semantic relatedness between words and images
"... Measures of similarity have traditionally focused on computing the semantic relatedness between pairs of words and texts. In this paper, we construct an evaluation framework to quantify cross-modal semantic relationships that exist between arbitrary pairs of words and images. We study the effectiven ..."
Cited by 4 (1 self)
Abstract
Measures of similarity have traditionally focused on computing the semantic relatedness between pairs of words and texts. In this paper, we construct an evaluation framework to quantify cross-modal semantic relationships that exist between arbitrary pairs of words and images. We study the effectiveness of a corpus-based approach to automatically derive the semantic relatedness between words and images, and perform empirical evaluations by measuring its correlation with human annotators.
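A minimal sketch of the evaluation protocol mentioned at the end of this abstract: compare automatic word-image relatedness scores against human judgments with a rank correlation. The data layout is an assumption for illustration:

    from scipy.stats import spearmanr

    def evaluate_relatedness(system_scores, human_scores):
        """Both arguments are score lists for the same word-image pairs,
        in the same order. Returns Spearman's rho and the p-value."""
        rho, p_value = spearmanr(system_scores, human_scores)
        return rho, p_value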
Distributional Semantics with Eyes: Using Image Analysis to Improve Computational Representations of Word Meaning
2012
"... The current trend in image analysis and multimedia is to use information extracted from text and text processing techniques to help vision-related tasks, such as automated image annotation and generating semantically rich descriptions of images. In this work, we claim that image analysis techniques ..."
Cited by 3 (0 self)
Abstract
The current trend in image analysis and multimedia is to use information extracted from text and text processing techniques to help vision-related tasks, such as automated image annotation and generating semantically rich descriptions of images. In this work, we claim that image analysis techniques can “return the favor” to the text processing community and be successfully used for a general-purpose representation of word meaning. We provide evidence that simple low-level visual features can enrich the semantic representation of word meaning with information that cannot be extracted from text alone, leading to improvement in the core task of estimating degrees of semantic relatedness between words, as well as providing a new, perceptually-enhanced angle on word semantics. Additionally, we show how distinguishing between a concept and its context in images can improve the quality of the word meaning representations extracted from images.
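A hedged sketch of one simple way visual features can enrich a text-based word representation, as this abstract argues: normalize each modality, concatenate, and measure relatedness in the fused space. The concatenation scheme and weighting parameter are illustrative assumptions, not necessarily the combination used in the paper:

    import numpy as np

    def fuse(text_vec, visual_vec, visual_weight=0.5):
        """Concatenate L2-normalized text and visual vectors for one word."""
        t = text_vec / (np.linalg.norm(text_vec) + 1e-12)
        v = visual_vec / (np.linalg.norm(visual_vec) + 1e-12)
        return np.concatenate([(1 - visual_weight) * t, visual_weight * v])

    def relatedness(fused_a, fused_b):
        """Cosine similarity between two fused word representations."""
        return float(np.dot(fused_a, fused_b) /
                     (np.linalg.norm(fused_a) * np.linalg.norm(fused_b) + 1e-12))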
Towards a semantics for distributional representations
In Proceedings of the 10th International Conference on Computational Semantics (IWCS 2013), 2013
"... ..."
Is this a wampimuk? Cross-modal mapping between distributional semantics and the visual world
"... Abstract Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learni ..."
Abstract
Following up on recent work on establishing a mapping between vector-based semantic embeddings of words and the visual representations of the corresponding objects from natural images, we first present a simple approach to cross-modal vector-based semantics for the task of zero-shot learning, in which an image of a previously unseen object is mapped to a linguistic representation denoting its word. We then introduce fast mapping, a challenging and more cognitively plausible variant of the zero-shot task, in which the learner is exposed to new objects and the corresponding words in very limited linguistic contexts. By combining prior linguistic and visual knowledge acquired about words and their objects, as well as exploiting the limited new evidence available, the learner must learn to associate new objects with words. Our results on this task pave the way to realistic simulations of how children or robots could use existing knowledge to bootstrap grounded semantic knowledge about new concepts.
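A minimal sketch of the cross-modal mapping this abstract builds on: learn a regression from visual feature vectors to word vectors on known concepts, then label an image of a new object by the word vector closest to its projection. The choice of ridge regression and all variable names are assumptions for illustration:

    import numpy as np
    from sklearn.linear_model import Ridge

    def learn_mapping(visual_train, word_train):
        """visual_train: (n, d_vis) image features for known concepts.
        word_train: (n, d_word) word vectors for the same concepts."""
        mapping = Ridge(alpha=1.0)
        mapping.fit(visual_train, word_train)
        return mapping

    def zero_shot_label(mapping, visual_vec, vocab_vecs, vocab_words):
        """Project an unseen object's image into word space and return the
        vocabulary word whose vector is most cosine-similar to the projection."""
        projected = mapping.predict(visual_vec.reshape(1, -1))[0]
        sims = vocab_vecs @ projected / (
            np.linalg.norm(vocab_vecs, axis=1) * np.linalg.norm(projected) + 1e-12)
        return vocab_words[int(np.argmax(sims))]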
AUTOMATIC CAPTION GENERATION FOR NEWS IMAGES USING EXTRACTIVE AND ABSTRACTIVE MODELS
"... Automatic image caption generation is of great interest to many image related applications. Now a day’s, whenever retrieving images from the search Engines that retrieves images without analyzing their content, simply by matching user queries against the image’s file name and format, user-annotated ..."
Abstract
Automatic image caption generation is of great interest to many image-related applications. At present, search engines retrieve images without analyzing their content, simply by matching user queries against the image's file name and format, user-annotated tags, captions, and, more generally, the text surrounding the image; the retrieved images themselves carry no accompanying textual description. We introduce the task of automatic caption generation for news images. The task fuses insights from computer vision and natural language processing and holds promise for various multimedia applications, such as image retrieval, the development of tools supporting news media management, and assistance for individuals with visual impairment. A caption generation model can be learned from weakly labelled data without costly manual involvement: instead of manually creating annotations, image captions are treated as labels for the image. Although the caption words are admittedly noisy compared to traditional human-created keywords, we show that they can be used to learn the correspondences between visual and textual modalities, and also serve as a gold standard for the caption generation task. We present extractive and abstractive caption generation models. A key aspect of our approach is to allow both the visual and textual modalities to influence the generation task.
Automatic Caption Generation for News Images
"... Abtract- This thesis is concerned with the task of automatically generating captions for images, which is important for many image related applications. Our model learns to create captions from publicly available dataset that has not been explicitly labelled for our task. A dataset consists of news ..."
Abstract
This thesis is concerned with the task of automatically generating captions for images, which is important for many image-related applications. Our model learns to create captions from a publicly available dataset that has not been explicitly labelled for our task; the dataset consists of news articles, the pictures embedded in them, and their captions. The model consists of two stages. The first stage, content selection, identifies what the image and accompanying article are about, whereas the second stage, surface realization, determines how to put the chosen content into a proper grammatical caption. For content selection, we use a probabilistic image annotation model that suggests keywords for an image. This model postulates that images and their textual descriptions are generated by a shared set of latent variables (topics) and is trained on a weakly labeled dataset.
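A hedged sketch of the content-selection step: the abstract describes a probabilistic topic model shared between images and captions, but as an illustrative simplification this transfers keywords from visually similar training images whose captions act as the weak labels. The neighbour count and feature layout are assumptions:

    import numpy as np
    from collections import Counter

    def suggest_keywords(query_features, train_features, train_caption_words,
                         k=5, n_keywords=10):
        """train_caption_words: one list of caption tokens per training image
        (the weak labels described above)."""
        # Find the k training images closest to the query image in feature space.
        dists = np.linalg.norm(train_features - query_features, axis=1)
        nearest = np.argsort(dists)[:k]
        # Rank caption words of the neighbours by frequency.
        counts = Counter(w for i in nearest for w in train_caption_words[i])
        return [w for w, _ in counts.most_common(n_keywords)]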