Results 1 - 9 of 9
Learning query-class dependent weights in automatic video retrieval
- In Proceedings of the 12th annual ACM international conference on Multimedia, 2004
Cited by 46 (13 self)
Combining retrieval results from multiple modalities plays a crucial role in video retrieval systems, especially automatic video retrieval systems without any user feedback or query expansion. However, most current systems only utilize query-independent combination or rely on explicit user weighting. In this work, we propose using query-class dependent weights within a hierarchical mixture-of-experts framework to combine multiple retrieval results. We first classify each user query into one of four predefined categories and then aggregate the retrieval results with query-class associated weights, which can be learned efficiently from the development data and generalized easily to unseen queries. Our experimental results demonstrate that performance with query-class dependent weights can considerably surpass that with query-independent weights.
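The combination step described in this abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration only: the four class labels, the modality names (`text`, `face`, `color`), the weight values, and the function name `fuse_results` are hypothetical and are not taken from the paper, where the weights are learned from development data.

```python
from typing import Dict, List

# modality -> {document id -> retrieval score}
Scores = Dict[str, Dict[str, float]]

# Hypothetical per-class modality weights (illustrative values only).
CLASS_WEIGHTS: Dict[str, Dict[str, float]] = {
    "named_person":   {"text": 0.5,  "face": 0.5, "color": 0.0},
    "named_object":   {"text": 0.75, "face": 0.0, "color": 0.25},
    "general_object": {"text": 0.5,  "face": 0.0, "color": 0.5},
    "scene":          {"text": 0.25, "face": 0.0, "color": 0.75},
}


def fuse_results(query_class: str, scores: Scores) -> List[str]:
    """Rank documents by a weighted sum of per-modality scores.

    The weights are selected by the query class, so the same set of
    per-modality scores can yield different rankings for different classes.
    """
    weights = CLASS_WEIGHTS[query_class]
    combined: Dict[str, float] = {}
    for modality, doc_scores in scores.items():
        w = weights.get(modality, 0.0)  # unknown modalities contribute nothing
        for doc_id, score in doc_scores.items():
            combined[doc_id] = combined.get(doc_id, 0.0) + w * score
    # Highest combined score first; ties are broken by document id.
    return sorted(combined, key=lambda d: (-combined[d], d))


if __name__ == "__main__":
    demo: Scores = {
        "text":  {"shot1": 1.0, "shot2": 0.0},
        "face":  {"shot1": 0.0, "shot2": 0.0},
        "color": {"shot1": 0.0, "shot2": 1.0},
    }
    print(fuse_results("named_object", demo))
    print(fuse_results("scene", demo))
```

With these toy weights, a query classified as `named_object` favors the shot with the strong text match, while a `scene` query favors the shot with the strong color match.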
Joint visual-text modeling for automatic retrieval of multimedia documents
- In MULTIMEDIA ’05: Proceedings of the 13th annual ACM international conference on Multimedia, 2005
Cited by 12 (0 self)
In this paper we describe a novel approach for jointly modeling the text and the visual components of multimedia documents for the purpose of information retrieval (IR). We propose a novel framework where individual components are developed to model different relationships between documents and queries and then combined into a joint retrieval framework. In state-of-the-art systems, the norm is a late combination of two independent systems, one analyzing just the text part of such documents and the other analyzing the visual part without leveraging any knowledge acquired in the text processing. Such systems rarely exceed the performance of any single modality (i.e. text or video) in information retrieval tasks. Our experiments indicate that allowing a rich interaction between the modalities results in a significant improvement in performance over any single modality. We demonstrate these results using the TRECVID03 corpus, which comprises 120 hours of broadcast news videos. Our results demonstrate an improvement of over 14% in IR performance over the best reported text-only baseline and rank among the best results reported on this corpus.
LyricAlly: Automatic Synchronization of Textual Lyrics to Acoustic Music Signals
Cited by 4 (0 self)
We present LyricAlly, a prototype that automatically aligns acoustic musical signals with their corresponding textual lyrics, in a manner similar to manually aligned karaoke. We tackle this problem with a multimodal approach, using an appropriate pairing of audio and text processing to create the resulting prototype. LyricAlly’s acoustic signal processing uses standard audio features, constrained and informed by the musical nature of the signal. The resulting detected hierarchical rhythm structure is utilized in singing voice detection and chorus detection to produce results of higher accuracy and lower computational cost than their respective baselines. Text processing is employed to approximate the length of the sung passages from the lyrics. Results show an average error of less than one bar for per-line alignment of the lyrics on a test bed of 20 songs (sampled from CD audio and carefully selected for variety). We perform a comprehensive set of system-wide and per-component tests and discuss their results. We conclude by outlining steps for further development. Index Terms: acoustic signal detection, acoustic signal processing, music, text processing.
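To give a sense of scale for an alignment error measured in bars, the sketch below converts a timing error in seconds into bars. It is not part of LyricAlly; the function names, the default of four beats per bar, and the example tempo are assumptions chosen for illustration.

```python
def bar_seconds(bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of one bar in seconds for a given tempo (beats per minute)."""
    return beats_per_bar * 60.0 / bpm


def error_in_bars(predicted_s: float, actual_s: float,
                  bpm: float, beats_per_bar: int = 4) -> float:
    """Absolute alignment error of one lyric line, expressed in bars."""
    return abs(predicted_s - actual_s) / bar_seconds(bpm, beats_per_bar)


if __name__ == "__main__":
    # Hypothetical example: a line predicted at 31.0 s that starts at 30.0 s,
    # in a song at 120 BPM with four beats per bar.
    print(bar_seconds(120))
    print(error_in_bars(31.0, 30.0, 120))
```

Under these assumed values a bar lasts two seconds, so a one-second offset corresponds to half a bar.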
Video Database Modeling and Temporal Pattern Retrieval using Hierarchical Markov Model Mediator
- In Proc. of the First IEEE International Workshop on Multimedia Databases and Data Management (IEEE-MDDM), in conjunction with IEEE International Conference on Data Engineering (ICDE), April 8, 2006
Cited by 2 (2 self)
The dream of pervasive multimedia retrieval and reuse will not be realized without incorporating semantics in the multimedia database. As video data penetrates many information systems, the need for database support for video data grows. Hence, we propose an innovative database modeling mechanism called Hierarchical Markov Model Mediator (HMMM), which integrates low-level features, semantic concepts, and high-level user perceptions for modeling and indexing multiple-level video objects to facilitate temporal pattern retrieval. Unlike existing database modeling methods, our approach carries a stochastic and dynamic process in both search and similarity calculation. In the retrieval of semantic event patterns, HMMM always tries to traverse the right path, and therefore it can assist in retrieving more accurate patterns quickly with lower computational costs. Moreover, HMMM supports feedback and learning strategies, which help ensure continuous improvement of the overall performance.
Multi-faceted contextual model for person identification in news video
- In Multimedia Modeling, 2006
Cited by 2 (1 self)
Person identification is very important in the domain of multimedia news, as people are often the focus of events in news stories and of interest to searchers. However, this detection is impeded by imprecise audio/visual analysis tools. In this paper, we describe a multimodal and multi-faceted approach to Person-X detection in news video. We make use of multimodal features extracted from the text, visual, and audio content inherent in news video. We also incorporate multiple external sources of news from the web and parallel news archives to extract location and temporal profiles of the persons. We call this second source of information the multi-faceted context. The multimodal, multi-faceted information is then fused using a RankBoosting approach. Experiments on TRECVID 2003 and 2004 search queries demonstrate that our approach is effective.
CLVQ: Cross-language video question/answering system
- In Proceedings of 6th IEEE International Symposium on Multimedia Software Engineering, 2004
Cited by 1 (1 self)
Multi-language information retrieval lets users browse documents in their native language, and more and more people are interested in retrieving short answers rather than full documents. In this paper, we present a cross-language video QA system, CLVQ, which processes English questions and finds answers in Chinese videos. The main contributions of this research are: (1) the application of QA technology to different media; (2) the adoption of a new answer-finding approach without human-made rules; and (3) the combination of several passage retrieval techniques. The experimental results show an answer-finding rate of 56%. The test collection consists of six Discovery movies, and the questions are from the School of Discovery web site.
Story Tracking in Video News Broadcasts
- 2004
Cited by 1 (0 self)
Since the invention of television, and later the Internet, the amount of video content available has been growing rapidly. The great mass of visual material is an invaluable source of information, but its usefulness is limited by the available means of accessing it and tailoring it to the needs of an individual. Long experience with text as a medium for conveying information has allowed us to develop relatively effective methods of dealing with textual data. Unfortunately, the currently available techniques for accessing and processing video data are largely inadequate for the needs of its potential users. Hence video material remains a valuable but grossly untapped resource. In the domain of video news sources, this problem is especially severe. Television news stations broadcast continuous up-to-the-minute information from around the globe. For any individual viewer, only a small portion of this news stream is of interest, yet currently no methods exist that would allow the viewer to filter and monitor only the interesting news.
Natural Language Querying for Video Databases
Cited by 1 (0 self)
Video databases have become popular in various areas due to recent advances in technology. Video archive systems need user-friendly interfaces to retrieve video frames. In this paper, a user interface based on natural language processing (NLP) to a video database system is described. The video database is based on a content-based spatio-temporal video data model. The data model is focused on the semantic content, which includes objects, activities, and spatial properties of objects. Spatio-temporal relationships between video objects, as well as trajectories of moving objects, can be queried with this data model. In this video database system, a natural language interface enables flexible querying. The queries, which are given as English sentences, are parsed using Link Parser. The semantic representations of the queries are extracted from their syntactic structures using information extraction techniques. The extracted semantic representations are used to call the related parts of the underlying video database system to return the results of the queries. Not only exact matches but also similar objects and activities are returned from the database with the help of the conceptual ontology module. This module is implemented using a distance-based method of semantic similarity search on the semantic domain-independent ontology WordNet.
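A distance-based similarity search over an ontology can be sketched as follows. This is only an illustration under stated assumptions: the tiny hand-written `PARENT` hierarchy stands in for WordNet, the `1 / (1 + distance)` formula is one common path-based choice rather than the measure the authors necessarily used, and all function names are hypothetical.

```python
from typing import Dict, List

# Toy is-a hierarchy (child -> parent); a stand-in for a real ontology.
PARENT: Dict[str, str] = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "object",
    "dog": "animal",
    "animal": "object",
}


def ancestors(term: str) -> List[str]:
    """Return [term, parent, grandparent, ...] up to the root."""
    chain = [term]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def path_distance(a: str, b: str) -> int:
    """Number of is-a edges on the path between a and b via their
    lowest common ancestor."""
    depth_from_a = {node: depth for depth, node in enumerate(ancestors(a))}
    for depth_from_b, node in enumerate(ancestors(b)):
        if node in depth_from_a:
            return depth_from_a[node] + depth_from_b
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def similarity(a: str, b: str) -> float:
    """Map path distance to a score in (0, 1]; identical terms score 1.0."""
    return 1.0 / (1.0 + path_distance(a, b))


def similar_terms(query: str, candidates: List[str], threshold: float) -> List[str]:
    """Return candidates whose similarity to the query meets the threshold."""
    return [c for c in candidates if similarity(query, c) >= threshold]


if __name__ == "__main__":
    print(path_distance("car", "truck"))
    print(similar_terms("car", ["car", "truck", "dog"], threshold=0.3))
```

In this toy hierarchy a query for `car` would also return `truck`, which shares the parent `vehicle`, but not `dog`, which is related only through the root.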
Semantic Multi-modal Analysis, Structuring, and Visualization for Candid Personal Interaction Videos
Videos are rich in multimedia content and semantics, which should be used by video browsers to better present the audio-visual information to the viewer. Ubiquitous video players allow for content to be scanned linearly, rarely providing summaries or methods for searching. Through analysis of audio and video tracks, it is possible to extract text transcripts from audio, displayed text from video, and higher-level semantics through speaker identification and scene analysis. External data sources, when available, can be used to cross-reference the video content and impose a structure for organization. Various research tools have addressed video summarization and browsing using one or more of these modalities; however, most of them assume edited videos as input. We focus our research on genres in personal interaction videos and collections of such videos in their unedited form. We present and verify formal models for their structure, and develop methods for their automatic analysis, summarization and indexing. We specify the characteristic semantic components of three related genres of candidly captured videos: formal instructions or lectures, student team project presentations, and discussions. For each genre, we design and

