Image retrieval: ideas, influences, and trends of the new age
ACM Computing Surveys, 2008
"... We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger ass ..."
Abstract
-
Cited by 464 (13 self)
- Add to MetaCart
(Show Context)
We have witnessed great interest and a wealth of promise in content-based image retrieval as an emerging technology. While the last decade laid foundation to such promise, it also paved the way for a large number of new techniques and systems, got many new people involved, and triggered stronger association of weakly related fields. In this article, we survey almost 300 key theoretical and empirical contributions in the current decade related to image retrieval and automatic image annotation, and in the process discuss the spawning of related subfields. We also discuss significant challenges involved in the adaptation of existing image retrieval techniques to build systems that can be useful in the real world. In retrospect of what has been achieved so far, we also conjecture what the future may hold for image retrieval research.
Video Suggestion and Discovery for YouTube: Taking Random Walks Through the View Graph
"... The rapid growth of the number of videos in YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present ..."
Abstract
-
Cited by 93 (6 self)
- Add to MetaCart
(Show Context)
The rapid growth of the number of videos on YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user–video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three-month snapshot of live data from YouTube.
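As a rough illustration of the graph-propagation idea named above (not the paper's exact Adsorption formulation), the following Python sketch spreads label distributions over a user-video graph; the graph, weights, seed labels, and update rule are all assumptions made for the example.

```python
# Minimal sketch of adsorption-style label propagation on a user-video
# graph. Nodes, weights, and the update rule are illustrative assumptions.
from collections import defaultdict

def propagate_labels(edges, seed_labels, n_iters=10):
    """edges: list of (node_a, node_b, weight); seed_labels: node -> {label: score}."""
    # Build a weighted adjacency list for the undirected view graph.
    adj = defaultdict(list)
    for a, b, w in edges:
        adj[a].append((b, w))
        adj[b].append((a, w))

    labels = {n: dict(d) for n, d in seed_labels.items()}
    for _ in range(n_iters):
        new_labels = {}
        for node, nbrs in adj.items():
            total_w = sum(w for _, w in nbrs)
            dist = defaultdict(float)
            # Each node absorbs its neighbours' label distributions,
            # weighted by edge strength.
            for nbr, w in nbrs:
                for lbl, score in labels.get(nbr, {}).items():
                    dist[lbl] += (w / total_w) * score
            new_labels[node] = dict(dist)
        # Seed nodes re-inject their original labels every round.
        for node, d in seed_labels.items():
            new_labels[node] = dict(d)
        labels = new_labels
    return labels

# Example: suggest videos to user u1 via what similar users watched.
edges = [("u1", "v1", 1.0), ("u2", "v1", 1.0), ("u2", "v2", 1.0)]
seeds = {"v1": {"v1": 1.0}, "v2": {"v2": 1.0}}
print(propagate_labels(edges, seeds)["u1"])
```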
The semantic pathfinder: Using an authoring metaphor for generic multimedia indexing
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2006
"... ..."
(Show Context)
Informedia at TRECVID 2003: Analyzing and searching broadcast news video
In Proceedings of TRECVID, 2003
"... We submitted a number of semantic classifiers, most of which were merely trained on keyframes. We also experimented with runs of classifiers were trained exclusively on text data and relative time within the video, while a few were trained using all available multiple modalities. 1.2 Interactive sea ..."
Abstract
-
Cited by 55 (21 self)
- Add to MetaCart
(Show Context)
We submitted a number of semantic classifiers, most of which were trained solely on keyframes. We also experimented with runs of classifiers that were trained exclusively on text data and relative time within the video, while a few were trained using all available modalities.
1.2 Interactive search
This year, we submitted two runs using different versions of the Informedia systems. In one run, a version identical to last year's interactive system was used by five researchers, who split the topics among themselves. The system interface emphasizes text queries, allowing search across ASR, closed captions, and OCR text. The result set can then be manipulated through:
• storyboards of images spanning video story segments
• emphasis on shots matching a user's query, to reduce the image count to a manageable size
• resolution and layout under user control
• additional filtering through shot classifiers such as outdoors and shots with people
• display of filter counts and distributions to guide their use in manipulating storyboard views.
In the best-performing interactive run, a single researcher handled all topics with an improved version of the system, which allowed more effective browsing and visualization of the results of text queries using a …
Multimodal fusion for multimedia analysis: a survey
2010
"... This survey aims at providing multimedia researchers with a state-of-the-art overview of fusion strategies, which are used for combining multiple modalities in order to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several c ..."
Abstract
-
Cited by 52 (1 self)
- Add to MetaCart
This survey aims to provide multimedia researchers with a state-of-the-art overview of fusion strategies used for combining multiple modalities to accomplish various multimedia analysis tasks. The existing literature on multimodal fusion research is presented through several classifications based on the fusion methodology and the level of fusion (feature, decision, and hybrid). The fusion methods are described in terms of their basic concept, advantages, weaknesses, and usage in various analysis tasks as reported in the literature. Moreover, several distinctive issues that influence a multimodal fusion process, such as the use of correlation and independence, confidence level, contextual information, synchronization between different modalities, and optimal modality selection, are also highlighted. Finally, we present the open issues for further research in the area of multimodal fusion.
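Since the survey's central distinction is between feature-level (early) and decision-level (late) fusion, a minimal sketch may help; the toy features, labels, and choice of logistic regression are assumptions for illustration, not anything prescribed by the survey.

```python
# Contrast feature-level (early) and decision-level (late) fusion on
# synthetic "audio" and "visual" features. Data and classifier are toys.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
audio = rng.normal(size=(n, 8))    # stand-in audio features
visual = rng.normal(size=(n, 16))  # stand-in visual features
y = (audio[:, 0] + visual[:, 0] > 0).astype(int)

# Early fusion: concatenate modality features, train one model.
early = LogisticRegression().fit(np.hstack([audio, visual]), y)
early_pred = early.predict(np.hstack([audio, visual]))

# Late fusion: train one model per modality, then average the decisions.
clf_a = LogisticRegression().fit(audio, y)
clf_v = LogisticRegression().fit(visual, y)
fused = 0.5 * clf_a.predict_proba(audio) + 0.5 * clf_v.predict_proba(visual)
late_pred = fused.argmax(axis=1)

print((early_pred == y).mean(), (late_pred == y).mean())
```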
I2T: Image Parsing to Text Description
"... In this paper, we present an image parsing to text generation (I2T) framework that generates natural language descriptions from image and video content. This framework converts the harder content based image and video retrieval problem into an easier text search problem with potential applications ..."
Abstract
-
Cited by 52 (2 self)
- Add to MetaCart
(Show Context)
In this paper, we present an image parsing to text generation (I2T) framework that generates natural language descriptions from image and video content. This framework converts the harder content-based image and video retrieval problem into an easier text search problem, with potential applications in Internet search and visual data mining. The proposed I2T framework follows three steps. 1) Input images or video frames are decomposed into their constituent visual patterns by an image parsing engine, which outputs the scene as a parse graph representation, in a spirit similar to parsing sentences in speech and natural language. 2) The parse graphs are converted into a semantic representation in the Web Ontology Language (OWL) format, which is a formal and unambiguous knowledge representation. 3) A text generation engine converts the semantic representation into a semantically meaningful, human-readable, and queryable text report. The success of the above framework relies on two knowledge bases. The first is a visual knowledge base that provides top-down hypotheses for image parsing and serves as an image ontology for translating parse graphs into semantic representations. The core of the visual knowledge base is an And-Or graph representation. It entails vocabularies of visual elements, including pixels, primitives, parts, objects, and scenes, and a stochastic image grammar specifying compositional, spatial, temporal, and functional relations between visual elements. We developed a large-scale ground-truth image database and interactive image annotation software to build the And-Or graph from real-world image instances. The second knowledge base is a general knowledge base that interconnects several domain-specific ontologies in the form of the Semantic Web; it further enriches the semantic representation of visual content with domain-specific information. Finally, we demonstrate a case study in video surveillance: an end-to-end system that automatically infers video events and generates natural language descriptions of video scenes. Experiments with maritime and urban scenes indicate the feasibility of the proposed approach.
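A minimal sketch of the three-step I2T flow described above (parse graph, semantic triples, text generation); the scene, relation names, and templates are invented for illustration and do not reproduce the paper's And-Or graph or OWL schema.

```python
# Toy walk-through of the I2T pipeline on a single invented scene.

# Step 1 output: a parse graph, here just (subject, relation, object) edges.
parse_graph = [
    ("boat_1", "is_a", "boat"),
    ("boat_1", "moving_toward", "dock_1"),
    ("dock_1", "is_a", "dock"),
]

# Step 2: convert edges into pseudo-OWL triples (namespaced strings here,
# standing in for a real RDF/OWL serialization).
triples = [(f"scene:{s}", f"rel:{r}", f"scene:{o}") for s, r, o in parse_graph]

# Step 3: template-based text generation from the semantic representation.
templates = {
    "is_a": "{s} is a {o}.",
    "moving_toward": "{s} is moving toward {o}.",
}
for s, r, o in parse_graph:
    print(templates[r].format(s=s, o=o))
```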
A unified framework for semantic shot classification in sports video
IEEE Transactions on Multimedia, 2002
"... In this demonstration, we present a unified framework for semantic shot classification in sports videos. Unlike previous approaches, which focus on clustering by aggregating shots with similar low-level features, the proposed scheme makes use of domain knowledge of specific sport to perform a top-do ..."
Abstract
-
Cited by 42 (8 self)
- Add to MetaCart
(Show Context)
In this demonstration, we present a unified framework for semantic shot classification in sports videos. Unlike previous approaches, which focus on clustering by aggregating shots with similar low-level features, the proposed scheme makes use of domain knowledge of a specific sport to perform top-down video shot classification. That is, by combining inherent game rules with television field production practices, we predefine for each sport, through careful observation, a set of semantic shots that covers 90 to 95% of sports broadcast video. Under the supervision of this predefined shot set, we map low-level features to high-level semantic video shot attributes such as dominant object motion (a player), persistent camera panning, and court shape. Based on an appropriate fusion of those high-level shot attributes, we classify video shots into several predefined categories, each of which has a clear semantic meaning. The experiments show that, compared to traditional clustering methods and key-frame-based analysis, the proposed framework offers a strong capability for semantics mining. Owing to the strong structural constraints and limited camera work of sports broadcasts, this framework provides a generic solution for sports video shot classification that can be adapted to a new sport type without major modification. With correctly classified sports video shots, further structural and temporal analysis is greatly facilitated.
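A hedged sketch of the top-down idea: map a few high-level shot attributes to predefined semantic categories with simple rules. The attribute names and rules below are illustrative guesses, not the paper's actual fusion scheme.

```python
# Rule-based fusion of high-level shot attributes into semantic shot
# classes, loosely following the abstract's example attributes.

def classify_shot(attrs):
    """attrs: dict with keys 'court_shape', 'camera_pan', 'dominant_motion'."""
    if attrs["court_shape"] == "full_court" and attrs["camera_pan"]:
        return "court_view"        # wide shot following the play
    if attrs["dominant_motion"] == "player" and not attrs["camera_pan"]:
        return "player_closeup"    # a single player fills the frame
    return "other"                 # crowd, replays, commercials, ...

print(classify_shot({"court_shape": "full_court",
                     "camera_pan": True,
                     "dominant_motion": "none"}))  # -> "court_view"
```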
Speaker role recognition in multiparty audio recordings using Social Network Analysis and duration distribution modeling
IEEE Transactions on Multimedia, 2007
"... Abstract — This paper presents two approaches for speaker role recognition in multiparty audio recordings. The experiments are performed over a corpus of 96 radio bulletins corresponding to roughly 19 hours of material. Each recording involves, on average, eleven speakers playing one among six roles ..."
Abstract
-
Cited by 35 (6 self)
- Add to MetaCart
(Show Context)
This paper presents two approaches for speaker role recognition in multiparty audio recordings. The experiments are performed over a corpus of 96 radio bulletins corresponding to roughly 19 hours of material. Each recording involves, on average, eleven speakers, each playing one of six roles from a predefined set. Both proposed approaches start by automatically segmenting the recordings into single-speaker segments, but perform role recognition using different techniques. The first approach is based on Social Network Analysis; the second relies on the distribution of intervention durations across different speakers. The two approaches are used both separately and in combination, and the results show that around 85 percent of the recording time can be labeled correctly in terms of role.
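A minimal sketch of the duration-distribution side of the method: model each role's intervention lengths with a Gaussian and label a speaker by likelihood. Role names and parameters are invented, and the Social Network Analysis component is omitted.

```python
# Assign a role to a speaker by scoring their intervention durations
# against per-role Gaussian duration models (invented parameters).
import math

role_models = {              # role -> (mean, std) of duration in seconds
    "anchor": (25.0, 10.0),
    "guest": (60.0, 20.0),
    "weather": (90.0, 15.0),
}

def log_likelihood(durations, mean, std):
    # Gaussian log-likelihood up to a shared additive constant.
    return sum(-0.5 * ((d - mean) / std) ** 2 - math.log(std) for d in durations)

def recognize_role(durations):
    return max(role_models, key=lambda r: log_likelihood(durations, *role_models[r]))

print(recognize_role([22.0, 31.0, 18.0]))  # short interventions -> "anchor"
```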
Video Data Mining: Semantic Indexing and Event Detection from the Association Perspective
IEEE Transactions on Knowledge and Data Engineering, 2005
"... Abstract—Advances in the media and entertainment industries, including streaming audio and digital TV, present new challenges for managing and accessing large audio-visual collections. Current content management systems support retrieval using low-level features, such as motion, color, and texture. ..."
Abstract
-
Cited by 29 (0 self)
- Add to MetaCart
(Show Context)
Advances in the media and entertainment industries, including streaming audio and digital TV, present new challenges for managing and accessing large audio-visual collections. Current content management systems support retrieval using low-level features, such as motion, color, and texture. However, low-level features often have little meaning for naive users, who much prefer to identify content using high-level semantics or concepts. This creates a gap between systems and their users that must be bridged for these systems to be used effectively. To this end, in this paper we first present a knowledge-based video indexing and content management framework for domain-specific videos (using basketball video as an example). We provide a solution for exploring video knowledge by mining associations from video data. Explicit definitions and evaluation measures (e.g., temporal support and confidence) for video associations are proposed by integrating the inherent features of video data. Our approach uses video processing techniques to find visual and audio cues (e.g., court field, camera motion activities, and applause), introduces multilevel sequential association mining to explore associations among the audio and visual cues, classifies the associations by assigning each of them a class label, and uses their appearances in the video to construct video indices. Our experimental results demonstrate the performance of the proposed approach.
Index Terms: Video mining, multimedia systems, database management, knowledge-based systems.
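As a toy analogue of mining temporal associations among cues, the sketch below counts how often a cue sequence occurs within a time window and derives a support count and a confidence-style ratio; the paper's exact definitions of temporal support and confidence differ in detail.

```python
# Count occurrences of a temporal cue pattern (e.g., camera_motion
# followed by applause within `window` seconds) in a cue event stream.

def temporal_support(events, pattern, window):
    """events: time-sorted list of (time, cue); pattern: list of cue names."""
    hits = 0
    for i, (t0, cue0) in enumerate(events):
        if cue0 != pattern[0]:
            continue
        k = 1  # next pattern element to match
        for t1, cue1 in events[i + 1:]:
            if t1 - t0 > window:
                break  # outside the temporal window
            if cue1 == pattern[k]:
                k += 1
                if k == len(pattern):
                    hits += 1
                    break
    return hits

events = [(1.0, "camera_motion"), (2.5, "applause"),
          (10.0, "camera_motion"), (30.0, "applause")]
n_ab = temporal_support(events, ["camera_motion", "applause"], window=5.0)
n_a = sum(1 for _, c in events if c == "camera_motion")
print(n_ab, n_ab / n_a)  # support count and a confidence-style ratio
```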
Audiovisual integration for tennis broadcast structuring
In International Workshop on Content-Based Multimedia Indexing (CBMI'03), 2003
"... This paper focuses on the integration of multimodal features for sport video structure analysis. The method relies on a statistical model which takes into account both the shot content and the interleaving of shots. This stochastic modelling is performed in the global framework of Hidden Markov Mode ..."
Abstract
-
Cited by 29 (4 self)
- Add to MetaCart
(Show Context)
This paper focuses on the integration of multimodal features for sports video structure analysis. The method relies on a statistical model that takes into account both the shot content and the interleaving of shots. This stochastic modelling is performed in the global framework of Hidden Markov Models (HMMs), which can be efficiently applied to merge audio and visual cues. Our approach is validated in the particular domain of tennis videos. The model integrates prior information about tennis content and editing rules. The basic temporal unit is the video shot. Visual features are used to characterize the type of shot view. Audio features describe the audio events within a video shot. As a result, typical tennis scenes are simultaneously segmented and identified.
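A minimal Viterbi sketch of the HMM idea: hidden scene types emit per-shot audio-visual observations, and decoding recovers the scene sequence. States, observation symbols, and probabilities are invented for illustration, not the paper's trained model.

```python
# Decode a sequence of per-shot audio-visual observations into hidden
# tennis scene types with Viterbi over a toy HMM.
import numpy as np

states = ["rally", "replay", "break"]
obs_symbols = {"court+ball_hits": 0, "closeup+applause": 1, "ads+music": 2}

start = np.log([0.6, 0.2, 0.2])
trans = np.log([[0.7, 0.2, 0.1],   # rally  -> rally/replay/break
                [0.4, 0.4, 0.2],   # replay -> ...
                [0.3, 0.2, 0.5]])  # break  -> ...
emit = np.log([[0.8, 0.15, 0.05],  # per-state observation probabilities
               [0.2, 0.7, 0.1],
               [0.1, 0.2, 0.7]])

def viterbi(obs):
    v = start + emit[:, obs[0]]
    back = []
    for o in obs[1:]:
        scores = v[:, None] + trans          # prev state x next state
        back.append(scores.argmax(axis=0))   # best predecessor per state
        v = scores.max(axis=0) + emit[:, o]
    path = [int(v.argmax())]
    for bp in reversed(back):                # backtrack the best path
        path.append(int(bp[path[-1]]))
    return [states[s] for s in reversed(path)]

shots = [obs_symbols[s] for s in
         ["court+ball_hits", "court+ball_hits", "closeup+applause", "ads+music"]]
print(viterbi(shots))
```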