Multimodal Video Indexing: A Review of the State-of-the-art
- Multimedia Tools and Applications, 2003
Cited by 103 (18 self)
Efficient and effective handling of video documents depends on the availability of indexes. Manual indexing is unfeasible for large video collections. In this paper we survey several methods aiming at automating this time- and resource-consuming process. Good reviews on single-modality-based video indexing have appeared in the literature. Effective indexing, however, requires a multimodal approach in which either the most appropriate modality is selected or the different modalities are used in collaborative fashion. Therefore, instead of separately treating the different information sources involved, and their specific algorithms, we focus on the similarities and differences between the modalities. To that end we put forward a unifying and multimodal framework, which views a video document from the perspective of its author. This framework forms the guiding principle for identifying index types, for which automatic methods are found in the literature. It furthermore forms the basis for categorizing these different methods.
Progress in camera-based document image analysis
- Proc. ICDAR’03, 2003
Cited by 28 (0 self)
The increasing availability of high-performance, low-priced, portable digital imaging devices has created a tremendous opportunity for supplementing traditional scanning for document image acquisition. Digital cameras attached to cellular phones, PDAs, or as standalone still or video devices are highly mobile and easy to use; they can capture images of any kind of document including very thick books, historical pages too fragile to touch, and text in scenes; and they are much more versatile than desktop scanners. Should robust solutions to the analysis of documents captured with such devices become available, there is clearly a demand from many domains. Traditional scanner-based document analysis techniques provide us with a good reference and starting point, but they cannot be used directly on camera-captured images. Camera-captured images can suffer from low resolution, blur, and perspective distortion, as well as complex layout and interaction of the content and background. In this paper we present a survey of application domains, technical challenges and solutions for recognizing documents captured by digital cameras. We begin by describing typical imaging devices and the imaging process. We discuss document analysis from a single camera-captured image as well as multiple frames and highlight some sample applications under development and feasible ideas for future development.
Automatic Performance Evaluation for Video Text Detection
- Sixth Int. Conf. on Document Analysis and Recognition (ICDAR 2001), 2001
Cited by 17 (2 self)
In this paper, we propose an objective, comprehensive and difficulty-independent performance evaluation protocol for video text detection algorithms. The protocol includes a positive set and a negative set of indices at textbox level, which evaluate the detection quality in terms of both location accuracy and fragmentation of the detected textboxes. In the protocol, we assign a detection difficulty (DD) level to each ground truth textbox. The performance indices can then be normalized with respect to the textbox DD level and are therefore independent of the ground truth difficulty. We also assign a detection importance (DI) level to each ground truth textbox. The overall detection rate is the DI-weighted average of the detection qualities of all ground truth textboxes, which makes the detection rate more accurate to reveal the real performance. The automatic performance evaluation scheme has been applied on a text detection approach to determine the best parameters that can yield the best detection results.
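The DI-weighted average described in this abstract can be illustrated with a minimal sketch. The function name, the sample numbers, and the assumption that each detection quality lies in [0, 1] are hypothetical; the abstract does not say how the per-textbox qualities or DI levels are computed, so they are taken as given inputs here.

```python
def overall_detection_rate(qualities, importances):
    """DI-weighted average of per-textbox detection qualities.

    qualities   -- one detection quality per ground-truth textbox (assumed in [0, 1])
    importances -- one detection importance (DI) level per ground-truth textbox
    """
    if len(qualities) != len(importances):
        raise ValueError("one importance weight is needed per textbox")
    total_weight = sum(importances)
    if total_weight == 0:
        raise ValueError("at least one textbox must have non-zero importance")
    return sum(q * w for q, w in zip(qualities, importances)) / total_weight


# Hypothetical example: three ground-truth textboxes.
qualities = [1.0, 0.5, 0.0]    # fully detected, half detected, missed
importances = [3.0, 1.0, 1.0]  # the first textbox matters most
rate = overall_detection_rate(qualities, importances)
```

With these made-up inputs the missed textbox lowers the rate less than it would in an unweighted mean, because it carries a small DI weight.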
A Hierarchical Access Control Model for Video Database Systems
- ACM TRANS. ON INFO. SYST, 2003
Cited by 16 (7 self)
... In this paper, we propose a novel approach to support multilevel access control in video databases. Our access control technique combines a video database indexing mechanism with a hierarchical organization of visual concepts (i.e., video database indexing units), so that different classes of users can access different video elements or even the same video element with different quality levels according to their permissions. These video elements, which, in our access control mechanism, are used for specifying the authorization objects, can be a semantic cluster, a subcluster, a video scene, a video shot, a video frame, or even a salient object (i.e., region of interest). In the paper, we first introduce our techniques for obtaining these multilevel
An automatic performance evaluation protocol for video text detection algorithms
- IEEE Transactions on Circuits and Systems for Video Technology, 2004
Cited by 16 (1 self)
Text presented in the videos provides important supplemental information for video indexing and retrieval. Many efforts have been made for text detection in videos. However, there is still a lack of performance evaluation protocols for video text detection. In this paper, we propose an objective and comprehensive performance evaluation protocol for video text detection algorithms. The protocol includes a positive set and a negative set of indices at textbox level, which evaluate the detection quality in terms of both location accuracy and fragmentation of the detected textboxes. In the protocol, we assign a detection difficulty (DD) level to each ground truth textbox. The performance indices can then be normalized with respect to the textbox DD level and are therefore tolerant to different ground-truth difficulty to a certain degree. We also assign a detectability index (DI) value to each ground truth textbox. The overall detection rate is the DI-weighted average of the detection qualities of all ground truth textboxes, which makes the detection rate more accurate to reveal the real performance. The automatic performance evaluation scheme has been applied to performance evaluation of a text detection approach to determine its best thresholds that can yield the best detection results. The protocol has also been employed to compare the performances of several text detection systems. Hence, we believe that the proposed protocol can be used to compare the performance of different video/image text detection algorithms/systems, and can even help improve, select, and design new text detection methods. Index Terms: Performance Evaluation, Video Text Detection.
A unified framework for semantic shot classification in sports video
- Transactions on Multimedia, 2002
Cited by 13 (2 self)
In this demonstration, we present a unified framework for semantic shot classification in sports videos. Unlike previous approaches, which focus on clustering by aggregating shots with similar low-level features, the proposed scheme makes use of domain knowledge of a specific sport to perform a top-down video shot classification. That is, combining with inherent game rules and television field production, for each sport through careful observations we predefine a set of semantic shots which cover 90 to 95% of sports broadcasting video. Under the supervision of the predefined shot set, we map the low-level features to high-level semantic video shot attributes such as dominant object motion (a player), persistent camera panning, and court shape. On the basis of the appropriate fusion of those high-level shot attributes, we classify video shots into several predefined categories, each of which has a clear semantic meaning. The experiments show that, compared to traditional clustering methods and key-frame based analysis, the proposed framework features great capability of semantics mining. Due to remarkable structure constraints and limited sports photography, this framework provides a generic solution for sports video shot classification, which can be adapted to a new sport type without major modification. With correctly classified sports video shots further structural and temporal analysis will be greatly facilitated.
A Laplacian Method for Video Text Detection
Cited by 11 (9 self)
In this paper, we propose an efficient text detection method based on the Laplacian operator. The maximum gradient difference value is computed for each pixel in the Laplacian-filtered image. K-means is then used to classify all the pixels into two clusters: text and non-text. For each candidate text region, the corresponding region in the Sobel edge map of the input image undergoes projection profile analysis to determine the boundary of the text blocks. Finally, we employ empirical rules to eliminate false positives based on geometrical properties. Experimental results show that the proposed method is able to detect text of different fonts, contrast and backgrounds. Moreover, it outperforms three existing methods in terms of detection and false positive rates.
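The first three steps named in this abstract (Laplacian filtering, a per-pixel maximum gradient difference, and two-cluster K-means) might be sketched as follows. This is not the authors' implementation: the 4-neighbour kernel, the definition of the maximum gradient difference as max minus min over a horizontal 1 x `width` window, the window width, and clustering on that single value are all assumptions made for illustration. The Sobel projection-profile step and the empirical false-positive rules are not covered.

```python
def laplacian(img):
    """Apply a 3x3 (4-neighbour) Laplacian; border pixels are left at 0."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (img[y - 1][x] + img[y + 1][x] + img[y][x - 1]
                         + img[y][x + 1] - 4 * img[y][x])
    return out


def max_gradient_difference(lap, width=5):
    """Assumed MGD: max minus min of the Laplacian in a 1 x width window."""
    half = width // 2
    h, w = len(lap), len(lap[0])
    mgd = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = lap[y][max(0, x - half):x + half + 1]
            mgd[y][x] = max(window) - min(window)
    return mgd


def two_means(values, iterations=20):
    """1-D K-means with K=2; returns True for values in the high cluster."""
    lo, hi = min(values), max(values)
    for _ in range(iterations):
        labels = [abs(v - hi) < abs(v - lo) for v in values]
        high = [v for v, is_high in zip(values, labels) if is_high]
        low = [v for v, is_high in zip(values, labels) if not is_high]
        if not high or not low:
            break  # degenerate case, e.g. a constant image
        lo, hi = sum(low) / len(low), sum(high) / len(high)
    return [abs(v - hi) < abs(v - lo) for v in values]


def text_candidate_mask(img, width=5):
    """Label each pixel True (text candidate) or False (non-text)."""
    mgd = max_gradient_difference(laplacian(img), width)
    flat = [v for row in mgd for v in row]
    labels = two_means(flat)
    w = len(img[0])
    return [labels[i:i + w] for i in range(0, len(flat), w)]


# Synthetic 7x12 image: stroke-like stripes on the left, flat on the right.
row = [255 if (x % 2 == 1 and x < 6) else 0 for x in range(12)]
img = [row[:] for _ in range(7)]
mask = text_candidate_mask(img)
```

On this synthetic input the striped half falls into the high cluster and the flat half into the low one; real frames would need the later verification steps the abstract mentions.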
Survey of Compressed-Domain Features used in Audio-Visual Indexing and Analysis
Cited by 11 (0 self)
In this paper, we attempt to provide a comprehensive and high-level review of audiovisual features that can be extracted from the standard compressed domains, such as MPEG-1 and MPEG-2. The paper is motivated by the myriad of active research works in extraction and application of compressed-domain features in various fields, such as indexing, filtering, and manipulation. Compressed-domain approaches avoid expensive computation and memory requirements involved in decoding and/or re-encoding. Selected features are categorized into four groups -- spatial visual (e.g., color, texture, edge, shape), motion (e.g., motion field, trajectory), audio (e.g., energy, spectral features, pitch), and coding (e.g., bit rate, frame/block type). For each feature, we briefly discuss the extraction methods, computational complexity, potential effectiveness in applications, and possible limitations caused by compressed-domain approaches. Finally, we briefly describe audio-visual features specified in the MPEG-7 standard and discuss the possibility of extracting them in the compressed domain.
Extraction and Recognition of Artificial Text in Multimedia Documents
- 2002
Cited by 11 (5 self)
The systems currently available for content-based image and video retrieval work without semantic knowledge, i.e. they use image processing methods to extract low-level features of the data. The similarity obtained by these approaches does not always correspond to the similarity a human user would expect. A way to include more semantic knowledge into the indexing process is to use the text included in the images and video sequences. It is rich in information but easy to use, e.g. by keyword-based queries. In this paper we present an algorithm to localize artificial text in images and videos using a measure of accumulated gradients and morphological processing. The quality of the localized text is improved by robust multiple frame integration.
Automatic Location of Text in Video Frames
- Proceedings of ACM Multimedia 2001 Workshops: Multimedia Information Retrieval (MIR2001), 2001
Cited by 10 (3 self)
A new automatic text location approach for videos is proposed. First of all, the corner points of the selected video frames are detected. After deleting some isolated corners, we merge the remaining corners to form candidate text regions. The regions are then decomposed vertically and horizontally using edge maps of the video frames to get candidate text lines. Finally, a text box verification step based on the features derived from edge maps is taken to significantly reduce false alarms. Experimental results show that the new text location scheme proposed in this paper is accurate.
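The corner-filtering and corner-merging steps in this abstract could look roughly like the sketch below. The abstract gives no criteria, so everything specific here is an assumption: a corner is treated as isolated when no other corner lies within a hypothetical `radius`, and remaining corners are merged by linking neighbours within that same radius and taking each group's bounding box. Corner detection itself, the edge-map decomposition, and the verification step are not shown; the corner coordinates are assumed to be given.

```python
def remove_isolated(corners, radius):
    """Keep corners that have at least one other corner within `radius`."""
    r2 = radius * radius
    kept = []
    for i, (x, y) in enumerate(corners):
        if any(j != i and (x - u) ** 2 + (y - v) ** 2 <= r2
               for j, (u, v) in enumerate(corners)):
            kept.append((x, y))
    return kept


def merge_corners(corners, radius):
    """Group corners linked by chains of neighbours; return bounding boxes."""
    r2 = radius * radius
    parent = list(range(len(corners)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, (x, y) in enumerate(corners):
        for j in range(i + 1, len(corners)):
            u, v = corners[j]
            if (x - u) ** 2 + (y - v) ** 2 <= r2:
                parent[find(i)] = find(j)

    groups = {}
    for i, point in enumerate(corners):
        groups.setdefault(find(i), []).append(point)

    boxes = []
    for points in groups.values():
        xs = [p[0] for p in points]
        ys = [p[1] for p in points]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)  # (x_min, y_min, x_max, y_max) per candidate region


# Hypothetical corner points: two clusters and one stray corner at (100, 50).
corners = [(10, 10), (14, 10), (18, 12), (100, 50), (60, 40), (63, 41)]
kept = remove_isolated(corners, radius=6)
boxes = merge_corners(kept, radius=6)
```

The pairwise distance checks are quadratic in the number of corners, which is acceptable for a sketch; a grid or spatial index would be the natural replacement for larger frames.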

