Short-Term Audio-Visual Atoms for Generic Video Concept Classification
Cited by 4 (1 self)
We investigate the challenging problem of joint audio-visual analysis of generic videos for semantic concept detection. We propose a novel representation, the Short-term Audio-Visual Atom (S-AVA), for improved concept detection. An S-AVA is defined as a short-term region track associated with regional visual features and background audio features. An effective algorithm, named Short-Term Region tracking with joint Point Tracking and Region Segmentation (STR-PTRS), is developed to extract S-AVAs from generic videos under challenging conditions such as uneven lighting, clutter, occlusions, and complicated motions of both objects and camera. Discriminative audio-visual codebooks are constructed on top of S-AVAs using Multiple Instance Learning, and codebook-based features are generated for semantic concept detection. We extensively evaluate our algorithm on Kodak's consumer benchmark video set from real users. Experimental results confirm significant performance improvements: over 120% MAP gain compared to alternative approaches using static region segmentation without temporal tracking. The joint audio-visual features also outperform visual features alone by an average of 8.5% (in terms of AP) over 21 concepts, with many concepts gaining more than 20%.
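The AP and MAP figures quoted in this abstract are ranking metrics. As a rough illustration only, the sketch below computes non-interpolated average precision for one concept and averages it over concepts to obtain MAP. The function names and the toy scores are assumptions made for this example, and the paper's exact evaluation protocol may differ.

```python
def average_precision(scores, labels):
    """Non-interpolated AP: mean of the precision measured at each relevant rank.

    scores: detector confidences, one per video (higher means more confident).
    labels: 1 if the video truly contains the concept, else 0.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precisions.append(hits / rank)
    # By convention here, a concept with no positive examples scores 0.
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(per_concept):
    """MAP: the plain average of AP over a list of (scores, labels) pairs."""
    aps = [average_precision(scores, labels) for scores, labels in per_concept]
    return sum(aps) / len(aps)


# Toy data (hypothetical): two concepts, a handful of videos each.
toy = [
    ([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]),
    ([0.9, 0.1], [0, 1]),
]
print(mean_average_precision(toy))
```

A relative "MAP gain" such as the one reported would then be the difference between two systems' MAP values divided by the baseline MAP.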
Scene understanding: perception, multi-sensor fusion, spatio-temporal reasoning and activity recognition.
2007
O.: Violent flows: Real-time detection of violent crowd behavior
2012
Cited by 1 (1 self)
Although surveillance video cameras are now widely used, their effectiveness is questionable. Here, we focus on the challenging task of monitoring crowded events for outbreaks of violence. Such scenes require a human surveyor to monitor multiple video screens, presenting crowds of people in a constantly changing sea of activity, and to identify signs of breaking violence early enough to alert help. With this in mind, we propose the following contributions: (1) We describe a novel approach to real-time detection of breaking violence in crowded scenes. Our method considers statistics of how flow-vector magnitudes change over time. These statistics, collected for short frame sequences, are represented using the VIolent Flows (ViF) descriptor. ViF descriptors are then classified as either violent or non-violent using a linear SVM. (2) We present a unique data set of real-world surveillance videos, along with standard benchmarks designed to test both violent/non-violent classification and real-time detection accuracy. Finally, (3) we provide empirical tests, comparing our method to state-of-the-art techniques and demonstrating its effectiveness.
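The idea of summarizing how flow-vector magnitudes change over a short frame sequence can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes the optical-flow magnitudes have already been computed elsewhere, it uses a single histogram over the whole frame rather than any spatial partitioning, and the function name, the mean-change threshold, and the bin count are assumptions made for this example. The linear SVM stage is not shown.

```python
def vif_like_descriptor(magnitudes, bins=4):
    """Summarize temporal changes in flow magnitude for one short clip.

    magnitudes: list of T frames (T >= 2), each an H x W nested list holding
                the optical-flow magnitude at every pixel (assumed precomputed).
    Returns a normalized histogram with `bins` entries that sums to 1.
    """
    num_frames = len(magnitudes)
    if num_frames < 2:
        raise ValueError("need at least two frames to measure change")
    height, width = len(magnitudes[0]), len(magnitudes[0][0])

    # Per-pixel fraction of frame transitions with a "significant" change.
    change_freq = [[0.0] * width for _ in range(height)]
    for t in range(1, num_frames):
        change = [
            [abs(magnitudes[t][y][x] - magnitudes[t - 1][y][x]) for x in range(width)]
            for y in range(height)
        ]
        # Assumed adaptive threshold: the mean change over this frame pair.
        threshold = sum(map(sum, change)) / (height * width)
        for y in range(height):
            for x in range(width):
                if change[y][x] > threshold:
                    change_freq[y][x] += 1.0 / (num_frames - 1)

    # Histogram of the per-pixel frequencies, which lie in [0, 1].
    hist = [0] * bins
    for row in change_freq:
        for value in row:
            hist[min(int(value * bins), bins - 1)] += 1
    return [count / (height * width) for count in hist]


# Hypothetical 3-frame, 2x2 clip in which a single pixel flickers.
clip = [[[0, 0], [0, 0]], [[5, 0], [0, 0]], [[0, 0], [0, 0]]]
print(vif_like_descriptor(clip))
```

Descriptors of this kind, one per short clip, would be the feature vectors handed to the binary classifier.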
Activity Recognition Using a Combination of Category Components and Local Models for Video Surveillance
This paper presents a novel approach for automatic recognition of human activities for video surveillance applications. We propose to represent an activity by a combination of category components and demonstrate that this approach offers the flexibility to add new activities to the system and the ability to build models for activities that lack training data. To improve recognition accuracy, a confident-frame-based recognition algorithm is also proposed, in which the video frames with high confidence for recognizing an activity are used as a specialized local model to help classify the remainder of the video frames. Experimental results show the effectiveness of the proposed approach. Index Terms: category components, event detection, local model, video surveillance.
Anti-social Behavior Detection in Audio-Visual Surveillance Systems
In this paper we propose a general-purpose framework for detection of unusual events. The proposed system is based on the unsupervised method for unusual scene detection in webcam images that was introduced in [1]. We extend their algorithm to accommodate data from different modalities and introduce the concept of time-space blocks. In addition, we evaluate early and late fusion techniques for our audio-visual data features. The experimental results on 192 hours of data show that fusing audio and video outperforms using a single modality.
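The distinction between early and late fusion mentioned in this abstract can be illustrated with a minimal sketch. It shows only the general idea: early fusion joins the per-modality feature vectors before any scoring, while late fusion scores each modality separately and then combines the scores. The linear scorer, the weights, and all function names are hypothetical stand-ins, not the unsupervised detector used in the paper.

```python
def linear_score(features, weights, bias=0.0):
    """Hypothetical stand-in for any per-sample scoring model."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def early_fusion(audio_features, video_features):
    """Early fusion: concatenate modality features into one vector."""
    return list(audio_features) + list(video_features)


def late_fusion(audio_score, video_score, audio_weight=0.5):
    """Late fusion: weighted average of scores produced per modality."""
    return audio_weight * audio_score + (1.0 - audio_weight) * video_score


# Toy features for one time-space block (values are made up).
audio = [0.2, 0.4]
video = [0.1, 0.3, 0.5]

# Early: one model sees the joint vector.
joint = early_fusion(audio, video)
early_result = linear_score(joint, [1.0] * len(joint))

# Late: one model per modality, then the scores are merged.
late_result = late_fusion(
    linear_score(audio, [1.0, 1.0]),
    linear_score(video, [1.0, 1.0, 1.0]),
    audio_weight=0.5,
)
print(early_result, late_result)
```

Early fusion lets a single model exploit correlations between modalities, whereas late fusion keeps the modalities independent until the decision stage and makes it easy to reweight or drop one of them.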
EVIDENCE FEED FORWARD HIDDEN MARKOV MODEL: A NEW TYPE OF HIDDEN MARKOV MODEL
The ability to predict the intentions of people based solely on their visual actions is a skill only performed by humans and animals. The intelligence of current computer algorithms has not reached this level of complexity, but several research efforts are working towards it. With the number of classification algorithms available, it is hard to determine which algorithm works best for a particular situation. In classification of visual human intent data, Hidden Markov Models (HMMs) and their variants are leading candidates. The inability of HMMs to provide a probability for the observation-to-observation linkages is a major shortcoming of this classification technique. When a person visually identifies an action of another person, they monitor patterns in the observations. By estimating the next observation, people are able to summarize the actions and thus determine, with reasonably good accuracy, the intention of the person performing the action. These visual cues and linkages are important in creating intelligent algorithms for determining human actions based on visual observations. The Evidence Feed Forward Hidden Markov Model is a newly developed algorithm which provides observation-to-observation linkages. The following research addresses the theory behind Evidence Feed Forward HMMs, provides mathematical proofs for learning their parameters to optimize the likelihood of observations with an Evidence Feed Forward HMM, which is important in all computational intelligence algorithms, and gives comparative examples with standard HMMs in classification of both visual action data and measurement data, thus providing a strong base for Evidence Feed Forward HMMs in classification of many types of problems.
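For context, the standard HMM baseline that this abstract compares against classifies a sequence by computing its likelihood under each candidate model and picking the best one. The sketch below shows that baseline using the forward algorithm. It covers the standard HMM only; the observation-to-observation terms of the Evidence Feed Forward variant are not specified in the abstract and are not modeled here. The model parameters and names are made-up illustrations.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete standard HMM via the forward algorithm.

    obs:   non-empty list of observation symbol indices.
    start: start[i] is the probability of beginning in state i.
    trans: trans[i][j] is the probability of moving from state i to state j.
    emit:  emit[i][k] is the probability that state i emits symbol k.
    Note: no scaling is applied, so very long sequences may underflow.
    """
    n = len(start)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[i] * trans[i][j] for i in range(n)) * emit[j][symbol]
            for j in range(n)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the name of the model that gives obs the highest likelihood."""
    return max(models, key=lambda name: forward_likelihood(obs, *models[name]))


# Two hypothetical single-state action models over a binary symbol alphabet.
models = {
    "action_a": ([1.0], [[1.0]], [[0.9, 0.1]]),
    "action_b": ([1.0], [[1.0]], [[0.1, 0.9]]),
}
print(classify([0, 0, 1, 0], models))
```

In this standard formulation each observation depends only on the current hidden state, which is the limitation the abstract attributes to ordinary HMMs.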
Audio-Visual Event Classification via Spatial-Temporal-Audio Words
In this paper, we propose a generative model-based approach for audio-visual event classification. This approach is based on a new unsupervised learning method using an extended probabilistic Latent Semantic Analysis (pLSA) model. We represent each video clip as a collection of spatial-temporal-audio words, which are generated by fusing the visual and audio features using the pLSA model. Each audio-visual event class is treated as a latent topic in this model. The probability distributions of the spatial-temporal-audio words are learnt from training examples, which include a sequence of videos that represent different types of audio-visual events. Experimental results show the effectiveness of the proposed approach.
Calibration of a Binocular-Binaural Sensor Using a Moving Audio-Visual Target
ISSN 0249-6399, ISRN INRIA/RR--7865--FR+ENG
"... research report ..."

