Results 1 - 10 of 67
State-of-the-Art in Visual Attention Modeling
- IEEE Transactions on Pattern Analysis and Machine Intelligence, 2010
"... Modeling visual attention — particularly stimulus-driven, saliency-based attention — has been a very active research area over the past 25 years. Many different models of attention are now available, which aside from lending theoretical contributions to other fields, have demonstrated successful ap ..."
Abstract
-
Cited by 99 (8 self)
- Add to MetaCart
(Show Context)
Modeling visual attention — particularly stimulus-driven, saliency-based attention — has been a very active research area over the past 25 years. Many different models of attention are now available which, aside from lending theoretical contributions to other fields, have demonstrated successful applications in computer vision, mobile robotics, and cognitive systems. Here we review, from a computational perspective, the basic concepts of attention implemented in these models. We present a taxonomy of nearly 65 models, which provides a critical comparison of approaches, their capabilities, and shortcomings. In particular, thirteen criteria derived from behavioral and computational studies are formulated for qualitative comparison of attention models. Furthermore, we address several challenging issues with models, including biological plausibility of the computations, correlation with eye movement datasets, bottom-up and top-down dissociation, and constructing meaningful performance measures. Finally, we highlight current research trends in attention modeling and provide insights for the future.
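One concrete instance of the performance measures the survey discusses is the Normalized Scanpath Saliency (NSS), which scores a saliency map by the z-scored values it assigns to human fixation locations. A minimal sketch in Python; the map, fixation list, and variable names are illustrative, not taken from the paper:

    import numpy as np

    def nss(saliency_map, fixations):
        """Normalized Scanpath Saliency: mean of the z-scored saliency
        values sampled at human fixation locations (higher is better)."""
        s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-12)
        return float(np.mean([s[r, c] for r, c in fixations]))

    # A map that peaks where the observer actually looked scores well above 0.
    smap = np.zeros((64, 64))
    smap[30:34, 30:34] = 1.0
    print(nss(smap, [(31, 31), (32, 32)]))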
Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study
"... Abstract—Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from thei ..."
Abstract
-
Cited by 41 (7 self)
- Add to MetaCart
(Show Context)
Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that differ from their surroundings. The latter are often referred to as “visual saliency”. Modeling bottom-up visual saliency has been the subject of numerous research efforts during the past 20 years, with many successful applications in computer vision and robotics. Available models have been tested with different datasets (e.g., synthetic psychological search arrays, natural images, or videos) using different evaluation scores (e.g., search slopes, comparison to human eye tracking) and parameter settings, which has made direct comparison of models difficult. Here we perform an exhaustive comparison of 35 state-of-the-art saliency models over 54 challenging synthetic patterns, 3 natural image datasets, and 2 video datasets, using 3 evaluation scores. We find that although model rankings vary, some models consistently perform better. Analysis of datasets reveals that existing datasets are highly center-biased, which influences some of the evaluation scores. Computational complexity analysis shows that some models are very fast yet yield competitive eye movement prediction accuracy. Different models often share common easy and difficult stimuli. Furthermore, several concerns in visual saliency modeling, eye movement datasets, and evaluation scores are discussed and insights for future work are provided. Our study allows one to assess the state of the art, helps organize this rapidly growing field, and sets a unified comparison framework for gauging future efforts, similar to the PASCAL VOC challenge in the object recognition and detection domains.
Index Terms — Visual attention, visual saliency, bottom-up attention, eye movement prediction, model comparison.
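The center bias reported above can be made tangible with a model-free baseline: a fixed, centered Gaussian scores surprisingly well on center-biased eye-tracking datasets. A minimal sketch, where the map size and the sigma fraction are arbitrary illustrative choices:

    import numpy as np

    def center_prior(h, w, sigma_frac=0.25):
        """Fixed centered Gaussian 'saliency' map: a model-free baseline that
        exploits the center bias of typical eye-tracking datasets."""
        ys, xs = np.mgrid[0:h, 0:w]
        cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
        sigma2 = (sigma_frac * min(h, w)) ** 2
        g = np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2.0 * sigma2))
        return g / g.max()

    baseline = center_prior(480, 640)  # scores competitively on biased data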
Quaternion-based Spectral Saliency Detection for Eye Fixation Prediction
"... Abstract In recent years, several authors have reported that spectral saliency detection methods provide state-of-the-art performance in predicting human gaze in images (see, e.g., [1 3]). We systematically integrate and evaluate quaternion DCT- and FFT-based spectral saliency detection [3,4], weigh ..."
Abstract
-
Cited by 9 (2 self)
- Add to MetaCart
(Show Context)
In recent years, several authors have reported that spectral saliency detection methods provide state-of-the-art performance in predicting human gaze in images (see, e.g., [1-3]). We systematically integrate and evaluate quaternion DCT- and FFT-based spectral saliency detection [3,4], weighted quaternion color space components [5], and the use of multiple resolutions [1]. Furthermore, we propose the use of the eigenaxes and eigenangles for spectral saliency models that are based on the quaternion Fourier transform. We demonstrate the outstanding performance of these models on the Bruce-Tsotsos (Toronto), Judd (MIT), and Kootstra-Schomaker eye-tracking data sets.
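The FFT-based spectral saliency these models generalize can be illustrated with the single-channel phase-only transform: keep only the phase spectrum, invert, square, and smooth (the paper's quaternion variants apply the same idea jointly across color components). A grayscale sketch with NumPy/SciPy; the smoothing sigma is an illustrative choice, not the paper's setting:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def phase_spectrum_saliency(gray, sigma=3.0):
        """Phase-only Fourier transform (PFT) saliency: discard the amplitude
        spectrum, keep the phase, invert, square, and smooth."""
        f = np.fft.fft2(gray.astype(np.float64))
        phase_only = np.exp(1j * np.angle(f))        # unit amplitude everywhere
        recon = np.abs(np.fft.ifft2(phase_only)) ** 2
        sal = gaussian_filter(recon, sigma)
        return sal / sal.max()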
Multimodal Saliency-based Attention for Object-based Scene Analysis
"... Abstract — Multimodal attention is a key requirement for humanoid robots in order to navigate in complex environments and act as social, cognitive human partners. To this end, robots have to incorporate attention mechanisms that focus the processing on the potentially most relevant stimuli while con ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
(Show Context)
Multimodal attention is a key requirement for humanoid robots in order to navigate complex environments and act as social, cognitive partners to humans. To this end, robots have to incorporate attention mechanisms that focus processing on the potentially most relevant stimuli while controlling the sensor orientation to improve the perception of these stimuli. In this paper, we present our implementation of audio-visual saliency-based attention, which we integrated into a system for knowledge-driven audio-visual scene analysis and object-based world modeling. For this purpose, we introduce a novel isophote-based method for proto-object segmentation of saliency maps, a surprise-based auditory saliency definition, and a parametric 3-D model for multimodal saliency fusion. The applicability of the proposed system is demonstrated in a series of experiments.
Index Terms — Audio-visual saliency, auditory surprise, isophote-based visual proto-objects, parametric 3-D saliency model, object-based inhibition of return, multimodal attention, scene exploration, hierarchical object analysis, overt attention, active perception.
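The surprise-based auditory saliency mentioned here follows the Bayesian-surprise idea: an event is salient to the degree that it changes the observer's beliefs. A deliberately simplified sketch with a per-band Gaussian whose mean is updated online, scoring each observation by the KL divergence between the belief after and before the update; the Gaussian model, fixed variance, and learning rate are simplifying assumptions, not the paper's exact formulation:

    import numpy as np

    def surprise_stream(energies, eta=0.1, var=1.0):
        """Surprise of each new band energy: KL divergence between the
        Gaussian belief after and before updating its mean (fixed variance)."""
        mu = energies[0]
        out = []
        for x in energies[1:]:
            mu_new = (1 - eta) * mu + eta * x
            # KL(N(mu_new, var) || N(mu, var)) = (mu_new - mu)^2 / (2 var)
            out.append((mu_new - mu) ** 2 / (2 * var))
            mu = mu_new
        return np.array(out)

    # A sudden loud event (e.g., a fire alarm band) yields a surprise spike.
    print(surprise_stream(np.array([1, 1, 1, 9, 9, 1], dtype=float)))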
Biologically-inspired Visual Attention Features for a Vehicle Classification Task
- The International Journal on Smart Sensing and Intelligent Systems
"... Abstract- The continuous rise in the number of vehicles in circulation brings an increasing need for automatically and efficiently recognizing vehicle categories for multiple applications such as optimizing available parking spaces, balancing ferry loads, planning infrastructure and managing traffic ..."
Abstract
-
Cited by 6 (2 self)
- Add to MetaCart
(Show Context)
The continuous rise in the number of vehicles in circulation brings an increasing need to automatically and efficiently recognize vehicle categories for multiple applications, such as optimizing available parking spaces, balancing ferry loads, planning infrastructure, managing traffic, or servicing vehicles. This paper explores the use of human visual attention mechanisms to identify a set of features that allows fast automated classification of vehicles based on images taken from 6 viewpoints. Salient visual features classified with a series of binary support vector machines and complemented by a dissimilarity score achieve average classification rates between 94% and 97.3% for five-category vehicle classification, depending on the combination of viewpoints used. The viewpoints that contribute most to the classification are identified in order to decrease the implementation cost. The evaluation of performance against other feature descriptors and various ...
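The classifier stage described in the abstract, a series of binary support vector machines over five categories, maps naturally onto a one-vs-rest decomposition. A hedged sketch with scikit-learn on synthetic stand-in features; the attention-based feature extraction is the paper's contribution and is not reproduced here:

    import numpy as np
    from sklearn.multiclass import OneVsRestClassifier
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 64))      # stand-in for attention-based features
    y = rng.integers(0, 5, size=500)    # five vehicle categories

    # One binary SVM per category, combined one-vs-rest.
    clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X[:400], y[:400])
    print(clf.score(X[400:], y[400:]))  # near chance on random data, as expected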
Saliency-based identification and recognition of pointed-at objects
- In Proc. of Int. Conf. on Intelligent Robots and Systems (IROS), 2010
"... Abstract — When persons interact, non-verbal cues are used to direct the attention of persons towards objects of interest. Achieving joint attention this way is an important aspect of natural communication. Most importantly, it allows to couple verbal descriptions with the visual appearance of objec ..."
Abstract
-
Cited by 5 (1 self)
- Add to MetaCart
(Show Context)
When people interact, non-verbal cues are used to direct the attention of others towards objects of interest. Achieving joint attention this way is an important aspect of natural communication. Most importantly, it allows verbal descriptions to be coupled with the visual appearance of objects when the referred-to object is indicated non-verbally. In this contribution, we present a system that utilizes bottom-up saliency and pointing gestures to efficiently identify pointed-at objects. Furthermore, the system focuses visual attention by steering a pan-tilt-zoom camera towards the object of interest and thus provides a suitable model view for SIFT-based recognition and learning. We demonstrate the practical applicability of the proposed system through experimental evaluation in different environments with multiple pointers and objects.
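The SIFT-based recognition step at the end is standard feature matching; a minimal sketch with OpenCV (cv2.SIFT_create is available in recent opencv-python builds), using Lowe's ratio test to count correspondences between a stored model view and the zoomed camera view:

    import cv2

    def count_sift_matches(model_img, query_img, ratio=0.75):
        """Count SIFT correspondences (ratio-test survivors) between a stored
        object view and the current camera view."""
        sift = cv2.SIFT_create()
        _, d1 = sift.detectAndCompute(model_img, None)
        _, d2 = sift.detectAndCompute(query_img, None)
        if d1 is None or d2 is None:
            return 0
        pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(d1, d2, k=2)
        return sum(1 for p in pairs
                   if len(p) == 2 and p[0].distance < ratio * p[1].distance)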
Computational visual attention
- Computer Analysis of Human Behavior (to appear), Advances in Pattern Recognition, 2011
"... Visual attention is one of the key mechanisms of perception that enables humans to efficiently select the visual data of most potential interest. Machines face similar challenges as humans: they have to deal with a large amount of input data and have to select the most promising parts. In this chapt ..."
Abstract
-
Cited by 4 (2 self)
- Add to MetaCart
(Show Context)
Visual attention is one of the key mechanisms of perception that enables humans to efficiently select the visual data of most potential interest. Machines face similar challenges: they have to deal with a large amount of input data and have to select the most promising parts. In this chapter, we explain the underlying biological and psychophysical grounding of visual attention, show how these mechanisms can be implemented computationally, and discuss why and under what conditions machines, especially robots, profit from such a concept.

1 What Is Attention? And Do We Need Attentive Machines?

Attention is one of the key mechanisms of human perception that enables us to act efficiently in a complex world. Imagine you visit Cologne for the first time: you stroll through the streets and look around curiously. You look at the large Cologne Cathedral and at some street performers. After a while, you remember that you have to catch your train back home soon, and you start actively looking for signs to the station. You have no eye for the street performers any more. But when you enter the station, you hear a fire alarm and see people running out of the station. Immediately you forget your waiting train and join them on their way out. This scenario shows the complexity of human perception. Plenty of information is perceived at each instant, much more than can be processed in detail by the human brain. The ability to extract the relevant pieces of the sensory input at an early processing stage is crucial for efficient acting, and which part of the sensory input is relevant depends on the context. With a goal like catching a train, the signs are relevant; without an explicit goal, salient things like the street performers attract attention. Some things or events are so salient that they even override ...
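The bottom-up mechanisms the chapter describes are commonly implemented as center-surround differences across scales of a Gaussian pyramid. A stripped-down, intensity-only sketch; full systems add color and orientation channels and a normalization step, and the scale choices below are illustrative:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def intensity_saliency(gray, center_sigmas=(1, 2), surround_extra=4):
        """Center-surround intensity conspicuity: absolute difference between
        a fine and a coarse Gaussian blur, summed over a few scale pairs."""
        gray = gray.astype(np.float64)
        sal = np.zeros_like(gray)
        for s in center_sigmas:
            center = gaussian_filter(gray, s)
            surround = gaussian_filter(gray, s + surround_extra)
            sal += np.abs(center - surround)
        return sal / (sal.max() + 1e-12)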
Images as sets of locally weighted features
- Computer Vision and Image Understanding
"... This paper presents a generic framework in which images are modelled as orderless sets of weighted visual features. Each visual feature is associated with a weight factor that may inform its relevance. This framework can be applied to various bag-of-patches approaches such as the bag-of-visual-word ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
This paper presents a generic framework in which images are modelled as orderless sets of weighted visual features. Each visual feature is associated with a weight factor that may inform its relevance. This framework can be applied to various bag-of-patches approaches, such as the bag-of-visual-words or the Fisher kernel representations. We suggest that if dense sampling is used, different schemes for weighting local features can be evaluated, leading to results that are often better than the combination of multiple sampling schemes, at a much lower computational cost because the features are extracted only once. This allows our framework to serve as a test-bed for saliency estimation methods in image categorisation tasks. We explored two main possibilities for estimating local feature relevance. The first is based on saliency maps obtained from human feedback, either by gaze tracking or by mouse clicks. The method is able to profit from such maps, leading to a significant improvement in categorisation performance. The second is based on automatic saliency estimation methods, including Itti & Koch's method and SIFT's DoG detector. We evaluated the proposed framework and saliency estimation methods using an in-house dataset and the PASCAL VOC 2008/2007 datasets, showing that some of the saliency estimation methods lead to a significant performance improvement over the standard unweighted representation.
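At its core, the weighting scheme reduces to a soft-count histogram: each local feature votes for its visual word with its saliency weight instead of a unit count. A minimal sketch, assuming descriptors have already been assigned to visual words and a per-feature weight has been looked up from a saliency map:

    import numpy as np

    def weighted_bow(word_ids, weights, vocab_size):
        """Bag-of-visual-words histogram in which each local feature
        contributes its saliency weight rather than a unit count."""
        hist = np.bincount(word_ids, weights=weights, minlength=vocab_size)
        total = hist.sum()
        return hist / total if total > 0 else hist

    # Features at salient locations (weight 0.9) dominate background ones (0.1).
    print(weighted_bow(np.array([0, 0, 3]), np.array([0.9, 0.1, 0.5]), 4))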
An assisted photography method for street scenes
"... We present an interactive, computational approach for assisting users with visual impairments during photographic documentation of transit problems. Our technique can be described as a method to improve picture composition, while retaining visual information that is expected to be most relevant. Our ..."
Abstract
-
Cited by 3 (0 self)
- Add to MetaCart
(Show Context)
We present an interactive, computational approach for assisting users with visual impairments during photographic documentation of transit problems. Our technique can be described as a method to improve picture composition while retaining the visual information that is expected to be most relevant. Our system considers the position of the estimated region of interest (ROI) of a photo and the camera orientation. Saliency maps and Gestalt theory are used to guide the user towards a more balanced picture. Our current implementation for mobile phones uses optic flow to update the internal knowledge of the position of the ROI, and tilt sensor readings to correct camera orientations that are not horizontal or vertical. Using ground-truth labels, we confirmed that our method proposes valid strategies for improving image composition. Future work includes an optimized implementation and user studies.
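The ROI bookkeeping described above can be done with sparse optical flow: track a grid of points inside the ROI from frame to frame and shift the ROI center by their median displacement. A hedged sketch with OpenCV; the point sampling and failure handling are simplified, and the function and parameter names are illustrative:

    import cv2
    import numpy as np

    def update_roi_center(prev_gray, cur_gray, roi_center, half=20):
        """Shift the ROI center by the median optical-flow displacement of
        points sampled inside the ROI (pyramidal Lucas-Kanade)."""
        cx, cy = roi_center
        xs = np.linspace(cx - half, cx + half, 5)
        ys = np.linspace(cy - half, cy + half, 5)
        pts = np.array([[[x, y]] for y in ys for x in xs], dtype=np.float32)
        new_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, cur_gray,
                                                      pts, None)
        ok = status.ravel() == 1
        if not ok.any():
            return roi_center                 # tracking lost; keep old ROI
        shift = np.median((new_pts - pts)[ok], axis=0).ravel()
        return (cx + float(shift[0]), cy + float(shift[1]))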