Results 1 -
7 of
7
State-of-the-Art in visual attention Modeling
- IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE
, 2010
"... Modeling visual attention — particularly stimulus-driven, saliency-based attention — has been a very active research area over the past 25 years. Many different models of attention are now available, which aside from lending theoretical contributions to other fields, have demonstrated successful ap ..."
Abstract
-
Cited by 6 (4 self)
- Add to MetaCart
Modeling visual attention — particularly stimulus-driven, saliency-based attention — has been a very active research area over the past 25 years. Many different models of attention are now available, which aside from lending theoretical contributions to other fields, have demonstrated successful applications in computer vision, mobile robotics, and cognitive systems. Here we review, from a computational perspective, the basic concepts of attention implemented in these models. We present a taxonomy of nearly 65 models, which provides a critical comparison of approaches, their capabilities, and shortcomings. In particular, thirteen criteria derived from behavioral and computational studies are formulated for qualitative comparison of attention models. Furthermore, we address several challenging issues with models, including biological plausibility of the computations, correlation with eye movement datasets, bottom-up and top-down dissociation, and constructing meaningful performance measures. Finally, we highlight current research trends in attention modeling and provide insights for future.
Quantitative Analysis of Human-Model Agreement in Visual Saliency Modeling: A Comparative Study
"... Abstract—Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from thei ..."
Abstract
-
Cited by 2 (2 self)
- Add to MetaCart
Abstract—Visual attention is a process that enables biological and machine vision systems to select the most relevant regions from a scene. Relevance is determined by two components: 1) top-down factors driven by task and 2) bottom-up factors that highlight image regions that are different from their surroundings. The latter are often referred to as “visual saliency”. Modeling bottom-up visual saliency has been the subject of numerous research efforts during the past 20 years, with many successful applications in computer vision and robotics. Available models have been tested with different datasets (e.g., synthetic psychological search arrays, natural images or videos) using different evaluation scores (e.g., search slopes, comparison to human eye tracking) and parameter settings. This has made direct comparison of models difficult. Here we perform an exhaustive comparison of 35 state-of-the-art saliency models over 54 challenging synthetic patterns, 3 natural image datasets, and 2 video datasets, using 3 evaluation scores. We find that although model rankings vary, some models consistently perform better. Analysis of datasets reveals that existing datasets are highly center-biased, which influences some of the evaluation scores. Computational complexity analysis shows that some models are very fast, yet yield competitive eye movement prediction accuracy. Different models often have common easy/difficult stimuli. Furthermore, several concerns in visual saliency modeling, eye movement datasets, and evaluation scores are discussed and insights for future work are provided. Our study allows one to assess the state-of-the-art, helps organizing this rapidly growing field, and sets a unified comparison framework for gauging future efforts, similar to the PASCAL VOC challenge in the object recognition and detection domains. Index Terms—Visual attention, Visual saliency, Bottom-up attention, Eye movement prediction, Model comparison.
Mobile Robot Vision Navigation & Localization Using Gist and Saliency
"... Abstract — We present a vision-based navigation and localization system using two biologically-inspired scene understanding models which are studied from human visual capabilities: (1) Gist model which captures the holistic characteristics and layout of an image and (2) Saliency model which emulates ..."
Abstract
-
Cited by 1 (1 self)
- Add to MetaCart
Abstract — We present a vision-based navigation and localization system using two biologically-inspired scene understanding models which are studied from human visual capabilities: (1) Gist model which captures the holistic characteristics and layout of an image and (2) Saliency model which emulates the visual attention of primates to identify conspicuous regions in the image. Here the localization system utilizes the gist features and salient regions to accurately localize the robot, while the navigation system uses the salient regions to perform visual feedback control to direct its heading and go to a user-provided goal location. We tested the system on our robot, Beobot2.0, in an indoor and outdoor environment with a route length of 36.67m (10,890 video frames) and 138.27m (28,971 frames), respectively. On average, the robot is able to drive within 3.68cm and 8.78cm (respectively) of the center of the lane. I.
Learning Pre-attentive Driving Behaviour from Holistic Visual Features
"... Abstract. The aim of this paper is to learn driving behaviour by associating the actions recorded from a human driver with pre-attentive visual input, implemented using holistic image features (GIST). All images are labelled according to a number of driving–relevant contextual classes (eg, road type ..."
Abstract
-
Cited by 1 (0 self)
- Add to MetaCart
Abstract. The aim of this paper is to learn driving behaviour by associating the actions recorded from a human driver with pre-attentive visual input, implemented using holistic image features (GIST). All images are labelled according to a number of driving–relevant contextual classes (eg, road type, junction) and the driver’s actions (eg, braking, accelerating, steering) are recorded. The association between visual context and the driving data is learnt by Boosting decision stumps, that serve as input dimension selectors. Moreover, we propose a novel formulation of GIST features that lead to an improved performance for action prediction. The areas of the visual scenes that contribute to activation or inhibition of the predictors is shown by drawing activation maps for all learnt actions. We show good performance not only for detecting driving–relevant contextual labels, but also for predicting the driver’s actions. The classifier’s false positives and the associated activation maps can be used to focus attention and further learning on the uncommon and difficult situations. 1
BORJI, et al.: COMPUTATIONAL MODELING OF TOP-DOWN VISUAL ATTENTION 1 Computational Modeling of Top-down Visual Attention in Interactive Environments
"... Modeling how visual saliency guides the deployment of attention over visual scenes has attracted much interest recently — among both computer vision and experimental/computational researchers — since visual attention is a key function of both machine and biological vision systems. Research efforts i ..."
Abstract
- Add to MetaCart
Modeling how visual saliency guides the deployment of attention over visual scenes has attracted much interest recently — among both computer vision and experimental/computational researchers — since visual attention is a key function of both machine and biological vision systems. Research efforts in computer vision have mostly been focused on modeling bottom-up saliency. Strong influences on attention and eye movements, however, come from instantaneous task demands. Here, we propose models of top-down visual guidance considering task influences. The new models estimate the state of a human subject performing a task (here, playing video games), and map that state to an eye position. Factors influencing state come from scene gist, physical actions, events, and bottom-up saliency. Proposed models fall into two categories. In the first category, we use classical discriminative classifiers, including Regression, kNN and SVM. In the second category, we use Bayesian Networks to combine all the multi-modal factors in a unified framework. Our approaches significantly outperform 15 competing bottom-up and top-down attention models in predicting future eye fixations on 18,000 and 75,00 video frames and eye movement samples from a driving and a flight combat video game, respectively. We further test and validate our approaches on 1.4M video frames and 11M fixations samples and in all cases obtain higher prediction scores that reference models. 1
What/Where to Look Next? Modeling Top-down Visual Attention in Complex Interactive Environments
"... Abstract—Several visual attention models have been proposed for describing eye movements over simple stimuli and tasks such as free viewing or visual search. Yet to date, there exists no computational framework that can reliably mimic human gaze behavior in more complex environments and tasks such a ..."
Abstract
- Add to MetaCart
Abstract—Several visual attention models have been proposed for describing eye movements over simple stimuli and tasks such as free viewing or visual search. Yet to date, there exists no computational framework that can reliably mimic human gaze behavior in more complex environments and tasks such as urban driving. Additionally, benchmark datasets, scoring techniques, and top-down model architectures are not yet well understood. In this study, we describe new task-dependent approaches for modeling top-down overt visual attention based on graphical models for probabilistic inference and reasoning. We describe a Dynamic Bayesian Network (DBN) that infers probability distributions over attended objects and spatial locations directly from observed data. Probabilistic inference in our model is performed over object-related functions which are fed from manual annotations of objects in video scenes or by state-ofthe-art object detection/recognition algorithms. Evaluating over ∼3 hours (appx. 315, 000 eye fixations and 12, 600 saccades) of observers playing 3 video games (time-scheduling, driving, and flight combat), we show that our approach is significantly more predictive of eye fixations compared to: (1) simpler classifierbased models also developed here that map a signature of a scene (multi-modal information from gist, bottom-up saliency, physical actions, and events) to eye positions, (2) 14 state-of-theart bottom-up saliency models, and (3) brute-force algorithms such as mean eye position. Our results show that the proposed model is more effective in employing and reasoning over spatiotemporal visual data compared with the state-of-the-art. Index Terms—Visual attention, Bottom-up saliency, , top-down attention, Gaze prediction, eye movement prediction, Interactive

