This invention describes a method and apparatus for combining visual and auditory saliency maps into a format that is usable by a robotic agent.
The invention allows identification of high-saliency targets where the targets originate from optical or auditory sensors. Each sensor's data can be independently processed into a saliency map. The methods and apparatus described herein allow fusion of the independent saliency maps into a single, fused multimodal saliency map that is represented in a common coordinate system. This fused saliency map can then be used to determine the most salient targets as well as for subsequent active control of a hardware device.
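The fusion step can be illustrated with a minimal sketch. The text above does not specify the common coordinate system or the fusion rule, so everything concrete here is an assumption made for illustration: both maps are taken to be already resampled onto the same grid (for example, elevation rows by azimuth columns), each is rescaled to [0, 1], and the two are combined by a weighted sum. The names `normalize`, `fuse_saliency_maps`, `w_visual`, and `w_auditory` are hypothetical.

```python
from typing import List

# Assumed layout: rows are elevation bins, columns are azimuth bins.
Grid = List[List[float]]


def normalize(grid: Grid) -> Grid:
    """Rescale a map to [0, 1]; a constant map becomes all zeros."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        return [[0.0 for _ in row] for row in grid]
    return [[(v - lo) / (hi - lo) for v in row] for row in grid]


def fuse_saliency_maps(visual: Grid, auditory: Grid,
                       w_visual: float = 0.5,
                       w_auditory: float = 0.5) -> Grid:
    """Weighted sum of two maps that already share one grid (assumption)."""
    same_shape = len(visual) == len(auditory) and all(
        len(rv) == len(ra) for rv, ra in zip(visual, auditory))
    if not same_shape:
        raise ValueError("maps must share the same grid shape")
    v, a = normalize(visual), normalize(auditory)
    return [[w_visual * x + w_auditory * y for x, y in zip(rv, ra)]
            for rv, ra in zip(v, a)]


# Illustrative values only: a location that is salient in both
# modalities ends up with the highest fused score.
visual_map = [[0.0, 2.0], [4.0, 0.0]]
auditory_map = [[0.0, 0.0], [1.0, 0.0]]
fused_map = fuse_saliency_maps(visual_map, auditory_map)
```

Normalizing each map before summing keeps one modality from dominating purely because of its numeric range; other fusion rules (maximum, product, learned weights) would fit the same interface.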
It is well known that there is an initial stimulus-driven mechanism that provides weighted representations of sensory scenes, biasing perception toward salient stimuli, i.e., those which are more likely to attract attention or which will be easier to detect. Under this mechanism, some features in a scene are conspicuous in their context and, hence, are salient and attract attention; for example, a red car on a highway or a police car's siren amid rush-hour noise.
The concept of saliency maps has been proposed [1-6] to explain the mechanisms underlying the selection of salient stimuli. These saliency maps employ the hierarchical and parallel extraction of different features and build on existing understanding of sensory processing. For the visual system, such models were shown to replicate several properties of human overt attention [1-4]. More recently, such models have also been proposed for the auditory system [6]. Each of these methods produces a saliency map in a coordinate system suited to the modality of the sensor: the visual saliency map represents the visual space in pixels (camera/eye coordinates), while the auditory saliency map employs a frequency-time coordinate space. In addition, saliency typically includes the concepts of priority and queuing. Some auditory saliency maps in the prior art [10] do not include priority or queuing and, as a result, have difficulty processing multiple targets.
Before targets can be selected based on saliency, the targets in each type of map must be combined into one map, and their saliency in that map determined. There is therefore a need for a method to combine various saliency maps into one such that targets of interest can be identified and prioritized.
This invention describes a computer program product and method for finding salient regions using visual and auditory sensors, determining the saliency of targets in each sensor's space, then fusing the separate saliency maps into one. This single, multimodal saliency map uses a common coordinate system and can be used to determine primary and secondary foci of attention as well as for active control of a hardware device. Such a fused saliency map and associated methods would be useful for robot-based applications in a multi-sensory environment.
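One way primary and secondary foci of attention might be read off a fused map is sketched below. This is not taken from the text above: it assumes the fused map is a small 2-D grid of scores and borrows a simple peak-picking loop with neighborhood suppression (in the spirit of inhibition of return) so that the second focus is not just a neighbor of the first. The function name `select_foci` and the parameters `count` and `inhibit_radius` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]


def select_foci(fused: Grid, count: int = 2,
                inhibit_radius: int = 1) -> List[Tuple[int, int]]:
    """Return up to `count` (row, col) peaks in descending saliency order."""
    work = [list(row) for row in fused]  # copy; the caller's map is untouched
    foci: List[Tuple[int, int]] = []
    for _ in range(count):
        cells = [(work[r][c], r, c)
                 for r in range(len(work)) for c in range(len(work[r]))]
        if not cells:
            break
        value, r0, c0 = max(cells, key=lambda cell: cell[0])
        if value == float("-inf"):
            break  # every location has already been suppressed
        foci.append((r0, c0))
        # Suppress the neighborhood of the chosen focus so the next
        # pass finds a spatially distinct target.
        for r in range(max(0, r0 - inhibit_radius),
                       min(len(work), r0 + inhibit_radius + 1)):
            for c in range(max(0, c0 - inhibit_radius),
                           min(len(work[r]), c0 + inhibit_radius + 1)):
                work[r][c] = float("-inf")
    return foci


# Illustrative scores only: the first entry is the primary focus,
# the second is the secondary focus.
example_map = [
    [0.1, 0.2, 0.0, 0.0],
    [0.3, 1.0, 0.0, 0.6],
    [0.0, 0.2, 0.0, 0.5],
]
foci = select_foci(example_map)
```

The ordered list doubles as a simple priority queue of targets; a controller could steer a camera or microphone toward the first entry and fall back to the next when that target has been serviced.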