We perceive the real world by integrating cues from multiple modalities of perception. Our mental representation is based not just on what we see, but also on the sounds we hear, what we smell, and other sensory inputs. For example, fireworks are perceived and remembered as bright flashes of light, soon followed by loud explosions, concussion waves, and the smell of gunpowder.
In contrast, conventional computer recognition systems operate only on signals acquired in a single input modality, e.g., either visual signals or audio signals. Patterns recognized separately in different domains are sometimes combined heuristically after processing. That approach presumes a prior understanding of how the different signals are related; see Hershey et al., “Using audio-visual synchrony to locate sounds,” Advances in Neural Information Processing Systems 12, MIT Press, Cambridge, Mass., 1999; Slaney et al., “Facesync: A linear operator for measuring synchronization of video facial images and audio tracks,” Advances in Neural Information Processing Systems 13, MIT Press, Cambridge, Mass., 2000; and Fisher et al., “Learning joint statistical models for audio-visual fusion and segregation,” Advances in Neural Information Processing Systems 13, MIT Press, Cambridge, Mass., 2001.
It is desired to provide a system and method for detecting events represented by multiple input modes with a single processing method, without a prior understanding of how the events are related.