The ease with which individuals can carry on a conversation in the midst of noise is often taken for granted. Sounds from different sources coalesce and obscure each other, making it difficult to resolve what is heard into its constituent parts and to identify the source and content of each. This auditory scene analysis problem confounds current automatic speech recognition systems, which can fail to recognize speech in the presence of even very small amounts of interfering noise. For humans, vision often plays a crucial role, because listeners frequently have an unobstructed view of the lips that modulate the sound. In fact, lip-reading can enhance human speech recognition as much as removing 15 dB of noise. This fact has motivated efforts to use video information for audio-visual scene analysis tasks such as speech recognition and speaker detection. Such systems have typically been built from separate modules for tasks such as tracking the lips, extracting features, and detecting speech components, where each module is independently designed to be invariant to different speaker characteristics, lighting conditions, and noise conditions.
One problem with modular systems designed for a variety of conditions is that there is typically a tradeoff between average performance across conditions and performance in any one condition. Thus, for example, a system that can adapt to a face under the current lighting condition may perform better than one designed for a variety of conditions without adaptation. Another pitfall of modular audio-visual systems is that the modules may be integrated in an ad hoc way that neglects both the uncertainty within each model and the statistical dependencies between the modalities. The two problems are related, in that unsupervised adaptation is greatly facilitated by enforcing agreement between the audio and video modules during adaptation.
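To make the point about uncertainty concrete, the following is a minimal sketch and not the method developed in this paper. It assumes a hypothetical scalar quantity (for instance, a speaker's horizontal position) estimated separately by an audio module and a video module, each reporting a Gaussian mean and variance. The function names are illustrative, and the sketch treats the two estimates as independent, which is itself a simplification of the kind criticized above. It contrasts an ad hoc average, which ignores each module's uncertainty, with a precision-weighted combination.

```python
def naive_average(mu_audio, mu_video):
    """Ad hoc fusion: ignores how uncertain each module is."""
    return 0.5 * (mu_audio + mu_video)


def fuse_gaussian(mu_audio, var_audio, mu_video, var_video):
    """Precision-weighted fusion of two Gaussian estimates.

    Assumes (for illustration only) that the two estimates are
    independent given the true value. Returns (mean, variance).
    """
    if var_audio <= 0 or var_video <= 0:
        raise ValueError("variances must be positive")
    prec_audio = 1.0 / var_audio   # precision = inverse variance
    prec_video = 1.0 / var_video
    var = 1.0 / (prec_audio + prec_video)
    mu = var * (prec_audio * mu_audio + prec_video * mu_video)
    return mu, var


if __name__ == "__main__":
    # Hypothetical numbers: a noisy audio estimate and a sharper video estimate.
    mu, var = fuse_gaussian(mu_audio=2.0, var_audio=4.0,
                            mu_video=1.0, var_video=0.25)
    print("naive average:", naive_average(2.0, 1.0))
    print("precision-weighted mean:", mu, "variance:", var)
```

With these hypothetical inputs, the precision-weighted mean stays close to the more reliable video estimate, and the fused variance is smaller than either input variance, whereas the plain average gives the noisy audio estimate equal weight.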