The success of currently available speech recognition systems is restricted to relatively controlled environments and well-defined applications, such as dictation or small- to medium-vocabulary voice-based command and control (e.g., hands-free dialing). In recent years, alongside the investigation of several acoustic noise reduction techniques, the study of systems that combine audio and visual features has emerged as an attractive approach to speech recognition in less constrained environments. A number of techniques have been presented to address the audio-visual integration problem; these can be broadly grouped into feature fusion and decision fusion methods.
However, feature fusion methods can suffer from over-fitting, and decision fusion methods cannot fully capture the dependencies between the audio and video features. In an audio-visual feature fusion system, the observation vectors are obtained by concatenating the audio and visual observation vectors and then applying a dimensionality reduction transform. The resulting observation sequences are modeled using a single hidden Markov model (HMM), which cannot model the natural asynchrony between the audio and visual features. Decision fusion systems, on the other hand, model the audio and video sequences independently and enforce synchrony of the audio and visual features only at the model boundaries; as a result, they fail to capture fully the dependencies between the two streams. A feature fusion system using a multi-stream HMM assumes that the audio and video sequences are state synchronous, but allows the audio and video components to contribute differently to the overall observation likelihood.
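The difference between plain feature fusion and the multi-stream variant can be illustrated with a minimal sketch of the per-state observation score. Everything specific here is an assumption made for illustration and is not taken from the systems described above: the diagonal-covariance Gaussian state model, the function names, the convention that the two stream weights sum to one, and the toy parameter values. The dimensionality reduction step is omitted for brevity.

```python
import math


def diag_gaussian_loglik(x, mean, var):
    """Log density of x under a Gaussian with diagonal covariance."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def fused_loglik(audio, video, mean, var):
    """Feature fusion: concatenate both vectors and score them with one model."""
    return diag_gaussian_loglik(list(audio) + list(video), mean, var)


def multistream_loglik(audio, video, audio_model, video_model, audio_weight):
    """Multi-stream score: weighted sum of per-stream log-likelihoods.

    Both streams are scored against the same HMM state (state synchrony).
    audio_weight is an assumed stream exponent in [0, 1]; the video stream
    receives 1 - audio_weight (an illustrative convention).
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    audio_ll = diag_gaussian_loglik(audio, *audio_model)
    video_ll = diag_gaussian_loglik(video, *video_model)
    return audio_weight * audio_ll + (1.0 - audio_weight) * video_ll


# Toy observation for one frame: 2 audio features, 1 visual feature.
audio_obs, video_obs = [0.2, -0.1], [1.0]
audio_model = ([0.0, 0.0], [1.0, 1.0])  # (mean, variance), hypothetical values
video_model = ([1.0], [0.5])

# In clean audio the audio stream might be trusted more (weight 0.8);
# in noise the weight could be lowered so the visual stream dominates.
clean_score = multistream_loglik(audio_obs, video_obs, audio_model, video_model, 0.8)
noisy_score = multistream_loglik(audio_obs, video_obs, audio_model, video_model, 0.3)
```

In this sketch the stream weight is the only mechanism that lets the two modalities contribute differently. Both streams are still evaluated against the same state at every frame, so the asynchrony between audio and video remains unmodeled.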