To improve speech recognition performance, it has been proposed to augment the recognition of spoken utterances with visual data. Such visual data, e.g., images of the mouth (lip) region of the speaker, is typically captured (via a camera) contemporaneously with the capture (via a microphone) of the spoken utterances.
In fact, canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or “visemes.” Visemes can provide information that complements the phonetic stream from the point of view of confusability. By way of example, “mi” and “ni”, which are confusable acoustically, especially in noisy situations, are easy to distinguish visually, i.e., in “mi”, the lips close at onset, whereas in “ni”, they do not. By way of further example, the unvoiced fricatives “f” and “s”, which are difficult to distinguish acoustically, may belong to two different viseme groups. Thus, an audio-visual speech recognition system advantageously utilizes joint audio-visual data models to decode (recognize) input utterances.
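By way of illustration only, the complementary role of visemes may be sketched as a lookup from phonemes to viseme groups. The table below is a hypothetical, simplified grouping assumed for this example; the group names, the table `VISEME_OF`, and the function `visually_distinguishable` are not taken from any particular system, and practical viseme inventories differ.

```python
# Hypothetical, simplified phoneme-to-viseme table (an assumption made for
# illustration; real viseme inventories vary between systems).
VISEME_OF = {
    "p": "bilabial", "b": "bilabial", "m": "bilabial",    # lips close at onset
    "f": "labiodental", "v": "labiodental",               # lower lip meets upper teeth
    "t": "alveolar", "d": "alveolar", "n": "alveolar",    # lips remain open
    "s": "alveolar_fricative", "z": "alveolar_fricative",
}


def visually_distinguishable(phoneme_a: str, phoneme_b: str) -> bool:
    """Return True when the two phonemes fall into different viseme groups."""
    return VISEME_OF[phoneme_a] != VISEME_OF[phoneme_b]


if __name__ == "__main__":
    # "m" and "n" are acoustically confusable but differ in lip closure.
    print(visually_distinguishable("m", "n"))
    # "f" and "s" are assumed here to fall into different viseme groups.
    print(visually_distinguishable("f", "s"))
    # "p" and "b" share a group, so the visual stream alone cannot separate them.
    print(visually_distinguishable("p", "b"))
```

Under this assumed grouping, a pair that is confusable in the audio stream may be separated by the visual stream, while a pair sharing a viseme group must still be resolved acoustically.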
However, when implementing an audio-visual speech recognition system, the respective conditions of the individual acoustic and visual signals being captured ultimately determine how accurately speech recognition can be performed. Therefore, in a degraded visual environment, overall speech recognition accuracy may itself become degraded.
Thus, techniques are needed for improving audio-visual speech recognition performance in a degraded visual environment.