The present invention relates generally to the use of visual information in speech recognition.
In speech recognition, the use of visual information has been of interest since speech recognition efficiency can be significantly improved in conditions where audio-only recognition suffers due to a noisy environment. Particularly, a main focus of recent developments has been to increase the robustness of speech recognition systems against different types of noises in the audio channel.
In this connection, it has been found that the performance of most, if not all, conventional speech recognition systems has suffered a great deal in a non-controlled environment, which may involve, for example, background noise, a bad acoustic channel characteristic, crosstalk and the like. Thus, video can play an important role in such contexts as it provides significant information about the speech that can compensate for noise in the audio channel. Furthermore, it has been observed that some amount of orthogonality is present between the audio and the video channel, and this orthogonality can be used to improve recognition efficiency by combining the two channels. The following publications are instructive in this regard: Tsuhan Chen and Ram R. Rao, “Audio-Visual Integration in Multimodal Communication”, Proceedings of IEEE, vol. 86, May 1998; H. McGurk and J. MacDonald, “Hearing Lips and seeing voices”, Nature, pp. 746-748, December 1976; and K. Green, “The use of auditory and visual information in phonetic perception”, Speechreading by Humans and Machines, D. Stork and M. Hennecke, Eds., Berlin, Germany.
Experiments have also been conducted with various features of audio and visual speech and different methods of combining the two information channels. One of the earliest audio-visual speech recognition systems was implemented by E. D. Petajan (see E. D. Petajan, “Automatic lipreading to enhance speech recognition”, Proc. IEEE Global Telecommunication Conf., Atlanta, 1984; and E. D. Petajan, B. Bischoff, D. Bodoff and N. M. Brooke, “An improved automatic lipreading system to enhance speech recognition”, Proc. CHI '88 pp. 19-25). In Petajan's experiment, binary images were used to extract mouth parameters such as height, width and area of the mouth of the speaker. These parameters were later used in the recognition system. The recognition system was an audio speech recognizer followed by a visual speech recognizer. Therefore, the visual speech recognizer would work only on a subset of all of the possible candidates which were supplied to it by the audio speech recognizer. Later, the system was modified to use the images themselves instead of the mouth parameters and the audio-visual integration strategy was changed to a rule-based approach from the sequential integration approach.
A. J. Goldschen, in “Continuous automatic speech recognition by lipreading” (Ph.D. dissertation, George Washington University, Washington, September 1993), analyzed a number of features of the binary images such as height, width and perimeter, along with derivatives of these quantities, and used these features as the input to an HMM (Hidden Markov Model)-based visual speech recognition system. Since then, several experiments have been performed by various researchers to improve upon these basic blocks of audio-visual speech recognition (Chen et al., supra, and: Gerasimos Potamianos and Hans Peter Graf, “Discriminative Training of HMM Stream Exponents for Audio-Visual Speech Recognition”, ICASSP '98; Christopher Bregler and Yochai Konig, “‘Eigenlips’ for Robust Speech Recognition”, ICASSP '98; C. Bregler, Stefan Manke, Hermann Hild, Alex Waibel, “Bimodal Sensor Integration on the Example of ‘Speech Reading’”, IEEE International Conference on Neural Networks, 1993; Uwe Meier, Wolfgang Hürst and Paul Duchnowski, “Adaptive Bimodal Sensor Fusion for Automatic Speechreading”, ICASSP '96; C. Bregler, H. Manke, A. Waibel, “Improved Connected Letter Recognition by Lipreading”, ICASSP '93; and Mamoun Alissali, Paul Deleglise and Alexandrina Rogozan, “Asynchronous Integration of Visual Information in an Automatic Speech Recognition System”, ICSLP '96).
However, challenges are often encountered when there is a need to combine audio and visual streams in an intelligent manner. While a general discussion of data fusion may be found in “Mathematical Techniques in Multisensor Data Fusion” (David L. Hall, Artech House, 1992), the article “Audio-Visual Large Vocabulary Continuous Speech Recognition in the Broadcast Domain” (Basu et al., IEEE Workshop on Multimedia Signal Processing, Sep. 13-15, Copenhagen 1999) describes early attempts at audio-visual recognition. A need, however, has been recognized in connection with producing improved results.
Generally speaking, some problems have been recognized in conventional arrangements in combining audio with video for speech recognition. For one, audio and video features have different dynamic ranges. Additionally, audio and video features have different numbers of distinguishable classes, that is, the number of phonemes typically differs from the number of visemes. Further, due to complexities involved in articulatory phenomena, there tends to be a time offset between audio and video signals (see “Eigenlips”, supra). Moreover, video signals tend to be sampled at a slower rate than audio signals and, therefore, need to be interpolated.
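By way of illustration only, the rate mismatch noted above may be addressed by interpolating the video features up to the audio frame rate. The following is a minimal sketch assuming linear interpolation; the function name `upsample_linear` and the frame rates used are hypothetical and are not taken from the cited references.

```python
def upsample_linear(frames, src_rate, dst_rate):
    """Linearly interpolate feature vectors from src_rate to dst_rate (frames per second).

    `frames` is a list of equal-length lists of floats, one per source frame.
    Linear interpolation is an assumption made for this sketch only.
    """
    if not frames:
        return []
    last = len(frames) - 1
    duration = last / src_rate                    # time stamp of the last source frame
    n_out = int(duration * dst_rate + 1e-9) + 1   # output frames covering the same span
    out = []
    for k in range(n_out):
        pos = k / dst_rate * src_rate             # position measured in source frames
        i = min(int(pos), last)                   # source frame at or before pos
        j = min(i + 1, last)                      # next source frame (clamped at the end)
        w = pos - i                               # fractional distance from frame i to j
        out.append([(1 - w) * a + w * b for a, b in zip(frames[i], frames[j])])
    return out


if __name__ == "__main__":
    # Hypothetical mouth-height feature sampled at 1 frame/s, resampled to 2 frames/s.
    video = [[0.0], [1.0], [2.0]]
    print(upsample_linear(video, src_rate=1, dst_rate=2))
```

In practice the video rate might be on the order of tens of frames per second and the audio feature rate higher; the same function applies with those rates substituted.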
In view of the problems stated above and others, two different approaches to combining audio and visual information have been tried. In the first approach, termed “early integration” or “feature fusion”, audio and visual features are computed from the acoustic and visual speech, respectively, and are combined prior to the recognition experiment. Since the two sets of features correspond to different feature spaces, they may differ in their characteristics as described above. Therefore, this approach essentially requires an intelligent way to combine the audio and visual features. The recognition is performed with the combined features and the output of the recognizer is the final result. This approach has been described in Chen et al., Potamianos et al., “Eigenlips” and Basu et al., supra. However, it has been found that this approach cannot handle different classifications in audio and video since it uses a common recognizer for both.
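For illustration, early integration may be sketched as normalizing each stream and concatenating the time-aligned feature vectors before a single recognizer is applied. The per-dimension z-score normalization below is an assumption chosen to address the differing dynamic ranges; it is not stated to be the method of any cited reference, and the names `zscore` and `fuse_features` are hypothetical.

```python
from statistics import mean, pstdev


def zscore(frames):
    """Normalize each feature dimension to zero mean and unit variance (assumed scheme)."""
    dims = list(zip(*frames))                      # one tuple of values per dimension
    mus = [mean(d) for d in dims]
    sds = [pstdev(d) or 1.0 for d in dims]         # guard against constant dimensions
    return [[(x - m) / s for x, m, s in zip(f, mus, sds)] for f in frames]


def fuse_features(audio_frames, video_frames):
    """Concatenate normalized audio and video vectors frame by frame.

    Both streams are assumed to be already aligned to the same frame rate.
    """
    if len(audio_frames) != len(video_frames):
        raise ValueError("streams must have the same number of frames")
    return [a + v for a, v in zip(zscore(audio_frames), zscore(video_frames))]


if __name__ == "__main__":
    audio = [[0.0], [2.0]]        # hypothetical audio feature, small dynamic range
    video = [[100.0], [300.0]]    # hypothetical video feature, large dynamic range
    print(fuse_features(audio, video))
```

The fused vectors would then be passed to one common recognizer, which is why this arrangement cannot assign different class inventories to the two streams.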
In the second approach, termed “late integration” or “decision fusion”, separate recognizers are incorporated for audio and visual channels. The outputs of the two recognizers are then combined to arrive at the final result. The final step of combining the two outputs is essentially the most important step in this approach since it concerns issues of orthogonality between the two channels as well as the reliability of the two channels. This approach tends to handle very easily the different classifications in audio and video channels as the recognizers for them are separate and the combination is at the output level. This approach has been described in “Bimodal Sensor Integration”, Meier et al., “Improved Connected Letter . . . ” and Alissali et al., supra.
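For illustration, late integration may be sketched as combining the per-hypothesis scores produced by two separate recognizers. The weighted sum of log-likelihoods below, with a single reliability weight, is one commonly used combination rule and is assumed here for the sketch; the function name `fuse_decisions`, the weight and the scores are hypothetical.

```python
def fuse_decisions(audio_scores, video_scores, audio_weight=0.7):
    """Combine per-hypothesis log-likelihoods from two separate recognizers.

    `audio_scores` and `video_scores` map a hypothesis label to a log-likelihood.
    `audio_weight` expresses the assumed relative reliability of the audio channel.
    Returns the best hypothesis and the dictionary of combined scores.
    """
    shared = audio_scores.keys() & video_scores.keys()   # hypotheses scored by both
    if not shared:
        raise ValueError("recognizers share no hypotheses")
    combined = {
        h: audio_weight * audio_scores[h] + (1.0 - audio_weight) * video_scores[h]
        for h in shared
    }
    best = max(combined, key=combined.get)
    return best, combined


if __name__ == "__main__":
    # Hypothetical scores: audio slightly prefers "d", video strongly prefers "b".
    audio = {"b": -2.0, "d": -1.8}
    video = {"b": -0.5, "d": -3.0}
    print(fuse_decisions(audio, video, audio_weight=0.5)[0])
```

Lowering `audio_weight` in a noisy environment shifts the decision toward the visual recognizer, which reflects the channel-reliability consideration described above.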
However, it is to be noted that conventional approaches, whether involving “early” or “late” integration, use a single-phase experiment with a fixed set of phonetic or visemic classes and that the results are not always as favorable as desired. A need has thus been recognized in connection with providing a more effective combination strategy.
The present invention broadly contemplates methods and apparatus for providing innovative strategies for data fusion, particularly, multi-phase (such as two-phase) hierarchical combination strategies. Surprising and unexpected results have been observed in connection with the inventive strategies.
In accordance with at least one presently preferred embodiment of the present invention, in particular, the combined likelihood of a phone is determined in two phases. In the first phase, a limited number of viseme-based classes (a number which will typically be smaller than the number of corresponding phoneme-based classes) are used for both audio and video. At the end of the first phase, the most likely viseme-based class is determined. In the second phase, only those phones that are embedded in the viseme given by the first phase are considered.
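The two-phase determination described above may be sketched as follows. This is a minimal illustration under stated assumptions: the viseme-to-phone table, the weighted log-likelihood combination, the weights `w1` and `w2`, and all names are hypothetical and are not specified in the foregoing text.

```python
# Hypothetical mapping from viseme-based classes to the phones embedded in each.
VISEME_TO_PHONES = {
    "bilabial": ["p", "b", "m"],
    "alveolar": ["t", "d", "n"],
}


def combine(audio_ll, video_ll, audio_weight):
    """Weighted sum of two log-likelihoods (assumed combination rule)."""
    return audio_weight * audio_ll + (1.0 - audio_weight) * video_ll


def two_phase_decode(audio_viseme_ll, video_viseme_ll,
                     audio_phone_ll, video_phone_ll, w1=0.5, w2=0.7):
    """Phase 1: pick the most likely viseme class using both channels.
    Phase 2: pick the most likely phone among those embedded in that viseme.
    """
    viseme = max(
        VISEME_TO_PHONES,
        key=lambda c: combine(audio_viseme_ll[c], video_viseme_ll[c], w1),
    )
    phone = max(
        VISEME_TO_PHONES[viseme],   # phones outside the chosen viseme are not considered
        key=lambda p: combine(audio_phone_ll[p], video_phone_ll[p], w2),
    )
    return viseme, phone


if __name__ == "__main__":
    # Hypothetical scores for one frame: audio alone would favor "d".
    audio_vis = {"bilabial": -2.0, "alveolar": -1.9}
    video_vis = {"bilabial": -0.3, "alveolar": -4.0}
    audio_ph = {"p": -3.0, "b": -2.0, "m": -4.0, "t": -3.5, "d": -1.5, "n": -4.0}
    video_ph = {"p": -1.0, "b": -1.0, "m": -1.0, "t": -3.0, "d": -3.0, "n": -3.0}
    print(two_phase_decode(audio_vis, video_vis, audio_ph, video_ph))
```

In this example the phone with the best audio-only score lies outside the viseme selected in the first phase and is therefore excluded in the second phase, which illustrates how the first phase constrains the second.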
The present invention, in accordance with at least one presently preferred embodiment, broadly contemplates methods and apparatus in which a video signal associated with a video source and an audio signal associated with the video signal are processed, the most likely viseme associated with the audio signal and video signal is determined and, thereafter, the most likely phoneme associated with the audio signal and video signal is determined.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.