1. Field of the Invention
The invention pertains to the field of machine speech recognition and, more specifically, to the enhancement of acoustic speech recognition by using machine lip reading in conjunction with acoustic data in a neural network classification system.
2. Background to the Invention
The goal of automatic or machine speech recognition is to design a system that approaches the human ability to understand spoken language amidst variations in speaker accent, gender, speech rate, and degree of coarticulation, all in the presence of acoustic distractors and noise. Current automated systems lack the accuracy and robustness needed to meet the vast demand in such applications as computer speech-to-text conversion, automatic translation, and speech-based control systems. Representative approaches include hidden Markov models, in which transition probabilities are encoded in links between nodes (states) representing phonemic segments, and "blackboard" methods, in which multiple special-purpose phonological, lexical, and grammar-based subsystems are combined to work synergistically to maximize the speech recognition score. More recently, neural networks have been applied with some success in limited domains, for example, as described by A. Waibel in an article entitled "Modular Construction of Time-Delay Neural Networks for Speech Recognition," published in Neural Computation 1, 39-46 (1989).
Any predictive source of information and any constraints that could be reasonably incorporated into an artificial system would tend to increase the recognition accuracy and thus be desirable to include in a speech recognition system. Traditionally, most research has focussed on the inclusion of high level linguistic information such as grammatical and syntactical data. It is clear that humans can employ information other than the acoustic signal in order to enhance understanding. For example, hearing impaired humans often utilize visual information for "speech reading" in order to improve recognition accuracy. See, for example, Dodd, B. and Campbell, R. (eds.), "Hearing by Eye: The Psychology of Lipreading," Hillsdale, N.J., Lawrence Erlbaum Press (1987); or DeFilippo, C. L. and Sims, D. G. (eds.), "New Reflections on Speechreading," special issue of The Volta Review 90(5), (1988).
Speech reading can provide direct information about speech segments and phonemes, as well as about rate, speaker gender, and identity, and subtle information for separating speech from background noise. The well-known "cocktail party effect," in which speech corrupted by crowd noise is made significantly more intelligible when the talker's face can be seen, provides strong evidence that humans use visual information in speech recognition.
Several speech reading systems have been described recently, including:
a) Petajan, E. D., et al., "An Improved Automatic Lipreading System to Enhance Speech Recognition," ACM SIGCHI-88, 19-25 (1988);
b) Pentland, A., et al., "Lip Reading: Automatic Visual Recognition of Spoken Words," Proc. Image Understanding and Machine Vision, Optical Society of America, Jun. 12-14 (1984); and
c) Yuhas, B. P., et al., "Integration of Acoustic and Visual Speech Signals Using Neural Networks," IEEE Communications Magazine, November 1989.
Petajan, et al. used thresholded images (pixels) of a talker's face during the production of a word together with a dictionary of pre-stored labelled utterances and a standard distance classifier for visual recognition.
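The general idea of thresholding an image and matching it against a dictionary of labelled templates with a distance classifier can be sketched as follows. This is an illustrative sketch only, not the system of Petajan, et al.: it assumes a single tiny grayscale frame per utterance rather than a sequence of frames, a Euclidean distance, and hypothetical names (`threshold`, `classify`, `dictionary`), labels, and pixel values chosen purely for the example.

```python
from math import sqrt

def threshold(image, level):
    """Binarize a grayscale image (a list of rows) into a flat 0/1 vector."""
    return [1 if pixel >= level else 0 for row in image for pixel in row]

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(image, dictionary, level=128):
    """Return the label of the stored template closest to the thresholded image."""
    features = threshold(image, level)
    return min(dictionary, key=lambda label: distance(features, dictionary[label]))

# Hypothetical 3x3 "mouth" images: a dark center for an open mouth,
# a bright horizontal band for closed lips.
OPEN = [[200, 200, 200], [200, 10, 200], [200, 200, 200]]
CLOSED = [[10, 10, 10], [200, 200, 200], [10, 10, 10]]

# Dictionary of pre-stored labelled templates (labels are assumptions).
dictionary = {"ah": threshold(OPEN, 128), "mm": threshold(CLOSED, 128)}

# A noisy observation that still resembles the open-mouth template.
noisy = [[190, 210, 180], [220, 40, 120], [205, 199, 230]]
result = classify(noisy, dictionary)
print(result)
```

In this sketch the noisy frame differs from the open-mouth template in only one thresholded pixel, so the nearest template is the one labelled "ah".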
Pentland, et al. used an optical flow technique to estimate the velocities of the upper lip, lower lip, and the two corners of the mouth from the raw pixel video image of the mouth. They then applied principal components analysis and a minimum distance classifier to three- and four-digit phrases.
Yuhas, et al. trained a neural network on static images of the mouth shape for vowel recognition, together with a controller with free parameters for adjusting the relative weights of the visual and auditory contributions to obtain the best recognition in the presence of different levels of acoustic noise.
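The notion of adjusting the relative weights of visual and auditory contributions can be illustrated with a simple weighted combination of per-class scores. The form of the controller used by Yuhas, et al. is not described here, so the sketch below assumes a single free parameter, `visual_weight`, in a convex combination; the function names and the score values are hypothetical.

```python
def fuse(acoustic, visual, visual_weight):
    """Combine per-class scores as a convex mix of acoustic and visual evidence."""
    if not 0.0 <= visual_weight <= 1.0:
        raise ValueError("visual_weight must lie in [0, 1]")
    return {
        label: (1.0 - visual_weight) * acoustic[label] + visual_weight * visual[label]
        for label in acoustic
    }

def recognize(acoustic, visual, visual_weight):
    """Return the vowel label with the highest fused score."""
    fused = fuse(acoustic, visual, visual_weight)
    return max(fused, key=fused.get)

# Hypothetical scores for three vowels. The acoustic scores are nearly
# ambiguous, as might occur in noise; the visual scores favor "i".
acoustic = {"a": 0.40, "i": 0.35, "u": 0.25}
visual = {"a": 0.20, "i": 0.70, "u": 0.10}

acoustic_only = recognize(acoustic, visual, 0.0)   # ignore the visual channel
visual_heavy = recognize(acoustic, visual, 0.6)    # lean on the visual channel
print(acoustic_only, visual_heavy)
```

With these assumed scores, relying on the acoustic channel alone selects "a", while weighting the visual channel more heavily selects "i". In practice the weight would be tuned to the acoustic noise level rather than fixed by hand.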