The present invention relates generally to speech detection and recognition and, more particularly, to methods and apparatus for using video and audio information to provide improved speech detection and recognition in connection with arbitrary content video.
Although significant progress has been made over the last few years in machine transcription using large vocabulary continuous speech recognition (LVCSR), the technology to date is most effective only under controlled conditions such as low noise, speaker-dependent recognition, and read speech (as opposed to conversational speech).
In an attempt to improve speech recognition, it has been proposed to augment the recognition of speech utterances with visual cues. This approach has attracted the attention of researchers over the last couple of years; however, most efforts in this domain can be considered only preliminary in the sense that, unlike LVCSR efforts, tasks have been limited to small vocabularies (e.g., commands, digits) and often to speaker-dependent training or isolated-word speech, where word boundaries are artificially well defined.
The potential for joint audio-visual-based speech recognition is well established on the basis of psycho-physical experiments, e.g., see D. Stork and M. Hennecke, “Speechreading by humans and machines,” NATO ASI Series, Series F, Computer and System Sciences, vol. 150, Springer Verlag, 1996; and Q. Summerfield, “Use of visual information for phonetic perception,” Phonetica (36), pp. 314-331, 1979. Efforts have begun recently on experiments with small vocabulary letter or digit recognition tasks, e.g., see G. Potamianos and H. P. Graf, “Discriminative training of HMM stream exponents for audio-visual speech recognition,” ICASSP, 1998; C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” ICSLP, vol. II, pp. 669-672, 1994; and R. Stiefelhagen, U. Meier and J. Yang, “Real-time lip-tracking for lipreading,” preprint. In fact, canonical mouth shapes that accompany speech utterances have been categorized, and are known as visual phonemes or “visemes.” Visemes provide information that complements the phonetic stream from the point of view of confusability. For example, “mi” and “ni,” which are confusable acoustically, especially in noisy situations, are easy to distinguish visually: in “mi” the lips close at onset, whereas in “ni” they do not. As a further example, the unvoiced fricatives “f” and “s,” which are difficult to recognize acoustically, belong to two different viseme groups. However, use of visemes has, to date, been limited to small vocabulary type recognition, and the actual video cues have only been derived from controlled or non-arbitrary content video sources.
Thus, it would be highly desirable to provide methods and apparatus for employing visual information in conjunction with corresponding audio information to perform improved speech recognition, particularly, in the context of arbitrary content video.
Another related problem that has plagued conventional speech recognition systems is an inability of the recognizer to discriminate between extraneous audible activity, e.g., background noise or background speech not intended to be decoded, and speech that is indeed intended to be decoded. Due to this deficiency, a speech recognition system typically attempts to decode any signal picked up by its associated microphone, whether it be background noise, background speakers, etc. One solution has been to employ a microphone with a push-to-talk button. In such a case, the recognizer will only begin to decode audio picked up by the microphone when the speaker pushes the button. However, this approach has obvious limitations. For example, the environment in which the speech recognition system is deployed may not safely permit the user to physically push a button, e.g., a vehicle-mounted speech recognition system. Also, once the speaker pushes the button, any extraneous audible activity can still be picked up by the microphone, causing the recognizer to attempt to decode it. Thus, it would be highly advantageous to provide methods and apparatus for employing visual information in conjunction with corresponding audio information to accurately detect speech intended to be decoded so that the detection result would serve to automatically turn on/off decoders and/or a microphone associated with the recognition system.
The present invention provides various methods and apparatus for using visual information and audio information associated with arbitrary video content to provide improved speech recognition accuracy. Further, the invention provides methods and apparatus for using such visual information and audio information to decide whether or not to decode speech uttered by a speaker.
In a first aspect of the invention, a method of providing speech recognition comprises the steps of processing a video signal associated with an arbitrary content video source, processing an audio signal associated with the video signal, and decoding the processed audio signal in conjunction with the processed video signal to generate a decoded output signal representative of the audio signal. The video signal processing operations may preferably include detecting face candidates in the video signal, detecting facial features associated with the face candidates, and deciding whether at least one of the face candidates is in a frontal pose. Fisher linear discriminant (FLD) analysis and distance from face space (DFFS) measures are preferably employed in accordance with these detection techniques. Also, assuming at least one face is detected, visual speech feature vectors are extracted from the video signal. The audio signal processing operation may preferably include extracting audio feature vectors.
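By way of illustration only, the distance from face space (DFFS) measure mentioned above may be sketched as follows. The sketch assumes the commonly used definition of DFFS, namely the norm of the residual that remains after an image vector is projected onto a subspace spanned by orthonormal "eigenface" basis vectors; this summary does not itself state a formula. The function names, the threshold test, and the toy three-element "images" are hypothetical and are not taken from the invention.

```python
import math

def dffs(image, mean_face, eigenfaces):
    """Distance from face space, under the assumed standard definition:
    the norm of the part of (image - mean_face) that the eigenface
    subspace cannot reconstruct. `eigenfaces` must be orthonormal."""
    centered = [x - m for x, m in zip(image, mean_face)]
    # Projection coefficient of the centered image on each basis vector.
    coeffs = [sum(c * e for c, e in zip(centered, basis)) for basis in eigenfaces]
    # Reconstruction of the centered image from those coefficients.
    recon = [
        sum(w * basis[i] for w, basis in zip(coeffs, eigenfaces))
        for i in range(len(centered))
    ]
    residual = [c - r for c, r in zip(centered, recon)]
    return math.sqrt(sum(r * r for r in residual))

def is_face_candidate(image, mean_face, eigenfaces, threshold):
    """Hypothetical decision rule: accept a region whose DFFS is small."""
    return dffs(image, mean_face, eigenfaces) <= threshold

# Toy example: a 2-D "face space" embedded in a 3-D image space.
MEAN = [0.0, 0.0, 0.0]
BASIS = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
inside = dffs([3.0, 4.0, 0.0], MEAN, BASIS)   # lies in the subspace
outside = dffs([1.0, 2.0, 2.0], MEAN, BASIS)  # third component is residual
```

A small DFFS indicates that a candidate region is well explained by the face subspace. In practice the basis would be learned from training images, and the threshold would be tuned on held-out data.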
In one embodiment, phoneme probabilities are separately computed based on the visual feature vectors and the audio feature vectors. Viseme probabilities may alternatively be computed for the visual information. The probability scores are then combined, preferably using a confidence measure, to form joint probabilities which are used by a search engine to produce a decoded word sequence representative of the audio signal. In another embodiment, the visual feature vectors and the audio feature vectors are combined such that a single set of probability scores is computed for the combined audio-visual feature vectors. Then, these scores are used to produce the decoded output word sequence. In yet another embodiment, scores computed based on the information in one path are used to re-score results in the other path. A confidence measure may be used to weight the re-scoring that the other path provides.
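The first of these embodiments, in which separately computed scores are combined, may be illustrated by the following minimal sketch. It assumes a weighted log-linear combination in which a single weight between 0 and 1 stands in for the confidence measure; this summary does not specify the form of the combination, so the formula, the function name `fuse_log_probs`, and the example probabilities are all assumptions made for illustration.

```python
import math

def fuse_log_probs(audio_logp, visual_logp, audio_weight):
    """Combine per-phoneme log-probabilities from the two paths.

    audio_weight (assumed to lie in [0, 1]) stands in for a confidence
    measure: 1.0 trusts the audio path only, 0.0 the visual path only.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    fused = {
        ph: audio_weight * audio_logp[ph] + (1.0 - audio_weight) * visual_logp[ph]
        for ph in audio_logp
    }
    # Renormalize so the fused scores again form a probability distribution.
    log_norm = math.log(sum(math.exp(v) for v in fused.values()))
    return {ph: v - log_norm for ph, v in fused.items()}

# Acoustically confusable onsets ("m" vs. "n") that differ visually.
audio = {"m": math.log(0.5), "n": math.log(0.5)}
visual = {"m": math.log(0.9), "n": math.log(0.1)}
fused = fuse_log_probs(audio, visual, audio_weight=0.5)
best = max(fused, key=fused.get)
```

In this toy case the audio path cannot separate the two phonemes, so the visual path decides the outcome. Lowering `audio_weight` in noisy conditions would shift more of the decision onto the visual scores.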
In a second aspect of the invention, a method of providing speech detection in accordance with a speech recognition system comprises the steps of processing a video signal associated with a video source to detect whether one or more features associated with the video signal are representative of speech, and processing an audio signal associated with the video signal in accordance with the speech recognition system to generate a decoded output signal representative of the audio signal when the one or more features associated with the video signal are representative of speech. In one embodiment, a microphone associated with the speech recognition system is turned on such that an audio signal from the microphone may be initially buffered if at least one mouth opening is detected from the video signal. Then, the buffered audio signal is decoded in accordance with the speech recognition system if mouth opening pattern recognition indicates that subsequent portions of the video signal are representative of speech. The decoding operation results in a decoded output word sequence representing the audio signal. While the above embodiment relates to making the speech detection decision based on video information only, a decision could be made using audio information only.
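The buffering behavior of the above embodiment may be sketched as a small state machine. This is a sketch under stated assumptions, not the detection method of the invention: a simple count of open-mouth frames stands in for mouth opening pattern recognition, and the class name, the three frame-count parameters, and the per-frame `mouth_open` flag are hypothetical.

```python
from enum import Enum

class GateState(Enum):
    IDLE = "idle"            # microphone off, nothing buffered
    BUFFERING = "buffering"  # a mouth opening was seen; audio is held back
    DECODING = "decoding"    # visual pattern judged to be speech

class VisualSpeechGate:
    """Gate audio to a decoder using per-frame mouth-open flags."""

    def __init__(self, confirm_frames=3, max_pending_frames=10, release_frames=5):
        self.confirm_frames = confirm_frames          # open frames needed to confirm
        self.max_pending_frames = max_pending_frames  # buffer limit before giving up
        self.release_frames = release_frames          # closed frames that end decoding
        self._reset()

    def _reset(self):
        self.state = GateState.IDLE
        self.buffer = []
        self.open_count = 0
        self.closed_run = 0

    def step(self, mouth_open, audio_chunk):
        """Consume one frame; return the audio chunks to send to the decoder."""
        if self.state is GateState.IDLE:
            if mouth_open:  # first mouth opening: start buffering audio
                self.state = GateState.BUFFERING
                self.buffer = [audio_chunk]
                self.open_count = 1
            return []
        if self.state is GateState.BUFFERING:
            self.buffer.append(audio_chunk)
            if mouth_open:
                self.open_count += 1
            if self.open_count >= self.confirm_frames:
                released, self.buffer = self.buffer, []
                self.closed_run = 0
                self.state = GateState.DECODING
                return released  # flush everything buffered so far
            if len(self.buffer) >= self.max_pending_frames:
                self._reset()    # not speech after all: discard the buffer
            return []
        # DECODING: pass audio through until the mouth stays closed long enough.
        self.closed_run = 0 if mouth_open else self.closed_run + 1
        if self.closed_run >= self.release_frames:
            self._reset()
            return []
        return [audio_chunk]
```

Because audio is buffered from the first detected mouth opening, the onset of the utterance is not lost while the visual decision is still pending. An isolated mouth opening that is not followed by further openings results in the buffer being discarded without any decoding.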
Still further, the invention provides for performing event detection using both the visual information and the audio information simultaneously. Thus, the system looks not only at the mouth movement but also looks to the audio signal to make a determination of whether the received audio signal is speech or not speech. In such an embodiment, an utterance verification procedure, as will be explained, is used to provide this form of event detection.
Of course, it is to be appreciated that any one of the above embodiments or processes may be combined with one or more other embodiments or processes to provide even further speech recognition and speech detection improvements.
Also, it is to be appreciated that the video and audio signals may be of a compressed format such as, for example, the MPEG-2 standard. The signals may also come from either a live camera/microphone feed or a stored (archival) feed. Further, the video signal may include images of visible and/or non-visible (e.g., infrared or radio frequency) wavelengths. Accordingly, the methodologies of the invention may be performed with poor lighting, changing lighting, or no light conditions. Still further, the audio signal may be representative of conversational speech such that improved LVCSR is accomplished. Given the inventive teachings provided herein, one of ordinary skill in the art will contemplate various applications of the invention.
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.