1. Field of the Invention
The present invention relates to interactive communications systems, and more particularly to a system and method for activating a microphone based on visual speech cues.
2. Description of the Related Art
Advances in computer technology have made real-time speech recognition systems available. In turn, speech recognition systems have opened the way toward intuitive and natural human-computer interaction (HCI). However, current HCI systems using speech recognition require a user to explicitly indicate an intent to speak by turning on a microphone using the keyboard or the mouse. This can be quite a hindrance to natural interaction with information. One of the key aspects of the naturalness of speech communication is the ability of humans to detect an intent to speak from a combination of visual and auditory cues. Visual cues include physical proximity, eye contact, lip movement, etc.
Unfortunately, these visual and auditory cues are difficult for computer systems to interpret. The naturalness of speech-based interaction with computers can be dramatically improved by developing methods for automatic detection of speech onset/offset during speech-based interaction with information (open-microphone solutions). However, purely audio-based techniques suffer from sensitivity to background noise. Furthermore, audio-based techniques require clever buffering techniques since the onset of speech can be robustly detected only when the speech energy crosses a threshold.
Therefore, a need exists for a system and method for determining visual cues conveyed by a speaker and employing these visual cues to activate a microphone.
A system for activating a microphone based on visual speech cues, in accordance with the invention, includes a feature tracker coupled to an image acquisition device. The feature tracker tracks features in an image of a user. A region of interest extractor is coupled to the feature tracker. The region of interest extractor extracts a region of interest from the image of the user. A visual speech activity detector is coupled to the region of interest extractor and measures changes in the region of interest to determine if a visual speech cue has been generated by the user. A microphone is turned on by the visual speech activity detector when a visual speech cue has been determined by the visual speech activity detector.
Another system for activating a microphone based on visual speech cues, in accordance with the present invention, includes a camera for acquiring images of a user and an image difference operator coupled to the camera for receiving image data from the camera and detecting whether a change in the image has occurred. A feature tracker is coupled to the image difference operator, and the feature tracker is activated if a change in the image is detected by the image difference operator to track facial features in an image of a user. A region of interest extractor is coupled to the feature tracker and the image difference operator, and the region of interest extractor extracts a region of interest from the image of the user. A visual speech activity detector is coupled to the region of interest extractor for measuring changes in the region of interest to determine if a visual speech cue has been generated by the user. A microphone is included and turned on by the visual speech activity detector when a visual speech cue has been determined by the visual speech activity detector.
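The role of the image difference operator described above can be illustrated with a minimal sketch. It assumes grayscale frames given as lists of pixel rows and uses the mean absolute pixel difference as the change measure; the function name `image_changed`, the change measure, and the default threshold are illustrative assumptions, not details stated in this description.

```python
def image_changed(prev_frame, curr_frame, threshold=5.0):
    """Return True when the mean absolute pixel difference exceeds threshold.

    Frames are assumed to be equally sized lists of rows of grayscale values.
    The threshold of 5.0 is an arbitrary illustrative value.
    """
    if prev_frame is None:
        return True  # no reference frame yet, so treat the image as changed
    total = 0
    count = 0
    for prev_row, curr_row in zip(prev_frame, curr_frame):
        for p, c in zip(prev_row, curr_row):
            total += abs(c - p)
            count += 1
    return count > 0 and (total / count) > threshold
```

In such an arrangement, the feature tracker would be invoked only for frames where this check reports a change, so that static scenes do not trigger tracking.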
In alternate embodiments, the feature tracker may track facial features of the user, and the feature tracker may include a feature detector for detecting facial features of the user. The region of interest extractor may extract a mouth portion of the image of the user. The visual speech cue may include movement between successive images of one of a mouth region and eyelids of the user. The visual speech cue may be determined in image space or in feature space. The visual speech activity detector may include a threshold value such that the visual speech cue is determined by a standard deviation calculation between regions of interest in successive images which exceeds the threshold value. The visual speech activity detector may provide a feature vector describing the extracted region of interest and may include a classifier for classifying the feature vector as a visual speech cue. The feature vector may be determined by a discrete wavelet transform. The classifier may include a Gaussian mixture model classifier. The system may further include an image difference operator coupled to the image acquisition device for receiving image data and detecting whether an image has changed. The system may include a microphone logic circuit for turning the microphone on when the visual speech cue is determined and turning the microphone off when no speech is determined.
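The threshold-based, image-space detection mentioned above might be sketched as follows. This is one possible reading, assuming the standard deviation is taken over the pixel-wise differences between the regions of interest of two successive images; the function name `visual_speech_cue` and the default threshold are hypothetical.

```python
import statistics


def visual_speech_cue(prev_roi, curr_roi, threshold=8.0):
    """Return True when the ROI changed enough to count as a visual speech cue.

    Assumption: the cue is measured as the population standard deviation of
    the pixel-wise differences between two equally sized grayscale ROIs.
    """
    diffs = [
        c - p
        for prev_row, curr_row in zip(prev_roi, curr_roi)
        for p, c in zip(prev_row, curr_row)
    ]
    if not diffs:
        return False  # empty ROI: nothing to measure
    return statistics.pstdev(diffs) > threshold
```

Under this reading, a uniform brightness shift across the whole region produces a standard deviation of zero and is not reported as a cue, whereas a localized change such as an opening mouth is.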
A method for activating a microphone based on visual speech cues includes the steps of: acquiring a current image of a face, updating face parameters when the current image of the face indicates a change from a previous image of the face, extracting a region of interest from the current image as dictated by the face parameters, computing visual speech activity based on the extracted region of interest, and activating a microphone for inputting speech when the visual speech activity has been determined.
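The sequence of method steps above can be sketched as a simple per-frame loop. The four callables (`has_changed`, `update_face_params`, `extract_roi`, `is_speaking`) are hypothetical placeholders standing in for the change detection, feature tracking, region extraction, and activity computation steps; their names and signatures are assumptions made for illustration only.

```python
def run_activation_loop(frames, has_changed, update_face_params,
                        extract_roi, is_speaking):
    """Yield the microphone state (True = on) for each acquired frame.

    All four callables are caller-supplied stand-ins for the method steps.
    """
    prev_frame = None
    prev_roi = None
    face_params = None
    for frame in frames:  # acquire the current image
        # Update face parameters only when the image indicates a change.
        if face_params is None or has_changed(prev_frame, frame):
            face_params = update_face_params(frame)
        # Extract the region of interest dictated by the face parameters.
        roi = extract_roi(frame, face_params)
        # Compute visual speech activity from successive regions of interest.
        mic_on = prev_roi is not None and is_speaking(prev_roi, roi)
        yield mic_on  # the microphone would be activated when True
        prev_frame, prev_roi = frame, roi
```

The generator form keeps the sketch independent of any particular camera or microphone interface; a caller would switch the actual microphone according to each yielded state.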
In alternate methods, the step of updating face parameters may include the step of invoking a feature tracker to detect and track facial features of the user. The region of interest may include a mouth portion of the image of the user. The step of computing visual speech activity may include calculating movement between successive images of one of a mouth region and eyelids of the user. The visual speech activity may be computed in image space or feature space. The step of computing visual speech activity may include determining a standard deviation between regions of interest in the current image and the previous image, and comparing the standard deviation to a threshold value such that if the threshold value is exceeded, visual speech activity is determined. The step of computing visual speech activity may include determining a feature vector based on the region of interest in the current image, and classifying the feature vector to determine if visual speech activity is present. The feature vector may be determined by a discrete wavelet transform. The step of activating a microphone for inputting speech when the visual speech activity has been determined may include marking an event when the visual speech activity is determined and activating the microphone in accordance with the event. A program storage device readable by machine, tangibly embodying a program of instructions executable by the machine, may be employed to perform the above method steps.
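As an illustration of computing a feature vector by a discrete wavelet transform, the following minimal sketch applies a one-dimensional Haar transform to the flattened region of interest and keeps the coarse approximation coefficients. The choice of the Haar wavelet, the row-by-row flattening, the number of levels, and the function names are all assumptions; this description does not specify them.

```python
import math


def haar_dwt_1d(signal):
    """One level of the orthonormal Haar DWT; returns (approximation, detail)."""
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    approx = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return approx, detail


def roi_feature_vector(roi, levels=2):
    """Flatten the ROI row by row and keep the Haar approximation coefficients.

    Stops early if the coefficient list can no longer be halved evenly.
    """
    coeffs = [float(p) for row in roi for p in row]
    for _ in range(levels):
        if len(coeffs) < 2 or len(coeffs) % 2 != 0:
            break
        coeffs, _detail = haar_dwt_1d(coeffs)
    return coeffs
```

A classifier would then take the resulting vector as input to decide whether visual speech activity is present.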
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.