Automatic speech recognition and speaker detection are becoming increasingly important areas of application for computer hardware and software development. Methods have been developed to extract features from an audio stream, analyze individual portions of the audio stream, and recognize human speech content contained in the audio stream. Extracted features may be used to generate derivative values such as Mel Frequency Cepstral Coefficients (MFCCs), which may be further processed using techniques such as linear discriminant analysis (LDA), delta, double delta, and the like. The details regarding the use of MFCCs in automatic speech recognition are well known to those of ordinary skill in the art.
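By way of illustration only, the delta and double delta processing mentioned above may be sketched as follows. The sketch assumes that the MFCCs have already been computed and are available as a matrix with one row per audio frame. The function name `delta`, the regression window of two frames, and the edge padding at the stream boundaries are assumptions chosen for the example and are not taken from any particular system.

```python
import numpy as np


def delta(features, width=2):
    """Regression-based delta of per-frame features.

    features: array of shape (num_frames, num_coeffs), e.g. MFCCs.
    width:    frames on each side used in the regression (assumed value).
    """
    features = np.asarray(features, dtype=float)
    num_frames = features.shape[0]
    # Repeat the first and last frames so every frame has `width` neighbors.
    padded = np.pad(features, ((width, width), (0, 0)), mode="edge")
    denom = 2.0 * sum(n * n for n in range(1, width + 1))
    out = np.zeros_like(features)
    for n in range(1, width + 1):
        ahead = padded[width + n : width + n + num_frames]
        behind = padded[width - n : width - n + num_frames]
        out += n * (ahead - behind)
    return out / denom


# Synthetic stand-in for real MFCCs: 10 frames, 13 coefficients,
# each coefficient rising by one per frame.
mfcc = np.tile(np.arange(10.0)[:, None], (1, 13))
d = delta(mfcc)   # delta (first-order) features
dd = delta(d)     # double delta (second-order) features
# Static, delta, and double delta features stacked per frame.
stacked = np.hstack([mfcc, d, dd])
```

In this sketch each frame's feature vector grows from 13 to 39 values, so that a downstream recognizer may observe how the coefficients change over time in addition to their instantaneous values.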
Recent research has also explored the use of an associated video stream to enhance predictions regarding the content of the audio stream. The video stream may be analyzed to determine whether the audio and video streams are in sync. The analysis of the video stream may also reveal whether a speaker is currently speaking.
Detecting whether the video of a speaking person in a frontal head pose corresponds to an accompanying audio track may be of interest in a wide range of applications. For example, in multi-subject videos, it may be desirable to detect a currently speaking subject to improve performance of speaker diarization/speaker turn detection, or speech separation in the case of overlapping speech, over uni-modal systems that employ traditional audio-only or visual-only processing techniques. As another example, in audio-visual biometrics, spoofing attacks may involve audio and visual data streams that are not in sync. This may occur where an impostor has obtained access to a unimodal target “fingerprint” (such as a recorded audio sample). As another example, in movies, successful lip-syncing/audio dubbing across languages may require that a newly generated audio track be well synchronized to the visual speech articulator motion of the actors in the original video. Finally, storage or transmission bandwidth limitations may cause the loss of blocks of video frames, thus resulting in poor quality video that may not match the audio track accurately. Each of the above problems may be addressed by reliably detecting audio-visual synchrony, which indicates consistency between the audio and visual streams.
Similarly, a multi-modal approach employing audio and visual analysis may allow enhanced recognition of speech content during automatic speech recognition processing. Accurate and reliable interpretation of the video stream may allow for improved recognition of phonemes and other utterances contained in human speech. However, traditional methods of analyzing visual features associated with a speaker are both inaccurate and inefficient.
In view of the challenges discussed above, a continued need exists for approaches that improve the accuracy and efficiency of audiovisual speech recognition. In particular, there is a need for accurate and efficient visual processing techniques for use in processing a video stream associated with a speaker.