The decision-making process of authenticating (e.g., recognizing, identifying, verifying) a speaker is an important step in ensuring the security of systems, networks, services and facilities, both for physical and for logical access. However, accurate speaker authentication is also a goal in applications other than secure access-based applications.
Some existing automated speaker authentication techniques rely exclusively on an audio stream captured from the speaker being authenticated. However, it is known that there exists a subset of the population for which audio-based authentication is problematic and inconsistent, and that this holds even for clean speech (e.g., speech collected over a high quality microphone in an environment with little noise), not only for degraded speech (e.g., speech collected over noisy phone lines or in environments with substantial background noise and distortion). For example, G. Doddington et al., “Sheep, Goats, Lambs and Wolves: An Analysis of Individual Differences in Speaker Recognition Performance,” NIST Presentation at ICSLP98, Sydney, Australia, November 1998, shows that there are speakers, termed “goats,” who are difficult to recognize based on their voice. Speakers who are readily recognized based on voice are termed “sheep.”
Thus, other existing automated speaker authentication techniques have adopted an approach wherein, in addition to the use of the audio stream, a video stream representing the speaker is taken into account, in some manner, in making the speaker authentication decision.
In accordance with such two-stream systems, one may make a manual, a priori decision as to the efficacy of audio data versus video data for each individual and subsequently use only the data corresponding to the more effective modality.
Another option is to model the joint statistics of the data streams. However, a more flexible option is to create models independently for each data modality and utilize scores and decisions from both. Previous studies utilizing independent models, such as those detailed in B. Maison et al., “Audio-Visual Speaker Recognition for Video Broadcast News: Some Fusion Techniques,” IEEE Multimedia Signal Processing (MMSP99), Denmark, September 1999, have been applied only at the test utterance level in the degraded speech case.
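By way of illustration only, the independent-model option may be sketched as a weighted combination of per-speaker scores produced separately by an audio model and a video model. The sketch below is not drawn from the cited references. The function names `fuse_scores` and `identify`, the `audio_weight` parameter, and the example scores are hypothetical, and the sketch assumes that the two sets of scores have already been normalized to a comparable scale.

```python
from typing import Dict


def fuse_scores(audio_scores: Dict[str, float],
                video_scores: Dict[str, float],
                audio_weight: float = 0.5) -> Dict[str, float]:
    """Combine per-speaker scores from two independently built models.

    Assumption: higher scores mean a better match, and both modalities
    are on a comparable scale. The weighting scheme is illustrative only.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    video_weight = 1.0 - audio_weight
    # Only speakers scored by both models can be fused.
    common = audio_scores.keys() & video_scores.keys()
    return {
        speaker: audio_weight * audio_scores[speaker]
        + video_weight * video_scores[speaker]
        for speaker in common
    }


def identify(fused: Dict[str, float]) -> str:
    """Return the enrolled speaker with the highest fused score."""
    if not fused:
        raise ValueError("no speaker was scored by both modalities")
    return max(fused, key=fused.get)


# Hypothetical scores for two enrolled speakers.
audio = {"alice": -2.0, "bob": -1.5}
video = {"alice": -0.5, "bob": -3.0}

equal_weight_choice = identify(fuse_scores(audio, video, audio_weight=0.5))
audio_only_choice = identify(fuse_scores(audio, video, audio_weight=1.0))
```

Setting `audio_weight` to 1.0 or 0.0 reduces this sketch to the single-modality selection described above, while intermediate values use the scores of both modalities.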