The present invention relates generally to speech recognition and speaker identification systems and, more particularly, to methods and apparatus for using video and audio information to provide improved speaker identification.
Many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information, for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user. The process of classifying audio information by subject has essentially become fully automated.
The process of classifying audio information by speaker, however, often remains a labor-intensive task, especially for real-time applications, such as broadcast news. While a number of computationally intensive off-line techniques have been proposed for automatically identifying a speaker from an audio source using speaker enrollment information, the speaker classification process is most often performed by a human operator who identifies each speaker change and provides a corresponding speaker identification.
Humans may be identified based on a variety of attributes of the person, including acoustic cues, visual appearance cues and behavioral characteristics, such as characteristic gestures or lip movements. In the past, machine implementations of person identification have focused on single techniques relating to audio cues alone (for example, audio-based speaker recognition), visual cues alone (for example, face identification or iris identification) or other biometrics. More recently, however, researchers have attempted to combine multiple modalities for person identification, see, e.g., J. Bigun, B. Duc, F. Smeraldi, S. Fischer and A. Makarov, "Multi-Modal Person Authentication," in H. Wechsler, J. Phillips, V. Bruce, F. Fogelman Soulie, T. Huang (eds.) Face Recognition: From Theory to Applications, Berlin: Springer-Verlag, 1999. U.S. patent application Ser. No. 09/369,706, filed Aug. 6, 1999, entitled "Methods and Apparatus for Audio-Visual Speaker Recognition and Utterance Verification," assigned to the assignee of the present invention, discloses methods and apparatus for using video and audio information to provide improved speaker recognition.
Speaker recognition is an important technology for a variety of applications, including security applications and indexing applications that permit searching and retrieval of digitized multimedia content. Indexing systems, for example, transcribe and index audio information to create content index files and speaker index files. The generated content and speaker indexes can thereafter be utilized to perform query-document matching based on the audio content and the speaker identity. The accuracy of such indexing systems, however, depends in large part on the accuracy of the speaker identification. Currently available speaker recognition systems require further improvements in accuracy, especially under acoustically degraded conditions, such as background noise, and channel mismatch conditions. A need therefore exists for a method and apparatus that automatically transcribes audio information and concurrently identifies speakers in real time using audio and video information. A further need exists for a method and apparatus for providing improved speaker recognition that performs successfully in the presence of acoustic degradation, channel mismatch, and other conditions which have hampered existing speaker recognition techniques. Yet another need exists for a method and apparatus for providing improved speaker recognition that integrates the results of speaker recognition using audio and video information.
Generally, a method and apparatus are disclosed for identifying the speakers in an audio-video source using both audio and video information. The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system, an audio-based speaker identification system and a video-based speaker identification system. The audio-based speaker identification system identifies one or more potential speakers for a given segment using an enrolled speaker database. The video-based speaker identification system identifies one or more potential speakers for a given segment using a face detector/recognizer and an enrolled face database. An audio-video decision fusion process evaluates the individuals identified by the audio-based and video-based speaker identification systems and determines the speaker of an utterance in accordance with the present invention.
In one implementation, a linear variation is imposed on the ranked lists produced using the audio and video information by: (1) removing outliers using the Hough transform; and (2) fitting the surviving point set to a line using the least mean squares error method. Thus, the ranked identities output by the audio and video identification systems are reduced to two straight lines defined by:
$$\mathrm{audioScore} = m_1 \times \mathrm{rank} + b_1$$
and
$$\mathrm{videoScore} = m_2 \times \mathrm{rank} + b_2.$$
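The line-fitting step described above can be illustrated with a minimal sketch. It covers only step (2), an ordinary least-squares fit of score against rank, and assumes the points passed in are those that already survived the Hough-transform outlier removal of step (1), which is not shown. The function name `fit_line` and the sample values are hypothetical and do not come from the disclosure.

```python
def fit_line(ranks, scores):
    """Least-squares fit of score = m * rank + b.

    Assumes (ranks, scores) are the points that survived outlier
    removal; the Hough-transform step itself is not implemented here.
    Returns the slope m and intercept b.
    """
    n = len(ranks)
    if n < 2 or n != len(scores):
        raise ValueError("need at least two (rank, score) pairs of equal length")

    mean_rank = sum(ranks) / n
    mean_score = sum(scores) / n

    # Sum of squared deviations of rank, and rank/score co-deviation.
    sxx = sum((r - mean_rank) ** 2 for r in ranks)
    if sxx == 0:
        raise ValueError("all ranks are identical; slope is undefined")
    sxy = sum((r - mean_rank) * (s - mean_score) for r, s in zip(ranks, scores))

    m = sxy / sxx
    b = mean_score - m * mean_rank
    return m, b


# Illustrative values only: scores that fall by 2 per rank position.
m1, b1 = fit_line([1, 2, 3, 4], [10.0, 8.0, 6.0, 4.0])
```

Applying the same fit to the video ranked list would yield the second pair of parameters, m2 and b2.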
The decision fusion scheme of the present invention is based on a linear combination of the audio and the video ranked lists. The line with the higher slope is assumed to convey more discriminative information. The normalized slopes of the two lines are used as the weights of the respective results when combining the scores from the audio-based and video-based speaker analysis.
The weights assigned to the audio and the video scores affect the influence of their respective scores on the ultimate outcome. According to one aspect of the invention, the weights are derived from the data itself. With $w_1$ and $w_2$ representing the weights of the audio and the video channels, respectively, the fused score, $FS_k$, for each speaker is computed as follows:
$$w_1 = \frac{m_1}{m_1 + m_2} \quad \text{and} \quad w_2 = \frac{m_2}{m_1 + m_2}$$
$$FS_k = w_1 (m_1 \times \mathrm{rank}_k + b_1) + w_2 (m_2 \times \mathrm{rank}_k + b_2),$$
where $\mathrm{rank}_k$ is the rank for speaker k.
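The fused-score computation can be sketched as follows. This is an illustrative reading of the formula, not a definitive implementation: it takes the fitted line parameters as given and, following the formula literally, assumes a single rank per speaker. The function name `fuse_scores`, the speaker labels, and the numeric values are hypothetical.

```python
def fuse_scores(m1, b1, m2, b2, rank_by_speaker):
    """Compute the fused score for each speaker k.

    m1, b1: slope and intercept of the fitted audio line.
    m2, b2: slope and intercept of the fitted video line.
    rank_by_speaker: mapping from speaker label to rank (assumed here
    to be a single rank per speaker, as the formula is written).
    """
    slope_sum = m1 + m2
    if slope_sum == 0:
        raise ValueError("slopes sum to zero; weights are undefined")

    # Normalized slopes serve as the channel weights.
    w1 = m1 / slope_sum
    w2 = m2 / slope_sum

    return {
        speaker: w1 * (m1 * rank + b1) + w2 * (m2 * rank + b2)
        for speaker, rank in rank_by_speaker.items()
    }


# Illustrative values only: the audio line is steeper, so it gets more weight.
fused = fuse_scores(-2.0, 12.0, -1.0, 6.0, {"speaker_a": 1, "speaker_b": 2})
best = max(fused, key=fused.get)  # speaker with the highest fused score
```

In this example the audio weight is 2/3 and the video weight is 1/3, so the channel whose scores fall off more sharply with rank contributes more to the final decision.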