The present invention relates generally to audio information classification systems and, more particularly, to methods and apparatus for transcribing audio information and identifying speakers in an audio file.
Many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user.
Thus, the process of classifying audio information by subject has essentially become fully automated. The process of classifying audio information by speaker, however, often remains a labor-intensive task, especially for real-time applications, such as broadcast news. While a number of computationally intensive off-line techniques have been proposed for automatically identifying a speaker from an audio source using speaker enrollment information, the speaker classification process is most often performed by a human operator who identifies each speaker change and provides a corresponding speaker identification.
The parent application to the present invention discloses a method and apparatus for retrieving audio information based on the audio content (subject) as well as the identity of the speaker. An indexing system transcribes and indexes the audio information to create time-stamped content index files and speaker index files. The generated content and speaker indexes can thereafter be utilized to perform query-document matching based on the audio content and the speaker identity. A need exists for a method and apparatus that automatically transcribes audio information from an audio source and concurrently identifies speakers in real-time. A further need exists for a method and apparatus that provides improved speaker segmentation and clustering based on the Bayesian Information Criterion (BIC).
Generally, a method and apparatus are disclosed for automatically transcribing audio information from an audio-video source and concurrently identifying the speakers. The disclosed audio transcription and speaker classification system includes a speech recognition system, a speaker segmentation system and a speaker identification system. According to one aspect of the invention, the audio information is processed by the speech recognition system, speaker segmentation system and speaker identification system along parallel branches in a multi-threaded environment.
The speech recognition system produces transcripts with time-alignments for each word in the transcript. The speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. The speaker identification system thereafter uses an enrolled speaker database to assign a speaker to each identified segment.
The present invention utilizes common front-end processing to compute feature vectors that are processed along parallel branches in a multi-threaded environment by the speech recognition system, speaker segmentation system and speaker identification system. Generally, the feature vectors can be distributed to the three processing threads, for example, using a shared memory architecture that acts in a server-like manner to distribute the computed feature vectors to each channel (corresponding to each processing thread).
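By way of illustration only, the distribution of feature vectors from a common front end to three parallel branches might be sketched as follows. The sketch substitutes one in-process queue per channel for the shared memory architecture mentioned above, and the feature computation is a placeholder (mean absolute amplitude per frame). The names `compute_features`, `run_pipeline` and `CHANNELS` are hypothetical and do not come from the disclosure.

```python
import queue
import threading

# Hypothetical channel names, one per processing thread.
CHANNELS = ("recognition", "segmentation", "identification")
_DONE = object()  # sentinel marking the end of the feature stream


def compute_features(samples, frame_len=4):
    """Placeholder front end: one 1-dimensional feature vector per full frame."""
    return [[sum(abs(s) for s in samples[i:i + frame_len]) / frame_len]
            for i in range(0, len(samples) - frame_len + 1, frame_len)]


def run_pipeline(samples, consumers):
    """Compute features once and hand every vector to every consumer thread.

    consumers maps a channel name to a function that takes the list of
    feature vectors received on that channel and returns a result.
    """
    queues = {name: queue.Queue() for name in consumers}
    results = {}

    def worker(name, fn):
        received = []
        while True:
            item = queues[name].get()
            if item is _DONE:
                break
            received.append(item)
        results[name] = fn(received)

    threads = [threading.Thread(target=worker, args=(name, fn))
               for name, fn in consumers.items()]
    for t in threads:
        t.start()
    # The front end runs once; each vector is distributed to all channels.
    for vector in compute_features(samples):
        for q in queues.values():
            q.put(vector)
    for q in queues.values():
        q.put(_DONE)
    for t in threads:
        t.join()
    return results


# Stand-in consumers that merely count the vectors they receive.
counts = run_pipeline(list(range(16)), {name: len for name in CHANNELS})
```

In this sketch each branch sees the same sequence of feature vectors, so the front-end computation is not repeated per branch. A real-time implementation would presumably replace the stand-in consumers with the recognition, segmentation and identification engines.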
According to another aspect of the invention, the audio information from the audio-video source is concurrently transcribed and segmented to identify segment boundaries. Once the speech segments are identified by the segmentation system, the speaker identification system assigns a speaker label to each portion of the transcribed text.
The disclosed segmentation process identifies all possible frames where there is a segment boundary, corresponding to a speaker change, on the same pass through the audio data as the transcription engine. A frame represents speech characteristics over a given period of time. The segmentation process determines whether or not there is a segment boundary at a given frame, i, using a model selection criterion that compares two models. A first model assumes that there is no segment boundary within a window of samples, (x1, . . . , xn), using a single full-covariance Gaussian. A second model assumes that there is a segment boundary within a window of samples, (x1, . . . , xn), using two full-covariance Gaussians, with (x1, . . . , xi) drawn from the first Gaussian, and (xi+1, . . . , xn) drawn from the second Gaussian.
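The two-model comparison can be illustrated with a minimal sketch, which assumes the conventional form of the Bayesian Information Criterion difference for full-covariance Gaussians; this section does not itself state the formula. Under that assumption, for a window of n frames of dimension d, the difference at frame i is -(n/2) log|Sw| + (i/2) log|Sf| + ((n-i)/2) log|Ss| + (lambda/2)(d + d(d+1)/2) log n, where Sw, Sf and Ss are the covariances of the whole window, the first i frames and the remaining frames, and a negative value favors the two-Gaussian model. The function names, the penalty weight lambda = 1, the `margin` parameter and the synthetic data are all hypothetical.

```python
import math
import random


def covariance(frames):
    """Maximum-likelihood (divide by n) full covariance of d-dimensional frames."""
    n, d = len(frames), len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(d)]
    return [[sum((f[a] - mean[a]) * (f[b] - mean[b]) for f in frames) / n
             for b in range(d)] for a in range(d)]


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    d = len(matrix)
    chol = [[0.0] * d for _ in range(d)]
    for r in range(d):
        for c in range(r + 1):
            s = matrix[r][c] - sum(chol[r][k] * chol[c][k] for k in range(c))
            chol[r][c] = math.sqrt(s) if r == c else s / chol[c][c]
    return 2.0 * sum(math.log(chol[k][k]) for k in range(d))


def delta_bic(frames, i, penalty_weight=1.0):
    """BIC(one Gaussian) minus BIC(two Gaussians split after frame i)."""
    n, d = len(frames), len(frames[0])
    penalty = 0.5 * penalty_weight * (d + d * (d + 1) / 2) * math.log(n)
    return (-0.5 * n * log_det(covariance(frames))
            + 0.5 * i * log_det(covariance(frames[:i]))        # x1 .. xi
            + 0.5 * (n - i) * log_det(covariance(frames[i:]))  # xi+1 .. xn
            + penalty)


def find_boundary(frames, margin=10, penalty_weight=1.0):
    """Return the most likely boundary frame i, or None if no split is favored.

    margin keeps enough frames on each side for a non-singular covariance.
    """
    if len(frames) < 2 * margin:
        return None
    candidates = range(margin, len(frames) - margin + 1)
    best = min(candidates, key=lambda i: delta_bic(frames, i, penalty_weight))
    return best if delta_bic(frames, best, penalty_weight) < 0 else None


# Synthetic window: 60 frames around (0, 0) followed by 60 frames around (5, 5).
rng = random.Random(0)
demo = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(60)]
demo += [[rng.gauss(5, 1), rng.gauss(5, 1)] for _ in range(60)]
boundary = find_boundary(demo)
```

The sketch tests a single window for a single boundary. How windows are advanced through the audio stream, and how multiple boundaries are handled, is not addressed here.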
The disclosed speaker identification system assigns a speaker label to each identified segment, using an enrolled speaker database. The speaker identification process receives the turns identified by the segmentation process, together with the feature vectors generated by the shared front end. Generally, the speaker identification system compares the segment utterances to the enrolled speaker database and finds the “closest” speaker. A model-based approach and a frame-based approach are disclosed for the speaker identification system.
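As a rough illustration of finding the "closest" enrolled speaker, the following sketch labels each frame of a segment with the nearest enrolled speaker and assigns the segment to the speaker receiving the most frame votes. It is only loosely in the spirit of a frame-based approach: representing each enrolled speaker by a single mean vector, using Euclidean distance, and voting by majority are assumptions made for this example, not details taken from the disclosure. The function names and speaker names are hypothetical.

```python
import math
from collections import Counter


def nearest_speaker(frame, enrolled):
    """Return the enrolled speaker whose mean vector is closest to the frame."""
    return min(enrolled, key=lambda name: math.dist(frame, enrolled[name]))


def identify_segment(frames, enrolled):
    """Label a segment with the speaker that wins the most per-frame votes."""
    votes = Counter(nearest_speaker(frame, enrolled) for frame in frames)
    return votes.most_common(1)[0][0]


# Hypothetical enrolled speaker database: one mean feature vector per speaker.
enrolled_db = {"alice": [0.0, 0.0], "bob": [5.0, 5.0]}
segment = [[0.2, -0.1], [4.8, 5.1], [0.1, 0.3]]
label = identify_segment(segment, enrolled_db)
```

A practical system would likely also need a rejection threshold so that a segment from a speaker who is not enrolled is not forced onto the nearest enrolled label; that refinement is omitted here.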
The results of the present invention can be directly output to a user, for example, providing the transcribed text for each segment, together with the assigned speaker label. In addition, the results of the present invention can be recorded in one or more databases and utilized by an audio retrieval system, such as the audio retrieval system disclosed in the parent application, that combines the results of content and speaker searching methods to provide references to audio information (and indirectly to video) based on the audio content as well as the identity of the speaker.
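The direct output described above, transcribed text for each segment together with its speaker label, could be assembled along the following lines. This is a sketch under stated assumptions: words are taken to arrive as (start time, word) pairs from the time-aligned transcript, segments as (start time, speaker label) pairs sorted by start time, and a word is taken to belong to the segment in which it starts. The function name and data layout are hypothetical.

```python
from bisect import bisect_right


def label_transcript(words, segments):
    """Group time-aligned words by segment and prefix each group with its speaker.

    words:    list of (start_time_seconds, word), in time order
    segments: list of (start_time_seconds, speaker_label), sorted by start time
    """
    starts = [start for start, _ in segments]
    groups = []  # each entry: (segment index, speaker label, list of words)
    for time, word in words:
        # Index of the last segment that starts at or before this word.
        idx = max(bisect_right(starts, time) - 1, 0)
        if groups and groups[-1][0] == idx:
            groups[-1][2].append(word)
        else:
            groups.append((idx, segments[idx][1], [word]))
    return [f"{speaker}: {' '.join(ws)}" for _, speaker, ws in groups]


lines = label_transcript(
    [(0.0, "good"), (0.4, "evening"), (1.2, "thank"), (1.5, "you")],
    [(0.0, "anchor"), (1.0, "guest")],
)
```

The same (segment, speaker, words) grouping could equally be written to a database for later retrieval rather than formatted as text.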
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.