The present invention relates generally to audio information classification systems and, more particularly, to methods and apparatus for transcribing audio information and identifying speakers in an audio file.
Many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user.
Thus, the process of classifying audio information by subject has essentially become fully automated. The process of classifying audio information by speaker, however, often remains a labor-intensive task, especially for real-time applications, such as broadcast news. While a number of computationally intensive off-line techniques have been proposed for automatically identifying a speaker from an audio source using speaker enrollment information, the speaker classification process is most often performed by a human operator who identifies each speaker change and provides a corresponding speaker identification.
The parent and grandparent applications to the present invention disclose methods and apparatus for retrieving audio information based on the audio content (subject) as well as the identity of the speaker. The parent application, U.S. patent application Ser. No. 09/345,237, for example, discloses a method and apparatus for automatically transcribing audio information from an audio source while concurrently identifying speakers in real-time, using an existing enrolled speaker database. The method of the parent application, however, can identify only those speakers who are already in the enrolled speaker database. In addition, the parent application does not allow new speakers to be added to the enrolled speaker database while audio information is being processed in real-time. A need therefore exists for a method and apparatus that automatically identifies unknown speakers in real-time or in an off-line manner. A further need exists for a method and apparatus that automatically identifies unknown speakers using concurrent transcription, segmentation, speaker identification and clustering techniques.
Generally, a method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. The disclosed unknown speaker classification system includes a speech recognition system, a speaker segmentation system, a clustering system and a speaker identification system. The speech recognition system produces transcripts with time-alignments for each word in the transcript. The speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. The clustering system clusters homogeneous segments (generally corresponding to the same speaker), and assigns a cluster identifier to each detected segment, whether or not the actual name of the speaker is known. Thus, segments corresponding to the same speaker should have the same cluster identifier.
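By way of illustration only, the relationship between the time-aligned transcript, the segment boundaries and the cluster identifiers can be sketched as follows. The `Word` and `Segment` records, their field names and the `words_by_segment` helper are hypothetical and are not taken from the disclosure; the sketch assumes that each word carries a start time in seconds and that each segment has already received a cluster identifier.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # start time in seconds (assumed time-alignment format)


@dataclass
class Segment:
    start: float     # segment boundary, in seconds
    end: float       # next segment boundary, in seconds
    cluster_id: int  # identifier assigned by the clustering step


def words_by_segment(words, segments):
    """Group time-aligned words under the segment that contains their start time.

    Returns one (cluster_id, text) pair per segment, in segment order.
    """
    result = []
    for seg in segments:
        text = " ".join(w.text for w in words if seg.start <= w.start < seg.end)
        result.append((seg.cluster_id, text))
    return result


words = [Word("good", 0.0), Word("evening", 0.4),
         Word("thank", 2.1), Word("you", 2.4),
         Word("tonight", 4.2)]
segments = [Segment(0.0, 2.0, 1), Segment(2.0, 4.0, 2), Segment(4.0, 6.0, 1)]
grouped = words_by_segment(words, segments)
```

In this sketch the first and third segments share cluster identifier 1, reflecting the expectation that segments from the same speaker receive the same identifier even when the speaker's name is not known.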
According to one aspect of the invention, the disclosed speaker identification system uses an enrolled speaker database that includes background models for unenrolled speakers to assign a speaker to each identified segment. Once the speech segments are identified by the segmentation system, the disclosed unknown speaker identification system compares the segment utterances to the enrolled speaker database and finds the “closest” speaker, if any, to assign a speaker label to each identified segment. A speech segment having an unknown speaker is initially assigned a general speaker label from a set of background models for speaker identification, such as “unenrolled male” or “unenrolled female.” The “unenrolled” segment is assigned a segment number and receives a cluster identifier assigned by the clustering system. Thus, the clustering system assigns a unique cluster identifier for each speaker to further differentiate the general speaker labels.
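A minimal sketch of this labeling step is given below, under stated assumptions. It assumes that each model in the database, enrolled or background, is summarized by a single feature vector and that closeness is measured by Euclidean distance; the disclosure does not specify either choice here. The names `closest_speaker`, `label_segment` and the example model values are hypothetical.

```python
import math


def closest_speaker(segment_vec, models):
    """Return the name of the model nearest to the segment's feature vector.

    `models` maps a speaker label to a feature vector and contains both
    enrolled speakers and background models such as "unenrolled male".
    Euclidean distance is an assumption made for this sketch.
    """
    return min(models, key=lambda name: math.dist(segment_vec, models[name]))


def label_segment(segment_number, segment_vec, cluster_id, models):
    """Build a record with the speaker label and cluster identifier of a segment."""
    label = closest_speaker(segment_vec, models)
    return {
        "segment": segment_number,
        "label": label,
        "cluster": cluster_id,
        # Background models are recognized by their label prefix in this sketch.
        "enrolled": not label.startswith("unenrolled"),
    }


models = {
    "Speaker A": (1.0, 1.0),          # hypothetical enrolled speaker
    "unenrolled male": (5.0, 5.0),    # background model
    "unenrolled female": (9.0, 9.0),  # background model
}
known = label_segment(1, (1.2, 0.9), 7, models)
unknown = label_segment(2, (8.5, 9.1), 3, models)
```

Because the background models sit in the same database as the enrolled speakers, a segment from an unknown speaker falls to a general label without a separate rejection threshold, and the cluster identifier then distinguishes one unenrolled speaker from another.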
The results of the present invention can be directly output to a user, for example, providing the transcribed text for each segment, together with the assigned speaker label. If a given segment is assigned a temporary speaker label associated with an unenrolled speaker, the user can be prompted to provide the name of the speaker. Once the user assigns a speaker label to an audio segment having an unknown speaker, the same speaker name can be automatically assigned to any segments having the same cluster identifier. In addition, the enrolled speaker database can be updated to enroll the previously unknown speaker using segments associated with the speaker as speaker training files.
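The propagation of a user-supplied name across segments sharing a cluster identifier can be sketched as follows. The record layout (dictionaries with `segment`, `label` and `cluster` keys) and the `assign_name` helper are hypothetical and serve only to illustrate the idea; the returned segment numbers stand in for the audio that could serve as speaker training files for enrollment.

```python
def assign_name(segments, segment_number, name):
    """Apply a user-supplied name to every segment in the same cluster.

    `segments` is a list of dicts with "segment", "label" and "cluster" keys.
    Returns the segment numbers that received the name, which could then be
    collected as training material for enrolling the speaker.
    """
    target = next(s["cluster"] for s in segments if s["segment"] == segment_number)
    renamed = []
    for s in segments:
        if s["cluster"] == target:
            s["label"] = name
            renamed.append(s["segment"])
    return renamed


segments = [
    {"segment": 1, "label": "unenrolled female", "cluster": 3},
    {"segment": 2, "label": "Speaker A", "cluster": 7},
    {"segment": 3, "label": "unenrolled female", "cluster": 3},
    {"segment": 4, "label": "unenrolled female", "cluster": 5},
]
training_segments = assign_name(segments, 1, "Speaker B")
```

Segment 4 carries the same general label as segments 1 and 3 but a different cluster identifier, so it keeps its temporary label until the user names that speaker as well.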
A more complete understanding of the present invention, as well as further features and advantages of the present invention, will be obtained by reference to the following detailed description and drawings.