The present invention relates generally to audio information classification systems and, more particularly, to methods and apparatus for identifying speakers in an audio file.
Many organizations, such as broadcast news organizations and information retrieval services, must process large amounts of audio information for storage and retrieval purposes. Frequently, the audio information must be classified by subject or speaker name, or both. In order to classify audio information by subject, a speech recognition system initially transcribes the audio information into text for automated classification or indexing. Thereafter, the index can be used to perform query-document matching to return relevant documents to the user.
Thus, the process of classifying audio information by subject has essentially become fully automated. The process of classifying audio information by speaker, however, often remains a labor-intensive task, especially for real-time applications, such as broadcast news. While a number of computationally intensive off-line techniques have been proposed for automatically identifying a speaker from an audio source using speaker enrollment information, the speaker classification process is most often performed by a human operator who identifies each speaker change and provides a corresponding speaker identification.
A number of techniques have been proposed or suggested for identifying speakers in an audio stream, including U.S. patent application Ser. No. 09/434,604, filed Nov. 5, 1999, U.S. patent application Ser. No. 09/345,237, filed Jun. 30, 1999, and U.S. patent application Ser. No. 09/288,724, filed Apr. 9, 1999, each assigned to the assignee of the present invention. U.S. patent application Ser. No. 09/345,237, for example, discloses a method and apparatus for automatically transcribing audio information from an audio source while concurrently identifying speakers in real-time, using an existing enrolled speaker database. The technique of U.S. patent application Ser. No. 09/345,237, however, can identify only those speakers who are already in the enrolled speaker database. In addition, it does not allow new speakers to be added to the enrolled speaker database while audio information is being processed in real-time. A need therefore exists for a method and apparatus that automatically identifies unknown speakers in real-time or in an off-line manner. A further need exists for a method and apparatus that automatically identifies unknown speakers using a hierarchical speaker model tree.
Generally, a method and apparatus are disclosed for identifying speakers participating in an audio-video source, whether or not such speakers have been previously registered or enrolled. A speaker segmentation system separates the speakers and identifies all possible frames where there is a segment boundary between non-homogeneous speech portions. The hierarchical speaker tree clustering system clusters homogeneous segments (generally corresponding to the same speaker), and assigns a cluster identifier to each detected segment, whether or not the actual name of the speaker is known. Thus, segments corresponding to the same speaker should have the same cluster identifier.
According to one aspect of the invention, the disclosed speaker identification system uses a hierarchical enrolled speaker database that includes one or more background models for unenrolled speakers to assign a speaker to each identified segment. Once the speech segments are identified by the segmentation system, the disclosed unknown speaker identification system compares the segment utterances to the enrolled speaker database using a hierarchical approach and finds the “closest” speaker, if any, to assign a speaker label to each identified segment.
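By way of illustration only, the hierarchical comparison described above might be sketched as a top-down search through a tree of speaker models. The sketch below is not the disclosed implementation: the names `SpeakerNode` and `find_closest_speaker`, the use of a single feature vector per model, the Euclidean distance measure, and the `max_distance` threshold are all assumptions introduced for the example.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Optional


@dataclass
class SpeakerNode:
    """Hypothetical tree node: a label plus one representative feature vector."""
    label: str
    centroid: List[float]
    children: List["SpeakerNode"] = field(default_factory=list)


def find_closest_speaker(root: SpeakerNode,
                         features: List[float],
                         max_distance: float) -> Optional[str]:
    """Descend the tree, following the nearest child at each level.

    Returns the label of the leaf that is reached, or None when even that
    leaf is farther away than max_distance (the "if any" case).
    """
    node = root
    while node.children:
        # Assumed distance measure: Euclidean distance between vectors.
        node = min(node.children, key=lambda child: dist(child.centroid, features))
    if dist(node.centroid, features) > max_distance:
        return None
    return node.label


# Illustrative tree: two groups, each holding one enrolled speaker and one
# background model for unenrolled speakers. All values are made up.
example_tree = SpeakerNode("root", [0.0, 0.0], [
    SpeakerNode("male group", [1.0, 1.0], [
        SpeakerNode("speaker_a", [0.8, 1.2]),
        SpeakerNode("unenrolled male", [1.5, 0.5]),
    ]),
    SpeakerNode("female group", [5.0, 5.0], [
        SpeakerNode("speaker_b", [5.2, 4.9]),
        SpeakerNode("unenrolled female", [4.0, 6.0]),
    ]),
])
```

In this sketch, only the children of the chosen node are compared at each level, so the number of comparisons grows with the depth of the tree rather than with the total number of enrolled speakers. A call such as `find_closest_speaker(example_tree, [0.9, 1.1], 1.0)` descends into the first group and returns the label of its nearest leaf.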
A speech segment having an unknown speaker is initially assigned a general speaker label from a set of background models for speaker identification, such as “unenrolled male” or “unenrolled female.” The “unenrolled” segment is assigned a cluster identifier and is positioned in the hierarchical tree. Thus, the hierarchical speaker tree clustering system assigns to each speaker a unique cluster identifier corresponding to a given node, to further differentiate the general speaker labels.
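By way of illustration only, the way in which cluster identifiers differentiate segments that share a general speaker label might be sketched as follows. The sketch substitutes a simple greedy, threshold-based grouping for the hierarchical speaker tree clustering system; the function name `assign_cluster_ids`, the one-vector-per-segment representation, the Euclidean distance, and the `threshold` parameter are assumptions introduced for the example.

```python
from math import dist
from typing import Dict, List, Tuple


def assign_cluster_ids(segments: List[Tuple[str, List[float]]],
                       threshold: float) -> List[Tuple[str, int]]:
    """Pair each (general_label, features) segment with a cluster identifier.

    A segment joins the nearest existing cluster when that cluster's
    representative lies within threshold; otherwise it starts a new cluster.
    The representative is simply the first segment placed in the cluster.
    """
    representatives: Dict[int, List[float]] = {}
    labelled: List[Tuple[str, int]] = []
    for general_label, features in segments:
        best_id = None
        best_distance = threshold
        for cluster_id, representative in representatives.items():
            distance = dist(representative, features)
            if distance <= best_distance:
                best_id, best_distance = cluster_id, distance
        if best_id is None:
            # No cluster is close enough: treat this as a new speaker.
            best_id = len(representatives) + 1
            representatives[best_id] = list(features)
        labelled.append((general_label, best_id))
    return labelled


# Illustrative segments with made-up feature vectors.
example_segments = [
    ("unenrolled male", [1.0, 1.0]),
    ("unenrolled female", [5.0, 5.0]),
    ("unenrolled male", [1.1, 0.9]),
]
example_result = assign_cluster_ids(example_segments, threshold=0.5)
```

Under these assumptions, the first and third example segments receive the same cluster identifier because their feature vectors lie within the threshold of one another, while the second segment starts a cluster of its own.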