Computer systems are currently in wide use. Some such computer systems receive audio input signals and perform speech processing to generate a speech processing result.
By way of example, some speech processing systems include speech recognition systems that receive an audio signal and, in general, recognize speech in the audio signal and transcribe the speech into text. They can also include audio indexing systems that receive audio signals and index various characteristics of the signal, such as speaker identity, subject matter, emotion, etc. The speech systems can also include speech understanding (or natural language understanding) systems that receive an audio signal, identify the speech in the signal, and identify an interpretation of the content of that speech. The speech systems can also include speaker recognition systems. Such systems receive an audio input stream and identify the various speakers who are speaking in the audio stream. Another function often performed is speaker segmentation and tracking, also known as speaker diarization. Speaker diarization is the process of partitioning an input audio stream into homogeneous segments according to speaker identity. It uses a combination of speaker segmentation and speaker clustering. Speaker segmentation finds speaker change points in the audio stream, and speaker clustering groups together speech segments based on speaker characteristics.
For a variety of purposes, audio streams containing multiple speakers are often partitioned into segments containing only a single speaker, and non-contiguous segments coming from the same speaker are co-indexed. Speaker recognition systems are used to match a speaker-homogeneous section of audio against a speaker model. Audio indexing systems enable retrieval of portions of a meeting recording (or other multiple-speaker recording) by speaker identity. Speech recognition systems can be adapted to characteristics of the specific speaker using this information. Automatic transcription systems can use this information to attribute certain portions of the transcript to the proper speakers, and speech understanding systems can interpret the meaning of an utterance based upon the identity of the speaker who made the utterance.
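By way of illustration only, the partitioning and co-indexing described above can be sketched as follows. The sketch assumes that the speaker change points and the per-segment cluster labels have already been produced by some segmentation and clustering stage. The function names `build_segments` and `co_index`, the labels such as `spk_A`, and the time values are hypothetical and are chosen only for this example.

```python
from collections import defaultdict


def build_segments(change_points, labels, total_duration):
    """Turn change points (in seconds) and one label per segment
    into a list of (start, end, speaker) tuples."""
    boundaries = [0.0] + sorted(change_points) + [total_duration]
    if len(labels) != len(boundaries) - 1:
        raise ValueError("need exactly one label per segment")
    return [
        (boundaries[i], boundaries[i + 1], labels[i])
        for i in range(len(labels))
    ]


def co_index(segments):
    """Group segments by speaker label so that non-contiguous segments
    from the same speaker can be retrieved together."""
    index = defaultdict(list)
    for start, end, speaker in segments:
        index[speaker].append((start, end))
    return dict(index)


# Hypothetical 20-second recording with three change points.
segments = build_segments(
    change_points=[4.0, 9.5, 15.0],
    labels=["spk_A", "spk_B", "spk_A", "spk_C"],
    total_duration=20.0,
)
speaker_index = co_index(segments)
print(speaker_index["spk_A"])  # [(0.0, 4.0), (9.5, 15.0)]
```

An index of this kind is one simple way that a downstream system could retrieve all portions of a recording attributed to a given speaker, or attribute portions of a transcript to that speaker.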
In performing these types of speech processing tasks, speech systems must accommodate a relatively high degree of variability within the speech of a given speaker. In addition, the speech signal can often be distorted by extrinsic factors, such as background noise, reverberation, and other room acoustics. This can add to the difficulty of comparing audio samples to assess speaker identity.
Current speaker diarization systems extract a fixed, human-designed set of features (typically Mel cepstrum features, i.e., MFCC features, or the like) from the audio stream, train Gaussian mixture models for segments of the audio, and then cluster the segments according to the similarity of their associated Gaussian distributions. As a result, speaker similarity is measured indirectly, based on the similarity of the underlying, predetermined features.
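By way of illustration only, the following is a simplified sketch of this kind of feature-based clustering. It is not any particular system's implementation, and it makes several assumptions that are not part of the description above: each segment is modeled with a single diagonal-covariance Gaussian rather than a full Gaussian mixture model, similarity is measured with a symmetric Kullback-Leibler divergence, segments are merged greedily until the closest pair is farther apart than an arbitrary threshold, and synthetic random vectors stand in for MFCC features. All function names and parameter values are hypothetical.

```python
import random


def fit_diag_gaussian(frames):
    """Estimate per-dimension mean and variance of a set of feature frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [
        max(sum((f[d] - mean[d]) ** 2 for f in frames) / n, 1e-6)
        for d in range(dim)
    ]
    return mean, var


def symmetric_kl(g1, g2):
    """KL(g1||g2) + KL(g2||g1) for two diagonal Gaussians."""
    (m1, v1), (m2, v2) = g1, g2
    total = 0.0
    for a, va, b, vb in zip(m1, v1, m2, v2):
        total += 0.5 * (va / vb + vb / va - 2.0)
        total += 0.5 * (a - b) ** 2 * (1.0 / va + 1.0 / vb)
    return total


def cluster_segments(segment_frames, threshold):
    """Greedy agglomerative clustering; returns one cluster label per segment."""
    clusters = [
        {"members": [i], "frames": list(frames)}
        for i, frames in enumerate(segment_frames)
    ]
    while len(clusters) > 1:
        models = [fit_diag_gaussian(c["frames"]) for c in clusters]
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = symmetric_kl(models[i], models[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are too dissimilar to merge
        _, i, j = best
        clusters[i]["members"] += clusters[j]["members"]
        clusters[i]["frames"] += clusters[j]["frames"]
        del clusters[j]
    labels = [None] * len(segment_frames)
    for label, cluster in enumerate(clusters):
        for seg in cluster["members"]:
            labels[seg] = label
    return labels


def synthetic_segment(rng, center, n_frames=200, dim=3):
    """Stand-in for real features: Gaussian noise around a per-speaker center."""
    return [[rng.gauss(center, 1.0) for _ in range(dim)] for _ in range(n_frames)]


rng = random.Random(0)
# Four segments alternating between two synthetic "speakers".
segment_frames = [synthetic_segment(rng, c) for c in (0.0, 5.0, 0.0, 5.0)]
labels = cluster_segments(segment_frames, threshold=5.0)
print(labels)  # [0, 1, 0, 1]
```

In this sketch, two segments are grouped only because their predetermined feature statistics are close, which illustrates why speaker similarity is measured indirectly in such an approach.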
The discussion above is merely provided for general background information and is not intended to be used as an aid in determining the scope of the claimed subject matter.