1. The Field of the Invention
The present invention relates to systems and methods for segmenting multi-speaker speech or audio data by speaker. More particularly, the present invention relates to systems and methods for unsupervised segmentation of telephone conversations by speaker.
2. Background and Relevant Art
The segmentation of multi-speaker speech or audio data by speaker has received considerable attention in recent years. One goal of speaker segmentation is to identify the segments of the speech data that correspond to each speaker. Speaker segmentation can be useful in automatic speech recognition (ASR) systems for a variety of reasons. For example, speaker segmentation is used in training natural speech automatic call classification systems.
In call classification systems, the multi-speaker speech data usually includes a telephone conversation between two speakers, and speaker segmentation is used to identify the segments of the speech data that correspond to each speaker. For example, when a customer calls a customer representative at a call center, speaker segmentation can be used in conjunction with the ASR system to identify the customer's request from the speech data. In other words, speaker segmentation identifies the segments that correspond to the customer, and the ASR system can then recognize the customer's request within those segments. Alternatively, the segments can be used for training purposes to find customer requests in conversations or to adapt ASR models and language understanding models in multi-speaker speech.
Speaker segmentation of multi-speaker speech data can be either supervised or unsupervised. In supervised speaker segmentation, pre-existing labeled models are used to segment the multi-speaker speech data. Unsupervised segmentation is considerably more difficult than supervised segmentation because the multi-speaker speech data is segmented without the benefit of pre-existing labeled models or prior information. As a result, unsupervised segmentation of multi-speaker speech data typically performs worse than supervised segmentation.
In addition to lacking models or other information to help segment the speech data by speaker, unsupervised segmentation of speech data faces several obstacles that complicate the task of separating the segments of one speaker from the segments of another speaker. For example, multi-speaker speech data typically includes several short segments. Short segments are difficult to analyze because of the inherent instability of short analysis windows. In addition, more than one speaker may be talking at the same time in multi-speaker speech data, so that a segment may be contaminated with the speech of another speaker.