The present invention relates to speaker tracking. In particular, the present invention relates to speaker change detection.
Speaker recognition, which involves associating speech with one or more known or unknown speakers, has been an active area of research in recent years. In casual conversation and news broadcasting, it can be difficult to detect when speech from one speaker ends and speech from a different speaker begins. Accurately detecting speaker changes improves the performance of several functions, including conference and meeting indexing, audio/video retrieval or browsing, and speaker-specific speech recognition model updates.
One technique for identifying speaker change points is to apply a section of speech to a set of speaker models to determine which speaker model the speech best matches. Although this can be effective when well-developed speaker models are available, the method is less effective when the identity of the speakers is unknown or the models are not well developed.
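By way of illustration only, the model-matching technique described above can be sketched as follows. The sketch assumes that each known speaker is represented by a single diagonal-covariance Gaussian over feature frames and that the speech has already been divided into fixed segments; the model form, the names `MODELS`, `SEGMENTS`, `best_speaker` and `change_points`, and the toy feature values are all hypothetical and are not taken from any particular prior-art system.

```python
import math

def log_likelihood(frames, model):
    """Average per-frame log-likelihood under a diagonal Gaussian (assumed model form)."""
    mean, var = model
    total = 0.0
    for frame in frames:
        for x, m, v in zip(frame, mean, var):
            total += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return total / len(frames)

def best_speaker(frames, models):
    """Return the name of the speaker model that the segment matches best."""
    return max(models, key=lambda name: log_likelihood(frames, models[name]))

def change_points(segments, models):
    """Label each segment, then report indices where the label differs from the previous one."""
    labels = [best_speaker(seg, models) for seg in segments]
    changes = [i for i in range(1, len(labels)) if labels[i] != labels[i - 1]]
    return labels, changes

# Hypothetical pre-trained models: (mean vector, variance vector) per speaker.
MODELS = {
    "speaker_a": ([0.0, 0.0], [1.0, 1.0]),
    "speaker_b": ([3.0, 3.0], [1.0, 1.0]),
}

# Hypothetical two-dimensional feature frames, grouped into three segments.
SEGMENTS = [
    [[0.1, -0.2], [0.3, 0.1]],
    [[-0.1, 0.2], [0.0, 0.4]],
    [[2.9, 3.1], [3.2, 2.8]],
]

labels, changes = change_points(SEGMENTS, MODELS)
print(labels)   # ['speaker_a', 'speaker_a', 'speaker_b']
print(changes)  # [2]
```

The sketch depends entirely on `MODELS` being available and accurate before any speech is processed, which is the limitation noted above for cases in which the speakers are unknown.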
In situations in which the identity of the speakers is unknown, no training data is available from which to build an accurate speaker model a priori. Because of this, it is difficult to detect speaker changes.
In the prior art, this problem was addressed by attempting to build speaker models from the same speech signal that was being evaluated for speaker change points. However, these systems have relied on iterative algorithms, such as the expectation-maximization (EM) algorithm, to train the speaker models. Because of the iterative nature of these algorithms, the speaker models could not be generated and used for speaker change detection and speaker tracking in real time.
Thus, a method is needed that allows for real-time speaker change detection and speaker tracking without prior knowledge of the identity or the number of the speakers.