This invention relates to speaker diarization. In particular, the invention relates to providing a confidence measure for speaker diarization.
Speaker diarization aims at segmenting a conversation into homogeneous segments, in each of which only one speaker is present, and then clustering the segments based on speaker identity. In other words, speaker diarization answers the “Who spoke when?” question for a given audio signal. State-of-the-art algorithms find the speaker turn points and cluster the segments.
Speaker diarization is an important component in many speech applications such as two-wire telephony audio analytics, meeting and lecture summarization, and broadcast processing and retrieval.
A speaker diarization system usually consists of a speech/non-speech segmentation component, a speaker segmentation component, and a speaker clustering component.
Speaker segmentation is the process of identifying change points in an audio input where the identity of the speaker changes. Speaker segmentation is usually done by modeling a speaker with a multivariate normal distribution or with a Gaussian mixture model (GMM) and assuming frame independence. Deciding whether two consecutive segments share the same speaker identity is usually done by applying a Bayesian-motivated approach such as the Generalized Likelihood Ratio (GLR) or the Bayesian Information Criterion (BIC).
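By way of illustration only, the BIC-based decision described above can be sketched as follows. The sketch assumes that each segment is modeled by a single full-covariance multivariate normal distribution and that frames are independent. The function names `delta_bic` and `_logdet_ml_cov` and the parameter `penalty_weight` are hypothetical and are not taken from any particular system.

```python
import numpy as np


def _logdet_ml_cov(frames):
    """Log-determinant of the maximum-likelihood covariance of the frames."""
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    _, logdet = np.linalg.slogdet(cov)
    return logdet


def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between two adjacent segments (rows are feature frames).

    Compares "two Gaussians, one per segment" against "one Gaussian for
    both". A positive value favors a speaker change between the segments.
    """
    seg_a = np.asarray(seg_a, dtype=float)
    seg_b = np.asarray(seg_b, dtype=float)
    n_a, dim = seg_a.shape
    n_b = seg_b.shape[0]
    n = n_a + n_b
    merged = np.vstack([seg_a, seg_b])

    # Generalized likelihood ratio term for Gaussian models.
    glr = 0.5 * (
        n * _logdet_ml_cov(merged)
        - n_a * _logdet_ml_cov(seg_a)
        - n_b * _logdet_ml_cov(seg_b)
    )
    # Penalty for the extra mean and covariance parameters of the second model.
    n_params = dim + dim * (dim + 1) / 2
    penalty = 0.5 * penalty_weight * n_params * np.log(n)
    return float(glr - penalty)
```

In this sketch a change point would be hypothesized where `delta_bic` is positive; the penalty weight is a tuning parameter whose value is an assumption here.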
Speaker clustering is the process of clustering segments according to speaker identity. Speaker clustering is usually based on either the BIC or the Cross Likelihood Ratio (CLR).
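As a non-limiting sketch, a CLR score between two clusters may be computed as shown below. For brevity the sketch assumes that each cluster and the background model are single Gaussians rather than GMMs; the helper names `fit_gaussian`, `avg_loglik`, and `cross_likelihood_ratio` are hypothetical.

```python
import numpy as np


def fit_gaussian(frames):
    """Fit a single Gaussian (mean, ML covariance) to the rows of frames."""
    frames = np.asarray(frames, dtype=float)
    mean = frames.mean(axis=0)
    cov = np.atleast_2d(np.cov(frames, rowvar=False, bias=True))
    return mean, cov


def avg_loglik(frames, model):
    """Average per-frame log-likelihood of frames under a Gaussian model."""
    mean, cov = model
    frames = np.asarray(frames, dtype=float)
    dim = frames.shape[1]
    diff = frames - mean
    _, logdet = np.linalg.slogdet(cov)
    maha = np.einsum("ij,ij->i", diff @ np.linalg.inv(cov), diff)
    return float(np.mean(-0.5 * (dim * np.log(2 * np.pi) + logdet + maha)))


def cross_likelihood_ratio(x_i, x_j, background):
    """CLR: how well each cluster's model explains the other cluster's
    frames, relative to the background model. Higher means more similar."""
    model_i, model_j = fit_gaussian(x_i), fit_gaussian(x_j)
    return (
        avg_loglik(x_i, model_j) - avg_loglik(x_i, background)
        + avg_loglik(x_j, model_i) - avg_loglik(x_j, background)
    )
```

Under these assumptions, clusters with a higher CLR would be better candidates for merging than clusters with a lower CLR.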
Intra-speaker variability is the variation of characteristics in a single speaker's output. Compensating for intra-speaker variability can enable more accurate speaker segmentation and clustering.
Confidence measures for speaker diarization generally use the segmentation output as input to the confidence computation. Such known methods include: the Bayesian Information Criterion (BIC) measure of the segmentation accuracy, the Kullback-Leibler divergence measure of the distance between the distributions of the two segmented speakers, and the convergence rate of the segmentation algorithm.
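For illustration, the Kullback-Leibler divergence between the distributions of two segmented speakers has a closed form when each speaker is modeled by a single multivariate normal distribution, which is an assumption made here for simplicity. The function names `kl_gaussian` and `symmetric_kl` are hypothetical.

```python
import numpy as np


def kl_gaussian(mean_p, cov_p, mean_q, cov_q):
    """Closed-form KL(P || Q) between two multivariate normal distributions."""
    mean_p = np.asarray(mean_p, dtype=float)
    mean_q = np.asarray(mean_q, dtype=float)
    cov_p = np.atleast_2d(np.asarray(cov_p, dtype=float))
    cov_q = np.atleast_2d(np.asarray(cov_q, dtype=float))
    dim = mean_p.shape[0]

    cov_q_inv = np.linalg.inv(cov_q)
    diff = mean_q - mean_p
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)

    return float(
        0.5
        * (
            np.trace(cov_q_inv @ cov_p)
            + diff @ cov_q_inv @ diff
            - dim
            + logdet_q
            - logdet_p
        )
    )


def symmetric_kl(mean_p, cov_p, mean_q, cov_q):
    """Symmetrized divergence, so the result does not depend on speaker order."""
    return kl_gaussian(mean_p, cov_p, mean_q, cov_q) + kl_gaussian(
        mean_q, cov_q, mean_p, cov_p
    )
```

A larger divergence indicates speaker models that are further apart. How such a distance would be mapped to a confidence value is not specified here and would be a design choice.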