Speaker diarization is the process of segmenting an audio stream or audio document into speaker-homogeneous segments and clustering those segments according to speaker identity.
Speaker diarization is a key component of audio archive indexing and of transcription systems. It can also be applied to other tasks, such as processing telephone conversations and meetings, broadcast processing and retrieval, and 2-wire telephony processing.
A speaker diarization system usually consists of a speech/non-speech segmentation component, a speaker segmentation component, and a speaker clustering component.
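As an illustration of the first of these components, the following is a minimal sketch of a speech/non-speech decision based on frame energy. The energy threshold, the frame and hop sizes, and the function names `speech_mask` and `mask_to_segments` are assumptions made for this example only; practical systems typically use more robust detectors.

```python
import numpy as np

def speech_mask(signal, frame_len=400, hop=160, threshold_db=-35.0):
    """Mark each frame as speech (True) or non-speech (False).

    Assumption for this sketch: a frame counts as speech when its mean
    energy is within `threshold_db` of the loudest frame in the signal.
    """
    signal = np.asarray(signal, dtype=float)
    if len(signal) < frame_len:
        return np.zeros(0, dtype=bool)
    n_frames = 1 + (len(signal) - frame_len) // hop
    energy = np.array([
        np.mean(signal[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    level_db = 10.0 * np.log10(energy + 1e-12)  # small offset avoids log(0)
    return (level_db - level_db.max()) > threshold_db

def mask_to_segments(mask):
    """Convert a boolean frame mask into (start_frame, end_frame) runs."""
    segments = []
    start = None
    for i, is_speech in enumerate(mask):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            segments.append((start, i))
            start = None
    if start is not None:
        segments.append((start, len(mask)))
    return segments
```

The speech runs returned by `mask_to_segments` would then be passed to the speaker segmentation and speaker clustering components.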
In known systems, an acoustic feature vector is extracted for each frame of input audio data. The acoustic feature vector is produced using standard signal processing techniques that represent the spectral character of speech.
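The per-frame extraction described above can be sketched as follows. This is a simplified stand-in, not the feature set of any particular system: it computes log energies in equal-width frequency bands, whereas known systems commonly use cepstral features. The function name `frame_features` and the default frame length, hop, and band count are assumptions chosen for illustration.

```python
import numpy as np

def frame_features(signal, sample_rate, frame_ms=25.0, hop_ms=10.0, n_bands=20):
    """Return one log band-energy vector per frame, shape (n_frames, n_bands)."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    if len(signal) < frame_len:
        return np.empty((0, n_bands))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)  # taper to reduce spectral leakage
    feats = np.empty((n_frames, n_bands))
    for i in range(n_frames):
        frame = signal[i * hop:i * hop + frame_len] * window
        power = np.abs(np.fft.rfft(frame)) ** 2       # power spectrum
        bands = np.array_split(power, n_bands)        # equal-width bands
        band_energy = np.array([b.mean() for b in bands])
        feats[i] = np.log(band_energy + 1e-10)        # offset avoids log(0)
    return feats
```

Each row of the returned array plays the role of the acoustic feature vector for one frame of input audio data.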
Speaker segmentation is the process of identifying change points in an audio input where the identity of the speaker changes. Segment clustering is the process of clustering segments according to speakers' identities. Speaker segmentation algorithms and segment clustering algorithms process the acoustic feature vectors of the frames of input audio data.
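One common way to locate speaker change points in a sequence of feature vectors is to compare the statistics of two adjacent windows with the Bayesian Information Criterion (BIC). The sketch below is one such variant under stated assumptions: each window is modelled by a single diagonal-covariance Gaussian, and the window size, step, and names `delta_bic` and `find_change_points` are illustrative rather than taken from any specific system.

```python
import numpy as np

def delta_bic(left, right, penalty_weight=1.0):
    """Positive values favour modelling the two windows separately."""
    both = np.vstack([left, right])
    n, d = both.shape

    def log_det(x):
        # log-determinant of a diagonal covariance estimate
        return np.sum(np.log(x.var(axis=0) + 1e-8))

    gain = 0.5 * (n * log_det(both)
                  - len(left) * log_det(left)
                  - len(right) * log_det(right))
    # Splitting adds d means and d variances as extra parameters.
    penalty = 0.5 * penalty_weight * (2 * d) * np.log(n)
    return gain - penalty

def find_change_points(features, window=50, step=10, penalty_weight=1.0):
    """Return frame indices where a speaker change is hypothesised."""
    features = np.asarray(features, dtype=float)
    positions = list(range(window, len(features) - window + 1, step))
    scores = [delta_bic(features[t - window:t], features[t:t + window],
                        penalty_weight) for t in positions]
    changes = []
    for t, s in zip(positions, scores):
        if s <= 0:
            continue
        # Keep t only if it is the best-scoring position within one window.
        nearby = [s2 for t2, s2 in zip(positions, scores)
                  if abs(t2 - t) <= window]
        if s >= max(nearby):
            changes.append(t)
    return changes
```

The segments between consecutive change points are the units that segment clustering then groups by speaker.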
The current approach to speaker diarization is a ‘blind’ algorithm that assumes no prior knowledge about the speakers and performs the segmentation and clustering using the statistical distribution of acoustic spectral features.
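To make the 'blind' character concrete, the sketch below clusters segments without being told how many speakers there are or what they sound like. It is an illustrative agglomerative scheme, not a description of any specific system: segments are merged greedily while a BIC-based score indicates that two groups are better explained by a single diagonal-covariance Gaussian. The names `delta_bic` and `cluster_segments` and the default penalty weight are assumptions.

```python
import numpy as np

def delta_bic(a, b, penalty_weight=1.0):
    """Negative values favour merging the two groups of feature vectors."""
    both = np.vstack([a, b])
    n, d = both.shape

    def log_det(x):
        # log-determinant of a diagonal covariance estimate
        return np.sum(np.log(x.var(axis=0) + 1e-8))

    gain = 0.5 * (n * log_det(both) - len(a) * log_det(a) - len(b) * log_det(b))
    return gain - 0.5 * penalty_weight * (2 * d) * np.log(n)

def cluster_segments(segments, penalty_weight=1.0):
    """Assign a speaker label to each segment; the speaker count is not given."""
    data = [np.asarray(s, dtype=float) for s in segments]
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                score = delta_bic(data[i], data[j], penalty_weight)
                if best is None or score < best[0]:
                    best = (score, i, j)
        score, i, j = best
        if score >= 0:
            break  # no remaining pair looks like the same speaker
        clusters[i] += clusters[j]
        data[i] = np.vstack([data[i], data[j]])
        del clusters[j], data[j]
    labels = np.empty(len(segments), dtype=int)
    for label, members in enumerate(clusters):
        labels[members] = label
    return labels
```

The number of distinct labels returned is the estimated number of speakers, obtained purely from the statistics of the feature vectors.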