Speaker diarization systems are increasingly important in overcoming key challenges faced by automatic meeting transcription systems. These systems aim to segment an audio signal into homogeneous sections, each with one active audio source, and thereby answer the question “who spoke when?” Speaker diarization provides important information in multiple applications, such as speaker indexing and rich transcription of multi-speaker audio streams. The audio streams may be generated in multiple scenarios, such as call centers, broadcast news, or meetings. Often the positions of the audio sources and/or the microphones are unknown. Additionally, recordings may be distorted by noise, reverberation, or non-speech acoustic events (e.g., music), degrading the diarization performance (see “Speaker diarization: a review of recent research” by X. Anguera).
Current algorithms can utilize spatial information only when multi-microphone recordings are available. This information is usually related to a Time Delay of Arrival (TDOA), which represents the difference in arrival time of the same signal at two different microphones. In a single-microphone scenario this feature cannot be computed, and common speech features such as Mel-Frequency Cepstral Coefficients (MFCC) and/or Perceptual Linear Predictive (PLP) coefficients may be used for diarization instead. Furthermore, these current algorithms use multi-microphone information from arrayed systems in which the location of each microphone relative to a point of reference is known.
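As an illustration of the TDOA feature described above, the following is a minimal sketch that estimates the delay between two microphone signals by brute-force cross-correlation. It is not taken from any particular diarization system; practical systems often use more robust estimators such as generalized cross-correlation with phase transform (GCC-PHAT). The function name `estimate_tdoa`, its parameters, and the synthetic test signal are all hypothetical and chosen only for this example.

```python
import random


def estimate_tdoa(mic_a, mic_b, sample_rate, max_lag):
    """Estimate how much mic_b lags mic_a, in seconds.

    Brute-force cross-correlation: try every integer lag in
    [-max_lag, max_lag] and keep the lag with the highest score.
    A positive result means the signal reaches mic_b after mic_a.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n in range(len(mic_a)):
            m = n + lag
            if 0 <= m < len(mic_b):
                score += mic_a[n] * mic_b[m]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag / sample_rate


# Synthetic example (assumed values): white noise that reaches
# microphone B eight samples later than microphone A.
random.seed(0)
sample_rate = 16000  # Hz
delay = 8            # samples
source = [random.gauss(0.0, 1.0) for _ in range(2000)]
mic_a = source
mic_b = [0.0] * delay + source[:-delay]

tdoa = estimate_tdoa(mic_a, mic_b, sample_rate, max_lag=32)
print(f"estimated TDOA: {tdoa * 1e3:.3f} ms")
```

In this noise-free synthetic case the estimate should match the 8-sample delay (0.5 ms at 16 kHz). With real recordings, ambient noise and reverberation can flatten or shift the correlation peak.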
Although TDOA has previously been utilized in a number of different fields, this parameter is not used for microphone selection because its estimates may be excessively noisy as a result of ambient noise, reverberation, and/or head movements of a human speaker. Furthermore, microphones located a long distance from an audio source often yield unreliable TDOA estimates.