Diarization systems can be grouped by how they consume incoming audio. An offline system waits until the end of the audio stream before processing it and finding the homogeneous segments in which different speakers are active. A second group of systems, often called online systems, processes the audio as it arrives, without any knowledge of future events. The online approach is particularly challenging because a causal system cannot anticipate audio events such as on-hold music, prerecorded speech, synthesized speech, tones, and multiple speakers. These constraints may reduce overall diarization performance.
A standard approach is to perform speaker detection and clustering causally, using various distance measures and methods. This “greedy” approach tends to overestimate the number of speakers and quickly fills the available speaker slots by assigning them to irrelevant acoustic events (such as those described above). For example, the greedy approach struggles when a conversation starts with a computer voice whispering into an agent's ear a short summary of what happened in an Interactive Voice Response (IVR) system before the call is transferred to the agent. Such errors are carried forward through the identification process and are compounded in calls with multiple callers.
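The failure mode described above can be illustrated with a minimal sketch of a greedy causal clusterer. Everything in it is an assumption made for illustration: the function names, the cosine distance, the threshold of 0.3, the running-mean centroid update, and the toy three-dimensional "embeddings" are hypothetical and do not describe any particular system.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def greedy_online_diarize(embeddings, threshold=0.3, max_speakers=2):
    """Label each segment as it arrives, never revisiting past decisions.

    Hypothetical sketch: a segment joins the nearest centroid if it is
    within `threshold`; otherwise it opens a new slot while one is free;
    otherwise it is forced into the nearest existing slot.
    """
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        distances = [cosine_distance(emb, c) for c in centroids]
        best = min(range(len(centroids)), key=distances.__getitem__, default=None)
        if best is not None and distances[best] <= threshold:
            label = best
        elif len(centroids) < max_speakers:
            centroids.append(list(emb))
            counts.append(0)
            label = len(centroids) - 1
        else:
            label = best  # all slots taken: forced, possibly wrong, assignment
        # Update the chosen centroid as a running mean of its members.
        counts[label] += 1
        n = counts[label]
        centroids[label] = [c + (e - c) / n for c, e in zip(centroids[label], emb)]
        labels.append(label)
    return labels


# Toy stream: an IVR summary prompt, then an agent and a caller alternating.
ivr, agent, caller = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.2, 0.0, 1.0]
stream = [ivr, ivr, agent, caller, agent, caller]

two_slots = greedy_online_diarize(stream, max_speakers=2)
three_slots = greedy_online_diarize(stream, max_speakers=3)
print(two_slots)    # [0, 0, 1, 0, 1, 0]
print(three_slots)  # [0, 0, 1, 2, 1, 2]
```

With two slots, the IVR prompt takes the first slot and the agent the second, so the caller is forced into the slot opened by the prompt and stays there for the rest of the call. Raising the slot count to three avoids the merge in this toy stream, but only by spending a speaker slot on a non-speaker event.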