The present disclosure is related to the field of automated transcription. More specifically, the present disclosure is related to diarization of audio data with an arbitrary number of speakers.
Speech transcription and speech analytics of audio data may be enhanced by a process of diarization wherein audio data that contains multiple speakers is separated into segments of audio data, each typically attributed to a single speaker. While speaker separation in diarization facilitates later transcription and/or speech analytics, the identification of, or discrimination between, the separated speakers can further facilitate these processes by enabling the association of speaker-specific context and information in later transcription and speech analytics processes.
Previous diarization solutions, for example those applied to a recorded telephone conversation in a customer service application, assume two speakers. The two speakers may exemplarily be a customer and an agent. The two-speaker assumption greatly simplifies the blind-diarization task. However, many calls may have a more complex structure. Some calls may feature only a single speaker, exemplarily a recorded message or an IVR message. Other calls may contain additional “speech-like” segments, for example background talk. Still other examples of complex calls include calls with three or more speakers, such as conference calls, or calls in which one or more speakers are replaced by another speaker.
Therefore, embodiments as disclosed herein achieve a blind-diarization algorithm that does not assume any prior knowledge of the number of speakers and that performs robustly on calls with an arbitrary number of speakers.