Speech transcription and speech analytics of audio data may be enhanced by a process of diarization, wherein audio data that contains multiple speakers is separated into segments of audio data that are each typically attributed to a single speaker. While speaker separation in diarization facilitates later transcription and/or speech analytics, identifying or discriminating between the separated speakers can further facilitate these processes by enabling context and information specific to an identified speaker to be applied in later transcription and speech analytics processes.
Previous diarization solutions, for example for a recorded telephone conversation in a customer service application, assume two speakers. The two speakers may exemplarily be a customer and an agent (i.e. a customer-service representative) in a call center. The two-speaker assumption greatly simplifies the blind-diarization task. However, many calls may have a more complex structure. Some calls may feature only a single speaker, exemplarily a recorded message or an IVR message. Other calls may contain additional “speech-like” segments, for example background talk. Still other examples of complex calls include calls with three or more speakers, such as conference calls, and calls in which one or more speakers are replaced by another speaker.
Prior blind diarization solutions have relied on a first-pass filtering step that may fail to accurately filter out non-speech segments, e.g. noise or music, resulting in too many speakers being created. Additionally, prior blind diarization processes have relied on classification being performed solely on a per-frame basis and thus may fail to detect short utterances that are interleaved with longer utterances of another speaker.
Therefore, a blind-diarization algorithm that does not assume any prior knowledge of the number of speakers, that does not rely solely on per-frame classification, and that performs robustly on calls with an arbitrary number of speakers is achieved in embodiments as disclosed herein.
Building an acoustic signature for a common speaker, given a set of recorded sessions (telephone calls, recordings from a meeting room, etc.), can also be a problem; namely, the problem of constructing a statistical model that can be used to detect the presence of that speaker in other recorded sessions. In a call-center environment, such a common speaker may be a customer service representative, for whom there are typically hundreds of available sessions, or a customer making repeated calls to the call center. In the case of recorded material from meeting rooms, it may be of interest to identify a specific person participating in some of these meetings.
Given recorded audio from all sessions along with markers that indicate the presence of a common speaker within each session (the start time and end time of each utterance of that speaker), the solution for creating an acoustic signature for the speaker can be quite straightforward. For example, it is possible to extract acoustic features from all relevant utterances and construct a statistical model that can be used as an acoustic label for the speaker. This can be done using simple classifiers such as a Gaussian mixture model (GMM), or more advanced techniques such as I-vectors.
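The straightforward approach described above can be illustrated with a minimal sketch. It assumes that acoustic features have already been extracted as one feature vector per frame at a fixed, hypothetical rate of 100 frames per second, and it fits a diagonal-covariance GMM with plain expectation-maximization. The function names, the frame rate, and the choice of diagonal covariances are assumptions made for illustration only and are not taken from the disclosure.

```python
import math
import random


def select_marked_frames(frames, markers, frame_rate=100):
    """Keep only the frames inside the (start_sec, end_sec) speaker markers."""
    selected = []
    for start, end in markers:
        first = int(round(start * frame_rate))
        last = int(round(end * frame_rate))
        selected.extend(frames[first:last])
    return selected


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )


def component_logs(x, gmm):
    """Log of (weight * density) for every mixture component."""
    weights, means, variances = gmm
    return [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]


def log_sum_exp(values):
    """Numerically stable log(sum(exp(values)))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def fit_gmm(frames, k, iters=20, seed=0, var_floor=1e-3):
    """Fit a k-component diagonal GMM to the frames with plain EM."""
    rng = random.Random(seed)
    dim = len(frames[0])
    # Initialise means from randomly chosen frames, unit variances, equal weights.
    gmm = ([1.0 / k] * k,
           [list(f) for f in rng.sample(frames, k)],
           [[1.0] * dim for _ in range(k)])
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each frame.
        resp = []
        for x in frames:
            logs = component_logs(x, gmm)
            norm = log_sum_exp(logs)
            resp.append([math.exp(value - norm) for value in logs])
        # M-step: re-estimate weights, means and floored variances.
        weights, means, variances = [], [], []
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-10
            mean = [sum(r[j] * x[d] for r, x in zip(resp, frames)) / nj
                    for d in range(dim)]
            var = [max(var_floor,
                       sum(r[j] * (x[d] - mean[d]) ** 2
                           for r, x in zip(resp, frames)) / nj)
                   for d in range(dim)]
            weights.append(nj / len(frames))
            means.append(mean)
            variances.append(var)
        gmm = (weights, means, variances)
    return gmm


def average_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood, usable as a speaker-match score."""
    return sum(log_sum_exp(component_logs(x, gmm)) for x in frames) / len(frames)
```

In such a sketch, the frames selected by the markers from all sessions would be pooled and passed to `fit_gmm`, and the resulting model would then score the frames of a new session through `average_log_likelihood`, with a higher score suggesting the presence of the common speaker.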
However, storing and processing audio data from hundreds of recorded sessions may be very time-consuming and may pose a burden on the network if these sessions need to be collected from several servers to a single location.
Therefore, a method that creates an acoustic signature for a common speaker based only on statistical models of the speakers in each session is further disclosed herein.
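To illustrate why working from per-session statistical models can avoid moving audio between servers, the following sketch merges per-session models without access to the underlying frames. It is not the disclosed method. It assumes, purely for illustration, that each session summarizes the common speaker with a single diagonal Gaussian (frame count, mean, and variance per feature dimension) and that the model belonging to that speaker has already been identified in each session. The names `SessionModel`, `summarize`, and `merge_models` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SessionModel:
    """Compact per-session summary of one speaker's feature frames."""
    count: int
    mean: List[float]
    var: List[float]  # population variance per feature dimension


def summarize(frames: Sequence[Sequence[float]]) -> SessionModel:
    """Computed locally, on the server that holds the session audio."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return SessionModel(n, mean, var)


def merge_models(models: Sequence[SessionModel]) -> SessionModel:
    """Combine session summaries into one model using only their statistics."""
    total = sum(m.count for m in models)
    dim = len(models[0].mean)
    # Count-weighted average of the per-session means.
    mean = [sum(m.count * m.mean[d] for m in models) / total
            for d in range(dim)]
    # Pooled second moment E[x^2], from which the pooled variance follows.
    second = [sum(m.count * (m.var[d] + m.mean[d] ** 2) for m in models) / total
              for d in range(dim)]
    var = [second[d] - mean[d] ** 2 for d in range(dim)]
    return SessionModel(total, mean, var)
```

Under these assumptions, the merged mean and variance equal those that would be obtained from the pooled frames, so only a few numbers per session need to be transferred rather than the audio itself.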