Automatic speech recognition techniques make it possible to extract business insights from telephone conversations with an organization's customers. This data can improve the sales, customer success, customer support, marketing, and product functions (e.g., by helping them understand and hear the “voice of the customer”) by providing coaching to representatives of the organization (e.g., on desired behaviors), measuring compliance, and automatically generating data regarding market and product requirements. Such data can also be used for determining best practices by identifying winning patterns, for making the sales process more efficient by summarizing calls so that representatives need fewer sync meetings, and for guiding conversations in real time. Attributing utterances and words to the person who spoke them is useful for downstream analyses such as search, call visualization, identifying buying cues, extracting customer pain points, identifying good and bad sales behaviors, and extracting notes and tasks.
When a call is recorded as a single channel (mono), or when multiple speakers are co-located in the same room, identifying the speaker requires applying various algorithmic techniques. Previous technologies aim to split the call between different speakers, an approach termed “diarization”: for example, they determine that one voice on the call belongs to “speaker 1,” another voice to “speaker 2,” another voice to “speaker 3,” and so on. Such technologies may separate the voices without identifying who those speakers are. Other technologies use multi-channel recordings, in which each speaker in the conversation is on a separate communication channel and can be identified based on the channel assigned to that speaker. However, such technologies may not work with a single-channel recording.
Some technologies may identify the speakers, but they require each speaker to record a short voice sample, which is used to create a speaker fingerprint that later identifies that speaker. This requires active participation by the recorded user, which can hurt adoption rates and degrade the user experience.