Large organizations, such as commercial organizations, financial organizations or public safety organizations, conduct numerous interactions with customers, users, suppliers or other persons on a daily basis. Many of these interactions are vocal, or at least comprise a vocal component, such as an audio part of a video or face-to-face interaction. A significant part of these interactions takes place between a customer and a representative of the organization such as an agent. Many of the interactions are captured and often recorded.
The interactions convey large volumes of data, which may be of high significance to the organization. However, this data is not structured and therefore not easily accessible. Therefore, in order to get insight into the data conveyed by the interactions, audio analysis techniques need to be applied to the audio in order to extract the information.
The interactions and their content can be used for a multiplicity of purposes. One such purpose relates to quality monitoring for assessing the quality of the agent handling the interaction or another entity associated with the call center such as a product, the organization, or the like. Another usage of the interactions relates to analyzing the customer experience, for example whether the customer is happy with the product or service, is threatening to leave, has mentioned competitor names, or the like. Automated systems activate multiple tools as part of the analysis. Such tools may include voice recognition tools such as automatic speech recognition or word spotting, emotion analysis tools, and call flow analysis tools that consider, for example, interaction duration, hold time, number of transfers or the like. Different tools may be required for different analyses.
The sides of the interaction, e.g., the agent and the customer, may be recorded separately, i.e., on two separate audio signals, in which case it may be known in advance which signal represents the agent and which one represents the customer. In other cases the interactions may be recorded as summed, i.e., the two sides are recorded on one audio signal.
Some of the audio analysis tools are highly dependent on being activated on a single speaker signal. For example, activating an emotion detection tool on a summed audio signal is likely to provide erroneous results. Therefore, in order to activate these tools on summed audio it is required to separate the signal into two signals, each containing speech segments spoken by a single speaker only. Separated signals may contain non-continuous segments of the original interaction, due to speech of the other side, double talk, or the like.
In some embodiments, different analyses may be more relevant to one side of the interaction than to the other. For example, it may be more important to detect emotion on one speech signal than on another. As a further example, in a recorded conference call between four or five parties, verifying that certain buzzwords have been said may be a part of quality assurance that is relevant only to the sales agent's side.
Therefore, in such situations and when the audio is summed, in addition to separating the audio into two or more signals, it is also required to identify which signal represents which speaker, in order to activate relevant analysis tools for each speaker.
There is thus a need for a method for speaker source identification, which will segment a summed audio signal into separate signals if required, and answer the question of who is speaking and when.
Speaker Diarization is the process of partitioning an input audio stream into homogeneous segments according to the speaker identity. It can enhance the readability of an automatic speech transcription by structuring the audio stream into speaker turns and, when used together with speaker recognition systems, by providing the speaker's true identity.
Knowing when each speaker is talking in an audio or video recording can be useful in and of itself, but it is also an important processing step in many tasks. For example, in the field of rich transcription, speaker diarization is used both as a stand-alone application that attributes speaker regions to an audio or video file and as a preprocessing step for speech recognition. Using diarization for speech recognition enables speaker-attributed speech-to-text and can be used as the basis for different modes of adaptation, e.g., vocal tract length normalization (VTLN) and speaker-model adaptation. This task has therefore become central in the speech-research community.
In speaker diarization, one of the most popular methods is to use a Gaussian mixture model (GMM) to model each of the speakers, and to assign the corresponding frames to each speaker with the help of a Hidden Markov Model. There are two main kinds of clustering scenario. The first one is by far the most popular and is called bottom-up. The algorithm starts by splitting the full audio content into a succession of clusters and progressively tries to merge the redundant clusters in order to reach a situation where each cluster corresponds to a real speaker. The second clustering strategy is called top-down; it starts with one single cluster for all the audio data and tries to split it iteratively until reaching a number of clusters equal to the number of speakers.
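The bottom-up strategy described above can be illustrated with a minimal sketch. It makes several simplifying assumptions that are not taken from any of the cited systems: each cluster is summarized by the centroid of its feature frames rather than by a GMM, no Hidden Markov Model is used for frame assignment, the initial clusters are fixed-length chunks, and the number of speakers is assumed to be known. The function and parameter names are hypothetical.

```python
import math


def centroid(frames):
    """Mean feature vector of a list of equal-length frames."""
    dim = len(frames[0])
    return [sum(f[d] for f in frames) / len(frames) for d in range(dim)]


def bottom_up_clusters(features, initial_len, target_clusters):
    """Illustrative bottom-up (agglomerative) clustering.

    Assumption: a centroid distance stands in for the GMM-based
    similarity that a real diarization system would use.
    """
    # Step 1: over-segment the audio into a succession of short clusters.
    clusters = []
    for start in range(0, len(features), initial_len):
        end = min(start + initial_len, len(features))
        clusters.append({"frames": features[start:end], "spans": [(start, end)]})

    # Step 2: repeatedly merge the two most similar clusters until the
    # assumed number of speakers is reached.
    while len(clusters) > target_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = math.dist(centroid(clusters[a]["frames"]),
                              centroid(clusters[b]["frames"]))
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = {
            "frames": clusters[a]["frames"] + clusters[b]["frames"],
            "spans": sorted(clusters[a]["spans"] + clusters[b]["spans"]),
        }
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters
```

Each returned cluster lists the frame spans attributed to one presumed speaker. Note that every merge step recomputes a model for every remaining cluster pair, which is the kind of repeated estimation cost that grows with the number of initial clusters.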
Short-time spectral features are also frequently employed where short-term and long-term features are fused. Additionally, merging jitter and shimmer with prosodic and spectral features can be done to achieve a 20% relative diarization error rate (DER).
Current diarization systems choose different alternatives for detection of initial segment boundaries. “Compensation of Intra-speaker Variability in Speaker Diarization” as disclosed in US2011251843/U.S. Pat. No. 8,433,567 selects a fixed section length, whereas “Unsupervised Speaker Segmentation of Multi-speaker Speech Data” as disclosed in U.S. Pat. No. 7,930,179 utilizes the Bayesian information criterion (BIC) for detection of speaker changes. However, BIC is found to be inefficient for detection of short speaker turns, i.e., turns with durations shorter than 2-5 seconds.
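As an illustration of BIC-based speaker change detection in general, and not of the specific method of the cited patent, the following sketch scores every candidate split of an analysis window. It assumes a one-dimensional feature (for example, a single energy or pitch value per frame) modeled by a single Gaussian on each side of the split; the function names, the `margin` parameter and the variance floor are hypothetical choices made for this example.

```python
import math


def variance(xs):
    """Maximum-likelihood variance, floored to keep the logarithm finite."""
    mean = sum(xs) / len(xs)
    return max(sum((x - mean) ** 2 for x in xs) / len(xs), 1e-8)


def delta_bic(window, split, penalty_weight=1.0):
    """Delta-BIC for splitting `window` at index `split` (1-D features).

    A positive value favors two separate Gaussians, i.e. a speaker change.
    """
    left, right = window[:split], window[split:]
    n, n1, n2 = len(window), len(left), len(right)
    gain = 0.5 * (n * math.log(variance(window))
                  - n1 * math.log(variance(left))
                  - n2 * math.log(variance(right)))
    # Extra parameters of the two-model hypothesis: d + d(d+1)/2 = 2 for d = 1.
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty


def find_change_point(window, margin=5, penalty_weight=1.0):
    """Return the split with the highest positive delta-BIC, or None."""
    best_split, best_score = None, 0.0
    for split in range(margin, len(window) - margin + 1):
        score = delta_bic(window, split, penalty_weight)
        if score > best_score:
            best_split, best_score = split, score
    return best_split
```

The `margin` parameter reflects the limitation noted above: each side of a candidate split needs enough frames for a stable variance estimate, so turns shorter than the margin cannot be detected reliably.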
For example, “Unsupervised Speaker Segmentation of Multi-Speaker Speech Data” as disclosed in U.S. Pat. No. 7,930,179 and “Method of Speaker Clustering for Unknown Speakers in Conversational Audio Data” as disclosed in U.S. Pat. No. 5,598,507 both utilize a bottom-up approach. When features or information about expected speakers are accessible, the diarization problem becomes a speaker identification task. In US2013006635 “Method and System for Speaker Diarization”, pre-trained acoustic models are assumed to be accessible. US20120253811 “Speech Processing System and Method” compares segment parameters with stored speaker profiles.
Some of the previous methods are focused on specific domains. “Blind Diarization of Recorded Calls with Arbitrary Number of Speakers” as disclosed in US2015025887 focuses on calls and “Audio-Assisted Segmentation and Browsing of News Videos” as disclosed in US2004143434 focuses on broadcast news.
Existing diarization systems do not take into account prosodic parameters such as pitch, energy, and durations. Prosodic parameters contain information about speaker changes as well as speaker-related parameters for clustering. Furthermore, previous diarization systems employ top-down or bottom-up clustering approaches that require GMM estimations for all clusters in all steps. These calculations increase processing times and hardware requirements, thereby causing inefficiencies in the systems in which they are used.