Large organizations, such as commercial organizations, financial organizations or public safety organizations, conduct numerous interactions with customers, users, suppliers or other persons on a daily basis. Many of these interactions are vocal, or at least comprise a vocal component, such as an audio part of a video or face-to-face interaction. A significant part of these interactions takes place between a customer and a representative of the organization such as an agent. Many of the interactions are captured and often recorded.
The interactions convey large volumes of data which may be of high significance to the organization. However, this data is not structured and is therefore not easily accessible. To gain insight into the data conveyed by the interactions, audio analysis techniques need to be applied to the audio in order to extract the information.
The interactions and their content can be used for a multiplicity of purposes. One such purpose relates to quality monitoring for assessing the quality of the agent handling the interaction, or of another entity associated with the call center such as a product, the organization, or the like. Another usage of the interactions relates to analyzing the customer experience, for example whether the customer is happy with the product or service, is threatening to leave, has mentioned competitor names, or the like. Automated systems activate multiple tools as part of the analysis. Such tools may include voice recognition tools such as automatic speech recognition or word spotting, emotion analysis tools, and call flow analysis tools, which examine for example interaction duration, hold time, number of transfers or the like. Different tools may be required for different analyses.
The sides of the interaction, e.g., the agent and the customer, may be recorded separately, i.e., on two separate audio signals, in which case it may be known in advance which signal represents the agent and which one represents the customer. In other cases the interactions may be recorded as summed, i.e., the two sides are recorded on one audio signal.
Some of the audio analysis tools are highly dependent on being activated on a single-speaker signal. For example, activating an emotion detection tool on a summed audio signal is likely to provide erroneous results. Therefore, in order to activate these tools on summed audio, the signal must be separated into two signals, each containing speech segments spoken by a single speaker only. The separated signals may contain non-continuous segments of the original interaction, due to speech of the other side, double talk, or the like.
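The separation described above can be illustrated with a minimal Python sketch. It assumes that some earlier segmentation step, not shown here, has already labeled time ranges of the summed signal by speaker. The function name `split_by_speaker`, the segment format and the labels "A" and "B" are hypothetical and serve only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical segment format: (start_sample, end_sample, speaker_label),
# with end_sample exclusive. The labels are assumed to come from a prior
# segmentation step that is not part of this sketch.
Segment = Tuple[int, int, str]


def split_by_speaker(samples: List[float],
                     segments: List[Segment]) -> Dict[str, List[float]]:
    """Concatenate each speaker's segments into a one-sided signal."""
    per_speaker: Dict[str, List[float]] = {}
    # Sorting by start sample keeps each one-sided signal in time order.
    for start, end, label in sorted(segments):
        per_speaker.setdefault(label, []).extend(samples[start:end])
    return per_speaker


# Toy summed signal of eight samples.
summed = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
# Sample 5 belongs to no segment, e.g. because of double talk.
segments = [(0, 2, "A"), (2, 5, "B"), (6, 8, "A")]

signals = split_by_speaker(summed, segments)
print(signals["A"])  # samples 0-1 and 6-7 of the original signal
print(signals["B"])  # samples 2-4 of the original signal
```

Each resulting signal is non-continuous with respect to the original interaction: the signal of speaker "A" skips the samples attributed to speaker "B" as well as the excluded sample. At this stage the labels do not indicate which speaker is the agent and which is the customer.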
In some embodiments, different analyses may be more relevant to one side of the interaction than to the other. For example, it may be more important to detect emotion on the customer side than on the agent side. In contrast, verifying that compliance words were said may be a part of quality assurance, which is relevant to the agent side.
Therefore, in such situations and when the audio is summed, in addition to separating the audio into two signals, it is also required to identify which signal represents the agent side and which represents the customer side, in order to activate the tools relevant for each side.
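Once each one-sided signal has been associated with a side, activating the relevant tools per side can be sketched as follows. This is an illustrative Python sketch only: the functions `detect_emotion` and `spot_compliance_words` are hypothetical placeholders that return a description instead of performing real analysis, and the mapping `TOOLS_BY_ROLE` is an assumption based on the examples given above.

```python
from typing import Callable, Dict, List

Signal = List[float]


def detect_emotion(signal: Signal) -> str:
    # Hypothetical placeholder for an emotion analysis tool.
    return "emotion analysis on %d samples" % len(signal)


def spot_compliance_words(signal: Signal) -> str:
    # Hypothetical placeholder for a word spotting tool.
    return "compliance word spotting on %d samples" % len(signal)


# Assumed mapping from the side of the interaction to the relevant tools.
TOOLS_BY_ROLE: Dict[str, List[Callable[[Signal], str]]] = {
    "customer": [detect_emotion],
    "agent": [spot_compliance_words],
}


def analyze(one_sided: Dict[str, Signal]) -> Dict[str, List[str]]:
    """Run, on each one-sided signal, only the tools relevant to its side."""
    return {
        role: [tool(signal) for tool in TOOLS_BY_ROLE.get(role, [])]
        for role, signal in one_sided.items()
    }


results = analyze({"agent": [0.1, 0.2, 0.7, 0.8], "customer": [0.3, 0.4, 0.5]})
print(results["agent"])
print(results["customer"])
```

The dispatch is only possible because the keys identify the side of each signal, which is why the identification step is needed in addition to the separation.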
There is thus a need for a method for speaker source identification, which will segment a summed audio signal into separate signals if required, and will associate each one-sided audio signal of an interaction with a customer of a call center or with an agent handling the interaction.