1. Technical Field
The present disclosure relates to speech processing and more specifically to recognizing speech from multiple speakers, each in a separate audio channel.
2. Introduction
The market for speech recognition systems is growing quickly, and competition among speech recognizers is becoming more intense. Nevertheless, the accuracy of most speech recognition systems has substantial room for improvement. Any noticeable improvement in recognition speed, accuracy, or both can become a significant selling point. One scenario in which accuracy is lacking is recognizing speech of a conversation between multiple parties, such as during a telephone call. One obstacle to recognizing speech accurately for such a conversation is that the speech of each speaker may be best recognized by a recognition model tuned for that speaker, which introduces the difficulty of determining which speaker is speaking at any given time. Another option is to use a generic recognition model for both speakers or for multiple speakers, but this can sacrifice recognition accuracy and/or speed.
FIG. 1 shows a prior art approach 100 for recognizing speech from a communication between a first user 104 and a second user 106 via communications infrastructure 102. The communications infrastructure 102 outputs a mono audio signal 108 of the conversation to a recognizer 110 that applies voice activity detection 112, turn detection 114, or similar approaches to sort out when each user is speaking in the mono audio signal 108, and attempts to apply the appropriate recognition model. This approach can be imprecise and computationally intensive because the recognizer 110 may apply the wrong model to portions of the speech in the mono audio signal 108 when generating the recognized speech 116.
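By way of illustration only, the voice activity detection step on a mono signal can be sketched as follows. This sketch assumes a simple frame-energy threshold, which is one common technique and not necessarily the one used by the voice activity detection 112 of FIG. 1; the function name detect_speech_segments, the frame size, and the threshold value are hypothetical choices made for this example.

```python
from typing import List, Sequence, Tuple


def detect_speech_segments(
    samples: Sequence[float],
    frame_size: int = 160,      # assumed: 20 ms frames at 8 kHz telephone audio
    threshold: float = 0.01,    # assumed: mean-square energy cutoff
) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of runs of frames above the threshold.

    Trailing samples that do not fill a whole frame are ignored.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    n_frames = len(samples) // frame_size
    for i in range(n_frames):
        frame = samples[i * frame_size:(i + 1) * frame_size]
        energy = sum(x * x for x in frame) / frame_size
        active = energy > threshold
        if active and start is None:
            start = i * frame_size                    # a speech run begins
        elif not active and start is not None:
            segments.append((start, i * frame_size))  # the run ends
            start = None
    if start is not None:
        segments.append((start, n_frames * frame_size))
    return segments


# Synthetic mono signal: two silent frames, two loud frames, one silent frame.
mono = [0.0] * 320 + [0.5] * 320 + [0.0] * 160
print(detect_speech_segments(mono))
```

Each returned segment indicates only that someone is speaking in the mono signal, not which user is speaking. Assigning a speaker to each segment still requires turn detection or a similar estimate, which is where the wrong recognition model can be applied.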
Many companies use archives of telephone conversations for analytics, but the conversations are recorded as a mono signal that combines the audio from each speaker of the call into one audio channel. Because of this, the system does not know who is speaking. Automatic techniques can detect dialog turns and divide the speech into turns, but these techniques are error-prone. To improve recognition, the system should know which person is speaking before trying to make sense of the speech. When the recognition is personalized for a particular speaker, environment, or codec, accurate knowledge of which person is speaking can reduce error rates significantly. However, most of the benefit of personalization disappears, and recognition can in fact get worse, when the system does not know which person is speaking or makes an incorrect guess.