1. Field
This disclosure relates to speech recognition systems and, more particularly, to methods of combining the N-best lists from multiple recognizers.
2. Background
Speech recognizers are those components used in speech recognition systems that perform the actual conversion from the incoming audio stream to text or commands. The recognizer uses algorithms to match what the user says to elements in a speech model. The recognizer then returns text corresponding to the user's speech to the application utilizing the speech recognition. In one example, the algorithms are run on a digital signal processor. However, even with powerful processors and detailed speech models, errors still occur. Word recognition rates are generally better than 90%, but failures occur, especially over sequences of words.
Because of uncertainties in the recognition process, the speech recognizer may return several possible text results and allow the application that requested the recognition to select the most appropriate result based on knowledge it possesses regarding the user, the task, the context or other factors. Many speech recognizers support this concept of N-best recognition. The recognizer returns a list of elements that the user might have said, typically accompanied by a score of how confident the recognizer is of each potential match. This list will be referred to here as an N-best list. The application software then decides which entry in the N-best list to use.
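By way of illustration only, an N-best list and the application's selection step might be sketched as follows. The names NBestEntry, select_entry and allowed_phrases are hypothetical and do not appear in this disclosure, and the sketch assumes that confidence scores lie between 0 and 1 and that the application's knowledge of the task can be expressed as a set of acceptable phrases.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Set


@dataclass(frozen=True)
class NBestEntry:
    """One candidate result returned by a recognizer (hypothetical type)."""
    text: str          # candidate text or command
    confidence: float  # recognizer's score, assumed here to lie in [0, 1]


def select_entry(n_best: Iterable[NBestEntry],
                 allowed_phrases: Set[str]) -> Optional[NBestEntry]:
    """Return the highest-scoring entry that the application's context allows.

    Falls back to the top-scoring entry if no candidate is in the allowed set,
    and returns None for an empty list.
    """
    ranked = sorted(n_best, key=lambda e: e.confidence, reverse=True)
    for entry in ranked:
        if entry.text in allowed_phrases:
            return entry
    return ranked[0] if ranked else None


# Illustrative data only; these scores are invented for the example.
n_best = [
    NBestEntry("call bob", 0.62),
    NBestEntry("call rob", 0.58),
    NBestEntry("fall bob", 0.11),
]
choice = select_entry(n_best, allowed_phrases={"call rob", "call ann"})
```

In this sketch the application passes over the recognizer's top-scoring entry because its own context (here, a set of known contacts) makes the second entry the more plausible match.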
Current speech recognition applications use only a single recognizer. However, many speech recognition applications may benefit from the use of several different recognizers. Different recognizers from different manufacturers perform differently even if targeted at the same market. This is due to the use of different algorithms to perform the speech recognition and different training data used to create speech models used by the recognizers. If multiple recognizers are used concurrently, several different N-best lists may be returned to the application. Recognition accuracy could be degraded if the N-best list selected is from a recognizer with poor performance in a particular situation.
Therefore, it would seem useful to have a process for selecting which recognizers should process an audio stream and one for combining N-best lists from different recognizers into one N-best list prior to the list being returned to the application.
One aspect of the disclosure is a speech recognition system. The system includes a port for receiving an input audio stream and one or more recognizers operable to convert the input audio stream from speech to text or commands. The system also includes a combiner operable to combine lists of possible results produced by each recognizer into a combined list. Some subset of the combined list is then sent back to the application, allowing the application to select the desired conversion result.
Another aspect of the disclosure is a method to utilize multiple speech recognizers. An input audio stream is routed to the enabled recognizers. The method of selecting the enabled recognizers is discussed below. A combiner receives a list of possible results from each of the enabled recognizers, combines the lists into a combined list, and then returns a subset of that list to the application.
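The routing step of this method might be sketched, purely as an example, as shown below. Each recognizer is modeled as a callable that accepts audio bytes and returns a list of (text, score) pairs. That interface, the function name route_audio and the stub recognizers are assumptions made for illustration, not an interface defined by this disclosure.

```python
from typing import Callable, Dict, List, Set, Tuple

# Assumed interface: a recognizer maps raw audio to an N-best list of
# (text, score) pairs. Real recognizers expose vendor-specific interfaces.
Recognizer = Callable[[bytes], List[Tuple[str, float]]]


def route_audio(audio: bytes,
                recognizers: Dict[str, Recognizer],
                enabled: Set[str]) -> Dict[str, List[Tuple[str, float]]]:
    """Send the audio to each enabled recognizer and collect its N-best list."""
    return {name: recognize(audio)
            for name, recognize in recognizers.items()
            if name in enabled}


# Stub recognizers with invented outputs, standing in for real engines.
def recognizer_a(audio: bytes) -> List[Tuple[str, float]]:
    return [("call bob", 0.62), ("call rob", 0.58)]


def recognizer_b(audio: bytes) -> List[Tuple[str, float]]:
    return [("call rob", 0.71), ("tall rob", 0.20)]


def recognizer_c(audio: bytes) -> List[Tuple[str, float]]:
    return [("cold rub", 0.30)]


recognizers = {"a": recognizer_a, "b": recognizer_b, "c": recognizer_c}

# Only recognizers "a" and "b" are enabled, so "c" never sees the audio.
results = route_audio(b"\x00\x01", recognizers, enabled={"a", "b"})
```

The per-recognizer lists collected in results are what a combiner would then receive.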
Another aspect of the disclosure is a method of combining N-best lists from multiple speech recognizers. A combiner receives an N-best list from each enabled speech recognizer and combines the entries in each list into an initial N-best list. The N-best list is then potentially reduced in size and sorted according to at least one sorting criterion. A subset of entries in the resulting sorted N-best list is then returned to the application.
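One possible form of such a combiner is sketched below as a non-limiting example. The sketch makes several assumptions that this disclosure does not state: that scores from different recognizers are directly comparable, that the size reduction consists of merging duplicate text entries and keeping the highest score for each, and that the sorting criterion is descending score with ties broken alphabetically. The function name combine_n_best and the parameter max_results are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Entry = Tuple[str, float]  # (candidate text, confidence score)


def combine_n_best(lists: Iterable[List[Entry]], max_results: int) -> List[Entry]:
    """Merge several N-best lists into one sorted, truncated N-best list."""
    # Build the initial combined list and reduce it by merging duplicates,
    # keeping the highest score seen for each text (an assumed policy).
    best: Dict[str, float] = {}
    for n_best in lists:
        for text, score in n_best:
            if text not in best or score > best[text]:
                best[text] = score

    # Sort by descending score; ties are broken alphabetically by text.
    ranked = sorted(best.items(), key=lambda item: (-item[1], item[0]))

    # Return only a subset of the sorted list to the application.
    return ranked[:max_results]


# Invented example lists from two recognizers.
combined = combine_n_best(
    [
        [("call bob", 0.62), ("call rob", 0.58)],
        [("call rob", 0.71), ("tall rob", 0.20)],
    ],
    max_results=2,
)
```

Here the entry "call rob" appears in both input lists, so the four input entries reduce to three distinct candidates before the top two are returned.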