Automated speech recognition is an important technique for implementing human-machine interfaces (HMIs) in a wide range of applications. In particular, speech recognition is useful in situations where a human user needs to focus on performing a task and where using traditional input devices such as a mouse and keyboard would be inconvenient or impractical. For example, in-vehicle “infotainment” systems, home automation systems, and many uses of small electronic mobile devices such as smartphones, tablets, and wearable computers can employ speech recognition to receive speech commands and other input from a user.
Most prior art speech recognition systems use a trained speech recognition engine to convert recorded spoken inputs from a user into digital data that is suitable for processing in a computerized system. Various speech recognition engines that are known to the art apply natural language understanding techniques to recognize the words that the user speaks and to extract semantic meaning from the words to control the operation of a computerized system.
In some situations, a single speech recognition engine is not necessarily optimal for recognizing speech from a user while the user performs different tasks. Prior art solutions attempt to combine multiple speech recognition systems to improve the accuracy of speech recognition, including by selecting low-level outputs from the acoustic models of different speech recognition models or by selecting entire sets of outputs from different speech recognition engines based on a predetermined ranking process. However, the prior art techniques that pick outputs from different speech recognition engines are often unsuitable for specific tasks in which a user employs some speech from a natural language but combines the natural language speech commands with words and sentences that are used for a specific purpose. For example, in an in-vehicle infotainment system the speech input from a vehicle operator can include a natural language such as English or Chinese combined with specific words and phrases that are not well recognized by speech recognition engines. Merely selecting among the outputs of different speech recognition engines, each of which has a high probability of including errors, does not increase the overall accuracy of speech recognition. Furthermore, existing speech recognition systems that combine only low-level outputs, such as the acoustic model outputs or other low-level features from multiple speech recognition engines, cannot evaluate the outputs of the different speech recognition engines using higher-level linguistic features. Consequently, improvements to the operation of automated systems that increase the accuracy of speech recognition using multiple speech recognition engines would be beneficial.