The present invention relates to speech recognition systems and, more particularly, to computerized methods and systems for automatically generating, from a first speech recognizer, a second speech recognizer tailored to a certain application.
To achieve good acoustic resolution across different speakers, domains, or applications, a general purpose large vocabulary continuous speech recognizer, for instance, one based on the Hidden Markov Model (HMM), usually employs several thousand states and several tens of thousands of elementary probability density functions (pdfs), e.g., Gaussian mixture components, to model the observation likelihood of a speech frame. While this allows for an accurate representation of the many variations of sounds in naturally spoken human speech, the storage and evaluation of several tens of thousands of multidimensional pdfs during recognition is computationally expensive with respect to both computing time and memory footprint.
Both the total number of context dependent states and the total number of Gaussian mixture components are usually limited by some upper bound to avoid the use of computationally very expensive optimization methods, such as a Bayesian information criterion.
However, this has the disadvantage that some acoustic models are poorly trained, either because of a mismatch between the collected training data and the task domain or because of a lack of training data for certain pronunciations. Other models, in contrast, may be more complex than is necessary to achieve good recognition performance. In any case, the reliable estimation of several million parameters needs a large amount of training data and is a very time consuming process. Whereas applications such as large vocabulary continuous dictation systems (e.g., IBM Corporation's ViaVoice) can rely on today's powerful desktop computers, this is clearly infeasible in many applications that must deal with limited hardware resources, e.g., in the embedded systems or consumer devices market. However, such applications often need to perform only a limited task, e.g., the (speaker dependent) recognition of a few names from a speaker's address book, or the recognition of a few command words.
A state-of-the-art method for reducing the resources and computing time of large vocabulary continuous speech recognizers is the teaching of Curtis D. Knittle, "Method and System for Limiting the Number of Words Searched by a Voice Recognition System," U.S. Pat. No. 5,758,319, issued in 1998, the disclosure of which is incorporated by reference herein. As a severe drawback, however, this method achieves a resource reduction only through a runtime limitation of the number of candidate words in the active vocabulary by means of precomputed word sequence probabilities (the speech recognizer's language model). Such an approach does not seem acceptable, as it imposes an undesirable limitation of the recognition scope.
The present invention has the objective of providing a technology for fast and easy customization of a general speech recognizer to a given application. It is a further objective to provide a technology for providing specialized speech recognizers requiring reduced computation resources, for instance, in terms of computing time and memory footprint.
In one aspect of the invention, a computerized method and system are provided for automatically generating, from a first speech recognizer, a second speech recognizer tailored to a certain application and requiring reduced resources compared to the first speech recognizer.
The invention exploits the first speech recognizer's set of states s_i and set of probability density functions (pdfs) assembling output probabilities for an observation of a speech frame in the states s_i.
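For orientation, and assuming the pdfs are Gaussian mixture components as in the example given in the background, the output probability of a speech frame x in a state s_i is conventionally written as a weighted sum over the pdfs assigned to that state. The symbols w_{ij}, \mu_j, \Sigma_j, and the index set M_i are notation introduced here for illustration and are not taken from the source.

```latex
p(x \mid s_i) = \sum_{j \in M_i} w_{ij}\, \mathcal{N}(x;\ \mu_j, \Sigma_j),
\qquad w_{ij} \ge 0, \quad \sum_{j \in M_i} w_{ij} = 1
```

In this notation, reducing the set of states removes whole sums, while reducing the set of pdfs removes individual terms from the remaining sums.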
The invention teaches a first step of generating a set of states of the second speech recognizer reduced to a subset of states of the first speech recognizer being distinctive of the certain application.
The invention teaches a second step of generating a set of probability density functions of the second speech recognizer reduced to a subset of probability density functions of the first speech recognizer being distinctive of the certain application.
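The two steps above can be pictured with a minimal sketch. This passage does not state how "distinctive of the certain application" is measured, so the sketch assumes one plausible criterion: a state is kept if application-specific data visits it often enough, and a pdf is kept if a surviving state uses it with sufficient mixture weight. The function names, the visit-count and weight thresholds, and the toy data are all hypothetical.

```python
from collections import Counter


def select_states(alignments, min_visits=1):
    """Step 1 (assumed criterion): keep states that application data visits
    at least `min_visits` times.

    `alignments` is an iterable of state-id sequences, e.g. alignments of
    application-specific utterances against the first recognizer.
    """
    visits = Counter(state for seq in alignments for state in seq)
    return {state for state, n in visits.items() if n >= min_visits}


def select_pdfs(state_to_pdfs, kept_states, min_weight=0.0):
    """Step 2 (assumed criterion): for each kept state, keep pdfs whose
    mixture weight exceeds `min_weight`, then renormalize the weights."""
    reduced = {}
    for state in kept_states:
        weights = state_to_pdfs[state]
        kept = {pdf: w for pdf, w in weights.items() if w > min_weight}
        if not kept:
            # Never leave a state without a pdf: fall back to its heaviest one.
            best = max(weights, key=weights.get)
            kept = {best: weights[best]}
        total = sum(kept.values())
        reduced[state] = {pdf: w / total for pdf, w in kept.items()}
    return reduced


# Toy first recognizer: three states sharing four pdfs (hypothetical data).
state_to_pdfs = {
    "s1": {"g1": 0.75, "g2": 0.25},
    "s2": {"g2": 0.5, "g3": 0.5},
    "s3": {"g4": 1.0},
}
alignments = [["s1", "s1", "s2"], ["s1", "s3"]]

kept_states = select_states(alignments, min_visits=2)
reduced = select_pdfs(state_to_pdfs, kept_states, min_weight=0.3)
print(kept_states, reduced)
```

With these toy inputs only s1 is visited at least twice, and only g1 exceeds the weight threshold, so the reduced recognizer consists of one state with one pdf.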
The teachings of the present invention allow for the rapid development of new data files for recognizers in specific environments and for specific applications. The generated speech recognizers require significantly reduced resources, without decreasing the recognition accuracy or the scope of recognizable words.
The invention makes it possible to achieve a scalable recognition accuracy for the generated application-specific speech recognizer; the generation process can be executed repeatedly until the generated application-specific speech recognizer achieves the required resource and accuracy targets.
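The repeated execution described above can be sketched as a simple search loop. This is only an illustration under stated assumptions: the resource target is expressed as a maximum number of states, the pruning criterion is a visit-count threshold, and `evaluate` is a hypothetical callback standing in for an accuracy measurement on application data. None of these specifics come from the source.

```python
def generate_until_targets(state_visits, evaluate, max_states, min_accuracy):
    """Try candidate recognizers from smallest to largest and return the
    first (threshold, kept_states) pair meeting both targets, else None."""
    # Higher thresholds keep fewer states, so start with the highest.
    for threshold in sorted(set(state_visits.values()), reverse=True):
        kept = {s for s, n in state_visits.items() if n >= threshold}
        if len(kept) > max_states:
            break  # every later candidate is at least this large
        if evaluate(kept) >= min_accuracy:
            return threshold, kept
    return None


# Hypothetical visit counts gathered from application-specific data.
state_visits = {"s1": 40, "s2": 25, "s3": 3, "s4": 1}
total_visits = sum(state_visits.values())


def coverage(kept):
    """Toy stand-in for accuracy: fraction of visits the kept states cover."""
    return sum(state_visits[s] for s in kept) / total_visits


result = generate_until_targets(state_visits, coverage,
                                max_states=3, min_accuracy=0.9)
print(result)
```

Here the first candidate (s1 alone) covers too little of the data, while the second (s1 and s2) meets both targets, so the loop stops at threshold 25. Tightening the resource target to a single state makes the targets incompatible, and the function returns None.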