Recent advances in computer hardware and software have allowed computer speech recognition (CSR) to cross the threshold of usability. Systems are now available for high-end personal computers that can be used for large-vocabulary, continuous-speech dictation. To obtain adequate performance, such systems need to be adapted to a specific user's voice and environment of usage. In addition, these systems can only recognize words drawn from a certain vocabulary and are usually tied to a particular language model, which captures the relative probabilities of different sequences of words. Without all of these constraints, it is very difficult to get adequate performance from a CSR system.
In most CSR systems, the user- and environment-specific components, namely the acoustic models, are kept separate from the vocabulary and language models. However, because of the above constraints, any application that requires speech recognition needs access to both the user/environment-specific acoustic models and the application-specific vocabulary and language models.
This is a major obstacle to moving CSR systems beyond standalone dictation to systems where many different users will need to access different applications, possibly in parallel and often over the internet or a local area network (LAN). The reason is that either: (a) each application will have to keep separate acoustic models for each user/environment; or (b) each user will need to maintain a separate set of vocabularies and language models for each application they wish to use. Since acoustic and language models are typically on the order of megabytes to tens of megabytes in size for a medium- to large-vocabulary application, it follows that in either scenario (a) or (b) the storage required grows with the product of the number of users and the number of applications, and system resources will easily be overwhelmed.
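The scaling argument above can be made concrete with a rough calculation. The following is only an illustrative sketch: the function name `storage_mb`, the user and application counts, and the 10 MB model sizes are assumptions chosen for the example (the sizes fall within the "megabytes to tens of megabytes" range stated above), and the third figure, in which each model is stored exactly once, is included purely for comparison.

```python
def storage_mb(num_users, num_apps, acoustic_mb, language_mb):
    """Estimate total model storage in MB (all inputs are hypothetical).

    Returns a tuple (scenario_a, scenario_b, stored_once).
    """
    # Scenario (a): every application keeps an acoustic model for each
    # user, plus its own vocabulary/language model.
    scenario_a = num_apps * (num_users * acoustic_mb + language_mb)

    # Scenario (b): every user keeps a vocabulary/language model for each
    # application, plus their own acoustic model.
    scenario_b = num_users * (num_apps * language_mb + acoustic_mb)

    # Comparison only: each acoustic model and each vocabulary/language
    # model is stored exactly once.
    stored_once = num_users * acoustic_mb + num_apps * language_mb

    return scenario_a, scenario_b, stored_once


# Assumed figures: 1000 users, 20 applications, 10 MB per model.
a, b, once = storage_mb(num_users=1000, num_apps=20,
                        acoustic_mb=10, language_mb=10)
print(a, b, once)  # 200200 210000 10200
```

Under these assumed figures, both scenarios require roughly 200 GB of storage, about twenty times what would be needed if each model were stored only once.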
One possibility is to store the acoustic models on a different machine from the vocabulary and language models, and to connect the machines via a LAN or the internet. However, in either (a) or (b), enormous amounts of network traffic will be generated as megabytes of data are transferred to the target recognizer.
Thus, a need exists for a CSR system that is independent of the vocabulary and language models of an application without sacrificing performance in terms of final recognition accuracy.