1. Field of the Invention
The present invention relates to speech recognition and more specifically to various embodiments related to using pronunciation dependent language models and word clusters in the context of speech recognition.
2. Introduction
Historically, speech recognition started by attempting to solve the easiest, yet still important, recognition tasks. For example, a spoken dialog system may ask a user to say “yes” or “no” in response to a question, and the system will recognize the utterance and act accordingly. In a more complex dialog system, a user may state something like “I want my account balance” and the system will attempt to identify the task, which is to present the user with the account balance. Those early tasks invariably had simple language models with small vocabularies associated with well-defined applications. Examples include digit recognition, alphabet recognition or simple lists of commands. Given the limited scope of the tasks, it was easy to collect enough training data to build whole word acoustic models, which more recently became either context dependent whole word models or context dependent head-body-tail word fraction models. As the size and scope of the recognition tasks grew, the ability to provide such training data coverage diminished, and context dependent sub-word units became the acoustic units of choice.
In all of these cases, the basic unit for building the language models has always been the basic lexical unit, the word. In rare cases this model for the structure of the recognition system was broken, mostly to account for major pronunciation changes due to the heavy coarticulation that occurs in some short phrases. Such a phrase would be given a new lexical entry and an appropriate dictionary entry accounting for the pronunciation changes from the baseline phonemic baseforms. In the language modeling domain, many have attempted to model short frequent phrases as lexical items; these attempts were mostly, although not always, successful. In recent years the ability and willingness to collect ever more transcribed speech has resulted in several databases that are generally available and suitable for building recognition models that until recently would have been impossible, even though the transcriptions are often noisy because they had to be produced quickly and inexpensively. An example is a successful attempt to build a huge acoustic model using full covariances for tens of thousands of Gaussian components. To build that model, all the available speech training data in the EARS program from the Switchboard database was used.
The availability of large databases has yet to bring speech recognition to an optimal level. What is needed in the art are improved approaches to providing speech recognition given the availability of large databases.