The present invention relates generally to speech recognition systems. More particularly, the invention relates to a speech recognition system that incorporates a language model that organizes syntactic content according to acoustic confusability.
Speech recognition involves acoustic pattern matching between the input speech and a previously trained acoustic model. Typically the model employs a large number of parameters and hence a great deal of processing time is usually expended during the pattern matching phase of the recognition process.
To address the high computational burden, some recognition systems endeavor to constrain the search space (i.e., consider fewer than all possible pattern matching nodes) based on natural language constraints. In other words, a priori knowledge of the natural language (e.g., English, Japanese, French, Italian) can be used to assist the recognizer in identifying the most probable word candidates. Language models can thus be used as a source of information in speech recognition, to limit the number of acoustic pattern matching sequences that are actually considered during the recognition process. The goal of the language model is to predict words in a given context.
Syntactic models rely on a formal grammar of the language. In such models, syntactic sentence structures are defined by rules that can represent global constraints on word sequences. Statistical models, or stochastic models, use a different approach. Stochastic models provide a probability distribution for a set of suitably defined contexts. The probability distribution depends on the training data available and on how the context has been defined.
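As an illustration of a stochastic model, the following minimal sketch estimates a bigram distribution, in which the context is defined as the single preceding token. The function name `train_bigram`, the boundary markers `<s>` and `</s>`, and the training names are assumptions chosen for illustration; relative-frequency estimation is only one of several ways such a distribution could be obtained.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Estimate P(next | previous) by relative frequency over the training sequences."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Boundary markers (assumed here) let the model score sequence starts and ends.
        tokens = ["<s>"] + list(seq) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    model = {}
    for prev, followers in counts.items():
        total = sum(followers.values())
        model[prev] = {tok: n / total for tok, n in followers.items()}
    return model

# Hypothetical training data: spelled names treated as letter sequences.
names = ["anna", "anne", "dana"]
bigram = train_bigram(names)
```

With different training data, or with the context defined as two preceding tokens instead of one, the resulting distribution would differ, which is the dependence noted above.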
Both approaches have their strengths and weaknesses. Syntactic models enforce strong syntactic and grammatical constraints, but they are very difficult to extend to spontaneous speech and natural language. Stochastic models are generally better able to handle spontaneous speech and natural language; however, they do not always exclude word strings that make no sense in the natural language.
To further enhance recognition accuracy, some recognition systems apply a priori knowledge of the source content within the lexicon matching phase. Typically, the lexicon matching phase is performed after the recognizer has extracted features from the acoustic information of the input signal using acoustic pattern matching. This lexicon matching phase, in effect, classifies the speech patterns resulting from feature extraction, matching those patterns to entries in the lexicon or dictionary. Dynamic programming algorithms are frequently used in this phase.
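By way of example only, a dynamic programming match of a recognized letter string against lexicon entries might be sketched as follows. The unit-cost edit distance used here is an assumption; a practical system could weight substitutions by acoustic similarity. The names `edit_distance` and `best_match` are hypothetical.

```python
def edit_distance(a, b):
    """Unit-cost edit distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def best_match(hypothesis, lexicon):
    """Return the lexicon entry closest to the recognized letter string."""
    return min(lexicon, key=lambda entry: edit_distance(hypothesis, entry))
```

For instance, with a hypothetical lexicon of "jones", "james" and "joan", the recognized string "jonez" would be matched to "jones".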
While the above-described techniques work well in many applications, there are some particularly difficult speech recognition problems that are not adequately addressed by existing technology. One such problem is the letter recognition problem encountered when attempting to recognize spelled words or names. Spoken letter recognition is difficult, because many letters sound nearly the same, particularly when transmitted through a low quality audio channel, such as currently found in telephone equipment. Anyone who has attempted to spell a name to another person through the telephone will appreciate the difficulty. Where audio transmission quality is lacking, many letters are confusable with one another.
Applying conventional technology, it has been proposed that the dictionary or lexicon (containing all spelled names recognized by the system) also be encoded with knowledge of confusable letters. To do this, the dictionary is augmented with additional entries, representing the original entries, but spelled using confusable letter substitutions. While this technique will work in some cases, it is not without several significant drawbacks. First, adding entries to the dictionary for all possible permutations of letter substitutions greatly increases the size of the dictionary and thus greatly increases the computational burden. Second, by including all possible substitution permutations, some ambiguities can arise where a given string of letters can map to several different entries in the dictionary.
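Both drawbacks can be seen in a small sketch of the dictionary augmentation approach. The confusable sets in `CONFUSABLE` and the helper names `spelling_variants` and `augment` are assumptions made for illustration and are not drawn from any particular prior system.

```python
from itertools import product

# Hypothetical confusable sets: each letter maps to the letters it may be heard as.
CONFUSABLE = {"b": "bdp", "d": "bdp", "p": "bdp", "m": "mn", "n": "mn"}

def spelling_variants(name):
    """All spellings reachable by substituting confusable letters position by position."""
    options = [CONFUSABLE.get(ch, ch) for ch in name]
    return {"".join(combo) for combo in product(*options)}

def augment(lexicon):
    """Map every variant spelling back to the original entries that produce it."""
    table = {}
    for entry in lexicon:
        for variant in spelling_variants(entry):
            table.setdefault(variant, set()).add(entry)
    return table

# The four-letter entry "bond" alone expands to 3 * 1 * 2 * 3 = 18 spellings.
variants = spelling_variants("bond")
# With "dan" and "pam" in the lexicon, every variant maps to both entries.
table = augment(["dan", "pam"])
```

Under these assumed sets, the number of variants grows as the product of the confusable set sizes at each letter position, and a string such as "dam" cannot be resolved to a single entry.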
The present invention takes a different approach. Instead of applying knowledge of confusability in the lexicon or dictionary, the invention injects knowledge of confusability directly into the language model used by the recognizer to limit the number of sequences that are considered at the acoustic level. In other words, knowledge of inherent confusability is embedded into the recognizer itself, where the knowledge is exploited to achieve higher efficiency. Thus confusability is taken into account prior to the lexicon matching phase.
The invention is described herein in the context of a spelled name letter recognizer, inasmuch as the spelled name application presents a good opportunity to exploit inherent confusability. Of course, the principles of the invention are not limited to letter recognition or spelled name recognition. In general, the invention can be used in any speech recognition system for analyzing input speech that corresponds to a pre-defined set of syntactically defined content. In the case of spelled name recognition, the syntactically defined content represents those defined sequences of letters representing names in the dictionary. In a more general application speech recognizer, the syntactically defined content might represent the general concatenation of phonemes, words, phrases and sentences which are grammatically correct according to the particular language model.
A speech recognizer system constructed in accordance with the invention includes a speech recognizer that performs a recognition process on input speech by considering a plurality of acoustic pattern matching sequences. A language model, associated with the recognizer, constrains the number of sequences considered by the recognizer during the recognition process. The language model is based on knowledge of the pre-defined set of syntactically defined content, and includes a data structure that organizes the content according to acoustic confusability. The data structure can take many forms, including an N-gram data structure, a tree data structure, or an interactively configured network having nodes selected based on acoustic distance from a pre-determined lexicon.
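Purely as an illustrative sketch, and not as the construction described in the remaining specification, one way to picture an N-gram data structure organized by confusability is to group confusable letters into classes and record which class bigrams occur in the lexicon. The class assignments in `CLASS_OF` and the function names are assumptions.

```python
# Hypothetical confusable classes: letters in the same class are treated as one symbol.
CLASS_OF = {"b": "E", "d": "E", "p": "E", "t": "E", "m": "N", "n": "N"}

def to_classes(letters):
    """Replace each letter by its class label; unlisted letters stand for themselves."""
    return tuple(CLASS_OF.get(ch, ch) for ch in letters)

def allowed_class_bigrams(lexicon):
    """Collect every class bigram that occurs in some lexicon entry."""
    allowed = set()
    for entry in lexicon:
        seq = ("<s>",) + to_classes(entry) + ("</s>",)
        allowed.update(zip(seq, seq[1:]))
    return allowed

def is_permitted(letters, allowed):
    """True if every class bigram of the letter sequence was seen in the lexicon."""
    seq = ("<s>",) + to_classes(letters) + ("</s>",)
    return all(pair in allowed for pair in zip(seq, seq[1:]))

allowed = allowed_class_bigrams(["dan", "tim"])
```

In this sketch a sequence such as "pam" remains permitted because it falls in the same class sequence as "dan", while a sequence beginning with a letter class never seen at the start of an entry is excluded before any lexicon matching takes place.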
For a more complete understanding of the invention, its objects and advantages, refer to the remaining specification and to the accompanying drawings.