The present invention relates to speech recognition. More specifically, the present invention relates to the recognition of spoken, spelled words.
In speech recognition systems, an input speech signal is converted into words that represent the verbal content of the speech signal. This conversion begins by converting the analog speech signal into a series of digital values. The digital values are then passed through a feature extraction unit, which computes a sequence of feature vectors based on the digital values. Each feature vector represents a section of the speech signal.
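The framing step described above can be illustrated with a minimal sketch. It assumes the digital values are already available as a list of numbers, and it uses a toy two-dimensional feature (log energy and zero-crossing rate) purely for illustration; the text does not say which features are computed, and the names frame_signal, feature_vector and extract_features, as well as the frame length and hop size, are hypothetical.

```python
import math

def frame_signal(samples, frame_len, hop):
    """Split a list of digital samples into overlapping frames (sections)."""
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frames.append(samples[start:start + frame_len])
    return frames

def feature_vector(frame):
    """Toy feature vector for one frame: [log energy, zero-crossing rate].

    These two features are an assumption for illustration only.
    """
    energy = sum(s * s for s in frame)
    log_energy = math.log(energy + 1e-10)  # small offset avoids log(0) on silence
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0))
    return [log_energy, crossings / (len(frame) - 1)]

def extract_features(samples, frame_len=4, hop=2):
    """Return one feature vector per section of the signal."""
    return [feature_vector(f) for f in frame_signal(samples, frame_len, hop)]
```

Each returned vector corresponds to one section of the signal, so the output is a sequence of feature vectors in time order.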
The feature vectors are then used to identify the most likely sequence of words that would have generated the sequence of feature vectors. Typically, this involves applying the feature vectors to an acoustic model to determine the most likely sequences of sub-word units, usually senones, and then using a language model to determine which of these sequences of sub-word units is most likely to appear in the language. This most likely sequence of sub-word units is identified as the recognized speech.
In many systems, the sub-word units are concatenated to form words and sequences of words. A language model is accessed to determine a most likely sequence of words. The language model provides a statistical probability of any sequence of words. For example, a trigram language model provides the statistical probability of any three-word sequence. The structure and operation of such language models are well known.
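As an illustration of the trigram case, the sketch below estimates the probability of a word given the two preceding words from raw counts, which is the usual way a trigram model scores three-word sequences. It is a minimal maximum-likelihood sketch with no smoothing or back-off; the function names and the sentence-boundary markers are hypothetical and are not taken from the text.

```python
from collections import defaultdict

def train_trigram(sentences):
    """Count trigrams and their two-word contexts in tokenized sentences."""
    tri = defaultdict(int)
    ctx = defaultdict(int)
    for words in sentences:
        # Hypothetical boundary markers so the first words also have a context.
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            ctx[context] += 1
    return tri, ctx

def trigram_prob(tri, ctx, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    context_count = ctx.get((w1, w2), 0)
    if context_count == 0:
        return 0.0
    return tri.get((w1, w2, w3), 0) / context_count
```

A practical model would smooth these estimates so that unseen three-word sequences do not receive a probability of exactly zero.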
Though some current speech recognition systems attain a high degree of accuracy, they do make mistakes. For example, in a dictation (or document creation) system, a user may be rapidly dictating into the speech recognition system. The system may also provide a graphical output, in the nature of a display, displaying the words, as recognized. If the user notices that a word has been mis-recognized, the user may attempt to correct the word. This often entails the user first selecting the mis-recognized word by highlighting it with a mouse, keyboard, or other user input device. The user then attempts to correct the word using a number of techniques, such as re-speaking the word, or by spelling the word out loud.
However, recognizing spoken, spelled words is very difficult, and presents many problems, primarily due to the existing acoustic similarities among certain groups of letters. There are many confusable groups of letters, most notably the "E-set", which is formed of the letters b, c, d, e, g, p, t, v and z. Because of the minimal acoustic differences between letter pairs in the E-set, it is recognized as being one of the most confusable sets in the task of recognizing spoken letters. A number of other, less confusable groups present similar problems as well.
Because of the problems present with recognizing spoken letters, prior speech recognizers invoked dedicated spoken letter recognition systems. This has required the user to affirmatively take action to enter a special spelling recognition mode in order to spell spoken words. Still other systems required the user to spell using the military alphabet (e.g., alpha, bravo, charlie, etc.). However, this required the user to know the military alphabet, and also required a special purpose lexicon in the speech recognition system to recognize those words.
The speech recognizer includes a dictation language model providing a dictation model output indicative of a likely word sequence recognized based on an input utterance. A spelling language model provides a spelling model output indicative of a likely letter sequence recognized, based on the input utterance. An acoustic model provides an acoustic model output indicative of a likely speech unit recognized based on the input utterance. A speech recognition component is configured to access the dictation language model, the spelling language model and the acoustic model. The speech recognition component weights the dictation model output and the spelling model output in calculating likely recognized speech based on the input utterance. The speech recognizer can also be configured to confine spelled speech to an active lexicon. The present invention can also be practiced as a method.
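One way to picture the weighting of the two model outputs is the sketch below. It assumes each candidate carries an acoustic log score and a log score from either the dictation or the spelling language model, and that the weights scale the language model scores before the totals are compared. The text does not specify how the weights are applied, so this combination rule, the Hypothesis record, and the pick_hypothesis function are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    text: str        # recognized word, or the word formed by the spelled letters
    source: str      # "dictation" or "spelling" (hypothetical labels)
    acoustic: float  # acoustic model log score
    lm: float        # log score from the language model named in `source`

def pick_hypothesis(hypotheses, dictation_weight=1.0, spelling_weight=1.0,
                    active_lexicon=None):
    """Return the best-scoring hypothesis, or None if none remain.

    Assumed scoring rule: acoustic + weight * language model score.
    If an active lexicon is given, spelled hypotheses outside it are dropped.
    """
    weights = {"dictation": dictation_weight, "spelling": spelling_weight}
    candidates = [
        h for h in hypotheses
        if not (h.source == "spelling"
                and active_lexicon is not None
                and h.text not in active_lexicon)
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.acoustic + weights[h.source] * h.lm)
```

Under this assumed rule, shrinking the spelling weight reduces the penalty contributed by the spelling model's negative log score, so spelled hypotheses become more competitive; passing an active lexicon confines spelled results to words in that lexicon.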
Another feature of the present invention is directed to creation of the spelling language model. A lexicon is decomposed into individual letters and is then processed into the spelling language model.
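The decomposition of a lexicon into letters can be sketched as follows. This is a minimal illustration that assumes a letter bigram model estimated from raw counts; the text does not state the order of the spelling language model or how it is smoothed, and the function names and word-boundary markers are hypothetical.

```python
from collections import defaultdict

def build_spelling_model(lexicon):
    """Decompose each word into letters and estimate letter bigram probabilities."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in lexicon:
        # Hypothetical boundary markers for the start and end of a spelled word.
        letters = ["<w>"] + list(word.lower()) + ["</w>"]
        for prev, cur in zip(letters, letters[1:]):
            counts[prev][cur] += 1
    model = {}
    for prev, following in counts.items():
        total = sum(following.values())
        model[prev] = {letter: n / total for letter, n in following.items()}
    return model

def spelling_prob(model, word):
    """Probability of a letter sequence under the bigram model (0.0 if unseen)."""
    letters = ["<w>"] + list(word.lower()) + ["</w>"]
    p = 1.0
    for prev, cur in zip(letters, letters[1:]):
        p *= model.get(prev, {}).get(cur, 0.0)
    return p
```

For example, with a lexicon containing only "cat" and "cab", the letter sequence c-a-t receives probability 0.5, while c-a-d receives 0.0 because no word in the lexicon has "d" following "a".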