There has long been a desire to have machines capable of responding to human speech, such as machines capable of obeying human commands and transcribing human dictation. Such machines would make it much easier for humans to communicate with computers, as well as to record and organize their own thoughts.
Owing to recent advances in computer technology and in algorithms for the recognition of speech, speech recognition machines have begun to appear in the past several decades, and have become increasingly powerful and increasingly inexpensive. For example, the assignee of the present application has previously marketed speech recognition software which runs on popular personal computers and which requires little extra hardware beyond an inexpensive microphone, an analog-to-digital converter, a preamplifier and a relatively inexpensive microprocessor to perform simple signal processing. This system is capable of providing speaker dependent, discrete word recognition for vocabularies of up to 64 words at any one time. An even more advanced form of dictation system is described in U.S. patent application Ser. No. 797,249, entitled "Speech Recognition Apparatus And Method", filed November 12, 1985 by James K. Baker et al., and assigned to the assignee of the present application. U.S. patent application Ser. No. 797,249, which is incorporated herein by reference, discloses a speech recognition system of a type capable of recognizing vocabularies of many thousands of words.
Most present speech recognition systems operate by matching an acoustic description, or model, of a word in their vocabulary against a representation of the acoustic signal generated by the utterance of the word to be recognized. In many such systems, the acoustic signal generated by the speaking of the word to be recognized is converted by an A/D converter into a digital representation of the successive amplitudes of the audio signal created by the speech. That signal is then converted into a frequency domain signal, which consists of a sequence of frames, each of which gives the amplitude of the speech signal in each of a plurality of frequency bands over a brief interval of time. Such systems commonly operate by comparing the sequence of frames produced by the utterance to be recognized with a sequence of nodes, or frame models, contained in the acoustic model of each word in their vocabulary.
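By way of illustration only, the conversion of digitized samples into a sequence of frames can be sketched as follows. The text above does not state how the frequency domain signal is computed; this sketch assumes a plain discrete Fourier transform over non-overlapping blocks of samples, and the function name `frames_from_samples` and the parameters `frame_len` and `n_bands` are hypothetical.

```python
import cmath
import math

def frames_from_samples(samples, frame_len=8, n_bands=4):
    """Split samples into consecutive blocks and return, for each block,
    one amplitude per frequency band (assumed DFT-based front end)."""
    half = frame_len // 2                 # DFT bins 1..half are used; DC is skipped
    bins_per_band = half // n_bands
    if bins_per_band == 0:
        raise ValueError("n_bands must not exceed frame_len // 2")
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        chunk = samples[start:start + frame_len]
        # Magnitude of each DFT bin for this block of samples.
        mags = [
            abs(sum(x * cmath.exp(-2j * math.pi * k * n / frame_len)
                    for n, x in enumerate(chunk)))
            for k in range(1, half + 1)
        ]
        # Average adjacent bins so that each frame holds n_bands amplitudes.
        frames.append([
            sum(mags[b * bins_per_band:(b + 1) * bins_per_band]) / bins_per_band
            for b in range(n_bands)
        ])
    return frames
```

Each element of the returned list plays the role of one frame: a short vector of band amplitudes covering a brief interval of time.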
Originally, the performance of such frame matching systems was relatively poor, since the individual sounds which make up a given word are seldom, if ever, spoken at exactly the same rate or in exactly the same manner in any two utterances of that word. Fortunately, two major techniques have been developed which have greatly improved the performance of such systems. The first is probabilistic matching, which determines the likelihood that a given frame of an utterance corresponds to a given node in an acoustic model of a word. It determines this likelihood not only as a function of how closely the amplitudes of the individual frequency bands of the frame match the expected amplitudes contained in the given node model, but also as a function of how the deviation between the actual and expected amplitudes in each such frequency band compares to the expected deviations for such values. Such probabilistic matching gives a recognition system a much greater ability to deal with the variations in speech sounds which occur in different utterances of the same word, and a much greater ability to deal with the noise which is commonly present during speech recognition tasks.
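By way of illustration only, probabilistic matching of one frame against one node can be sketched as follows. The text above does not name a probability distribution; this sketch assumes that each frequency band is modeled independently by a Gaussian with an expected amplitude and an expected deviation. The function name `frame_node_score` and its parameters are hypothetical.

```python
import math

def frame_node_score(frame, node_means, node_devs):
    """Negative log-likelihood of a frame given a node (lower is a better match).

    frame      -- observed amplitude in each frequency band
    node_means -- expected amplitude in each band for this node
    node_devs  -- expected deviation in each band for this node (assumed > 0)
    """
    score = 0.0
    for amp, mu, sigma in zip(frame, node_means, node_devs):
        # Deviation measured in units of the deviation expected for this band.
        z = (amp - mu) / sigma
        score += 0.5 * z * z + math.log(sigma) + 0.5 * math.log(2.0 * math.pi)
    return score
```

Under these assumptions, a given departure from the expected amplitude is penalized far less in a band where large deviations are expected than in a band where the amplitude is normally stable, which is the behavior described above.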
The second major technique which greatly improves the performance of such frame matching systems is that of dynamic programming. Stated simply, dynamic programming provides a method to find an optimal, or near optimal, match between the sequence of frames produced by an utterance and the sequence of nodes contained in the model of a word. It does this by effectively expanding and contracting the duration of each node in the acoustic model of a word to compensate for the natural variations in the duration of speech sounds which occur in different utterances of the same word. A more detailed discussion of the application of dynamic programming to speech recognition is available in J. K. Baker's article entitled "Stochastic Modeling for Automatic Speech Recognition" in the book Speech Recognition, edited by D. R. Reddy and published by Academic Press, New York, New York in 1975.
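By way of illustration only, the dynamic programming match can be sketched as follows. This is a minimal sketch and not the method of the cited reference: it assumes that every frame is assigned to exactly one node, that nodes are visited strictly in order, and that each node absorbs at least one frame. The function name `align` and the caller-supplied `score` function are hypothetical.

```python
def align(frames, nodes, score):
    """Return the lowest total cost of matching frames to nodes in order.

    score(frame, node) gives the cost of assigning one frame to one node.
    Each node may absorb one or more consecutive frames, which is how its
    duration is expanded or contracted to fit the utterance.
    """
    if not frames or not nodes:
        raise ValueError("frames and nodes must be non-empty")
    inf = float("inf")
    # best[j]: lowest cost so far with the most recent frame assigned to node j.
    best = [inf] * len(nodes)
    best[0] = score(frames[0], nodes[0])
    for frame in frames[1:]:
        new = [inf] * len(nodes)
        for j, node in enumerate(nodes):
            stay = best[j]                           # node j absorbs another frame
            advance = best[j - 1] if j > 0 else inf  # move on from node j-1
            prev = min(stay, advance)
            if prev < inf:
                new[j] = prev + score(frame, node)
        best = new
    return best[-1]  # a complete match must end in the final node
```

For example, with scalar frames and `score = lambda f, n: abs(f - n)`, the utterances `[1, 1, 1, 5, 5]` and `[1, 5, 5, 5, 5]` both match the two-node model `[1, 5]` at zero cost, even though the first sound lasts three frames in one and a single frame in the other.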
The performance of present speech recognition systems is impressive when compared to similar systems of a short time ago. Nevertheless, there is still a need for further improvement. For example, in order for a speech recognition system to be of practical use for many tasks, it needs to be able to recognize a large vocabulary of words. Unfortunately, most high performance speech recognition systems require that a given user speak each word in the system's vocabulary a plurality of times before the system can reliably recognize the speaking of that word by the given user. This enables the system to develop a relatively reliable model of how the user speaks each such word. Most of the speech recognition systems made to date have had small vocabularies, and thus this requirement of speaking each vocabulary word several times has been a relatively acceptable burden. But in large vocabulary systems it becomes a very large burden. For example, in a system with 50,000 words in which the user is required to say each word five times, the user would be required to speak 250,000 utterances in order to train the system. This would require saying one word every second, without interruption, for more than eight successive eight-hour work days. Clearly, the requirement of such training will discourage the use of large vocabulary systems unless a much simpler method for enrolling new users can be developed.
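By way of illustration only, the training burden stated above can be checked with a short calculation. The function name `training_burden` is hypothetical; the figures are those given in the text.

```python
def training_burden(vocabulary_size, repetitions, seconds_per_utterance=1.0,
                    hours_per_work_day=8.0):
    """Return (utterances, hours, work_days) needed to train every word."""
    utterances = vocabulary_size * repetitions
    hours = utterances * seconds_per_utterance / 3600.0
    work_days = hours / hours_per_work_day
    return utterances, hours, work_days

utterances, hours, work_days = training_burden(50_000, 5)
print(utterances, round(hours, 1), round(work_days, 1))
```

At one utterance per second, 250,000 utterances take roughly 69 hours of uninterrupted speaking, or between eight and nine eight-hour work days.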
Another desirable goal involving large vocabulary systems is that of deriving more efficient means for representing the acoustic models associated with each of their words. Although the drop in the price and size of memory has somewhat reduced the need for deriving more efficient acoustic word representations, it certainly has not eliminated the desirability of more compact representations. For example, the acoustic representation technique described in the above-mentioned U.S. patent application Ser. No. 797,249 requires at least 16 bytes for each node of each word. Since words typically involve five to ten nodes, each word requires somewhere in the vicinity of 80 to 160 bytes to represent its acoustic model. At this rate, a fifty thousand word vocabulary would probably require more than five megabytes of memory. Thus it can be seen that it would be desirable to find a more efficient technique of storing acoustic word models.
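By way of illustration only, the memory estimate above can be reproduced as follows. The function name `model_memory_bytes` is hypothetical, and the sketch assumes the minimum of 16 bytes per node stated in the text, with no other per-word overhead.

```python
def model_memory_bytes(vocabulary_size, nodes_per_word, bytes_per_node=16):
    """Return the bytes needed to store one acoustic model per vocabulary word."""
    return vocabulary_size * nodes_per_word * bytes_per_node

low = model_memory_bytes(50_000, 5)     # every word at the short end of the range
high = model_memory_bytes(50_000, 10)   # every word at the long end of the range
print(low, high)
```

The two bounds are 4,000,000 and 8,000,000 bytes, so a vocabulary whose words average somewhat more than six nodes exceeds five million bytes, consistent with the estimate given above.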