Discrete large-vocabulary speech recognition systems have been available for use on desktop personal computers for approximately twelve years as of the writing of this patent application. A discrete speech recognition system can recognize only a single set of one or more recognition candidates per utterance, each candidate consisting of one vocabulary word, where a vocabulary word can correspond, for example, to a single word, a letter name, or even a multiword phrase that the system treats as one word. Continuous speech recognition, on the other hand, can produce a sequence of sets of one or more recognition candidates, each consisting of one or more vocabulary words, in response to a single utterance. Continuous large-vocabulary speech recognition systems have been available for use on such computers for approximately seven years as of that time. Such speech recognition systems have proven to be of considerable worth. In fact, much of the text of the present patent application has been prepared by the use of a large-vocabulary continuous speech recognition system.
As used in this specification and the claims that follow, when we refer to a large-vocabulary speech recognition system, we mean one that has the ability to recognize a given utterance as being any one of at least two thousand different vocabulary words at one time, with the recognition depending upon which of those words has corresponding phonetic or acoustic models that most closely match the given spoken word.
As indicated by FIG. 1, large-vocabulary speech recognition typically functions by having a user 100 speak into a microphone 102, which in the example of FIG. 1 is a microphone of a cellular telephone 104. The microphone transduces the variation in air pressure over time caused by the utterance of one or more words into a corresponding waveform represented by an electronic signal 106. In many speech recognition systems, this waveform signal is converted, by digital signal processing performed either by a computer processor or by a special digital signal processor 108, into a time domain representation 110. Often the time domain representation comprises a plurality of parameter frames 112, each of which represents properties of the sound represented by the waveform 106 at one of a plurality of successive time periods, such as every one-hundredth of a second.
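The conversion of a waveform into parameter frames described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions and is not the signal processing used by any particular system: the function name frame_parameters, the 8,000 samples-per-second rate, and the two parameters computed per frame (log energy and zero-crossing count) are all hypothetical choices made for this example.

```python
import math


def frame_parameters(samples, sample_rate, frames_per_second=100):
    """Split a waveform into successive frames and compute parameters for each.

    Assumption: the two parameters (log energy, zero-crossing count) are
    illustrative stand-ins; the source text does not say which parameters
    a frame contains.
    """
    frame_len = sample_rate // frames_per_second  # samples per 1/100 second
    frames = []
    # Step through the waveform one non-overlapping frame at a time,
    # dropping any incomplete frame at the end.
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        window = samples[start:start + frame_len]
        energy = sum(s * s for s in window) / frame_len
        log_energy = math.log(energy + 1e-10)  # small offset avoids log(0)
        zero_crossings = sum(
            1 for a, b in zip(window, window[1:]) if (a < 0) != (b < 0)
        )
        frames.append((log_energy, zero_crossings))
    return frames


# Usage: one second of a synthetic 440 Hz tone standing in for speech.
sample_rate = 8000  # hypothetical sampling rate
tone = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]
frames = frame_parameters(tone, sample_rate)
```

With these assumed settings, one second of audio yields one frame for each one-hundredth of a second, and each frame holds one value per parameter.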
As indicated in FIG. 2, the time domain, or frame, representation of an utterance to be recognized is then matched against a plurality of possible sequences of phonetic models 200 corresponding to different words in a large vocabulary. In most large-vocabulary speech recognition systems, individual words 202 are each represented by a corresponding phonetic spelling 204, similar to the phonetic spellings found in most dictionaries. Each phoneme in a phonetic spelling has one or more phonetic models 200 associated with it. In many systems the models 200 are phoneme-in-context models, which model the sound of their associated phoneme when it occurs in the context of the preceding and following phonemes in a given word's phonetic spelling. The phonetic models are commonly composed of a sequence of one or more probability models, each of which represents the probability of different parameter values for each of the parameters used in the frames of the time domain representation 110 of an utterance to be recognized.
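The matching of frames against sequences of phonetic models can be sketched in Python as follows. This is a toy illustration under several assumptions that do not come from the text above: each phoneme has a single context-independent model, each model is an independent Gaussian per frame parameter, each frame has only one parameter, and frames are aligned to the model sequence by a simple left-to-right dynamic-programming search. The names score_word, recognize, lexicon, and phoneme_models, and all numeric values, are hypothetical.

```python
import math


def log_gaussian(x, mean, var):
    """Log probability density of x under a one-dimensional Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def frame_log_prob(frame, model):
    """Score one frame against one probability model.

    Assumption: a model is a list of (mean, variance) pairs, one per frame
    parameter, with the parameters treated as independent.
    """
    return sum(log_gaussian(x, m, v) for x, (m, v) in zip(frame, model))


def score_word(frames, models):
    """Best log probability of aligning the frames, in order, to the models.

    Each frame either stays on the current model or advances to the next
    one. Requires at least one frame.
    """
    neg_inf = float("-inf")
    best = [neg_inf] * len(models)
    best[0] = frame_log_prob(frames[0], models[0])
    for frame in frames[1:]:
        new = [neg_inf] * len(models)
        for j, model in enumerate(models):
            prev = best[j] if j == 0 else max(best[j], best[j - 1])
            if prev > neg_inf:
                new[j] = prev + frame_log_prob(frame, model)
        best = new
    return best[-1]  # the alignment must end on the word's last model


def recognize(frames, lexicon, phoneme_models):
    """Return the best-scoring word and the score of every word."""
    scores = {
        word: score_word(frames, [phoneme_models[p] for p in spelling])
        for word, spelling in lexicon.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy data: two words, one parameter per frame.
lexicon = {"cat": ["k", "ae", "t"], "cap": ["k", "ae", "p"]}
phoneme_models = {
    "k": [(0.0, 1.0)],
    "ae": [(5.0, 1.0)],
    "t": [(10.0, 1.0)],
    "p": [(-5.0, 1.0)],
}
frames = [(0.1,), (4.8,), (5.2,), (9.9,)]
best_word, scores = recognize(frames, lexicon, phoneme_models)
```

In this toy data the final frame lies close to the assumed model for "t" and far from the one for "p", so "cat" scores higher than "cap". A large-vocabulary system would apply the same idea across thousands of words and would normally use phoneme-in-context models rather than the context-independent ones assumed here.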
One of the major trends in personal computing in recent years has been the increased use of smaller and often more portable computing devices.
Originally most personal computing was performed on desktop computers of the general type represented by FIG. 3. Then there was an increase in the use of smaller personal computers in the form of laptop computers, which are not shown in the drawings because laptop computers have roughly the same type of computational capabilities and user interface as desktop computers. Most current large-vocabulary speech recognition systems have been designed for use on such computers.
Recently there has been an increase in the use of new types of computers, such as the tablet computer shown in FIG. 4, the personal digital assistant computer shown in FIG. 5, the cell phones with increased computing power shown in FIG. 6, the wrist phone computers represented in FIG. 7, and the wearable computer indicated in FIG. 8, which provides a user interface with a screen and eye tracking and/or audio output from a head-wearable device.
Because of recent increases in computing power, such new types of devices can have computational power equal to that of the first desktops on which discrete large-vocabulary recognition systems were provided and, in some cases, as much computational power as was provided on the desktop computers that first ran large-vocabulary continuous speech recognition. The computational capacities of such smaller and/or more portable personal computers will only grow as time goes by.
One of the more important challenges involved in providing effective large-vocabulary speech recognition on ever more portable computers is that of providing a user interface that makes it easier and faster to create, edit, and use speech recognition on such devices.