The present invention relates to methods and apparatus for automatic recognition of speech.
Using a psycholinguistic approach to solve the problem of automatic speech recognition ("ASR"), a large vocabulary, speaker-independent, continuous speech recognizer that requires a minimal amount of processing power has been developed.
To date, the standard approach to speech recognition has been to maintain large files of speech spectra, or "templates"; in-coming speech (either words or phonemes) is converted into spectrograms and compared to the templates. If the in-coming speech matches a template, the system "recognizes" that bit of speech. Such an approach requires extensive amounts of memory to store the templates and extensive processing power to search them for a match. Since no two speakers sound exactly alike, their spectrograms do not look alike. Using such a template matching system, high recognition accuracy--especially for a vocabulary of more than about twelve words--almost demands speaker-dependence, i.e., the system has to be trained by the individual who wishes to use it.
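The cost of the template matching approach described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: it assumes each spectrogram has been reduced to a short fixed-length vector, uses Euclidean distance as the match criterion, and uses hypothetical labels, values, and a hypothetical threshold. The point it shows is that every stored template must be held in memory and visited on every lookup.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length spectral vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(spectrum, templates, threshold):
    """Return the label of the closest template, or None if none is close enough."""
    best_label, best_dist = None, float("inf")
    # Linear search: each stored template is compared against the input.
    for label, template in templates.items():
        d = distance(spectrum, template)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Hypothetical templates; a real system would store full spectrograms.
TEMPLATES = {
    "gear": [0.9, 0.1, 0.3],
    "flaps": [0.2, 0.8, 0.5],
}

result = recognize([0.85, 0.15, 0.3], TEMPLATES, threshold=0.5)
```

Because a second speaker's spectrum for the same word may fall outside the threshold of every stored template, a system of this kind tends to need templates recorded by each user, which is the speaker-dependence noted above.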
The present invention provides a speaker-independent, large vocabulary, continuous speech recognizer using relatively limited processing power. The embodiment described below in detail, although directed to use in an aircraft cockpit environment, is significant because it shifts from off-line phonetic recognition to only four-times real-time recognition of a 39-word vocabulary at a 95% phonetic recognition accuracy rate, and because it uses an approach that had heretofore been considered theoretically impossible. It will be appreciated that, although the embodiment described below is based on a 39-word vocabulary, the invention is not limited in vocabulary size. Other important features of the aircraft cockpit embodiment are:
continuous speech recognition;
speaker independent for most dialects of American English;
command grammar for cockpit functions;
grammar constrained to valid ranges;
incorrect speech noted; and
accuracy tested against digitized database.
The present invention is an innovative approach to the problem of speaker-independent, continuous speech recognition, and can be implemented using only the minimal processing power available on MACINTOSH IIx-type computers available from Apple Computer Corp. The present speaker-independent recognizer is flexible enough to adapt, without training, to any individual whose speech falls within a dialect range of a language, such as American English, that the recognizer has been programmed to handle. It will be appreciated that the present ASR system, which has been configured to run on a Motorola M68030 processor or above, can be implemented with any processor, and can have increased accuracy, dialect range, and vocabulary size compared to other systems. No other speech recognition technology that is currently available or that appears to be on the horizon has these capabilities.
The present invention encompasses the derivation of articulatory parameters from the acoustic signal, the analysis and incorporation of suprasegmental data, and the inclusion of a limited context-free grammar (as contrasted with the finite state grammars typically used in speech recognizers). All elements of this approach are dramatic refutations of limits previously held to be theoretically inviolate. The derivation of articulatory parameters is described in Applicant's U.S. Pat. No. 4,980,917, and limited context-free grammars are described in Applicant's U.S. Pat. No. 4,994,996. Both are hereby incorporated by reference. As described further below, phonetic output from a front-end based on the technology disclosed in the former patent is translated into words that can be handled by the grammar based on the technology disclosed in the latter patent.
The recognition accuracy of the aircraft cockpit embodiment described below was determined by creating a fixed database of real speech against which system performance could be accurately and quickly measured. Among other things, the establishment of a fixed database defined the dialect pool on which recognition could be fairly assessed, since random speakers testing the present recognizer in real-time could easily have--and frequently did have--dialect backgrounds that the system was not prepared to handle. The database provided a relatively quick means of assuring that modifications did not have a serious negative impact on previously well-handled words. In this way, the 95% level of accuracy was met in the aircraft cockpit embodiment on a database of speakers from the Midwest, Texas, California, the Pacific Northwest, the Rocky Mountain states, and parts of the South.
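The use of a fixed database to measure accuracy and to check that modifications do not harm previously well-handled words can be sketched as follows. This is only an illustration under stated assumptions: the database is represented as (utterance, expected transcription) pairs, the recognizers are stand-in lookup functions, and the utterance identifiers and command strings are hypothetical. Only the 95% target comes from the text above.

```python
TARGET_ACCURACY = 0.95  # accuracy level cited for the cockpit embodiment


def accuracy(database, recognize):
    """Fraction of database entries whose recognized output matches the expected one."""
    correct = sum(1 for utterance, expected in database if recognize(utterance) == expected)
    return correct / len(database)


def regressions(database, old_recognize, new_recognize):
    """Entries the old recognizer handled correctly but the modified one does not."""
    return [
        (utterance, expected)
        for utterance, expected in database
        if old_recognize(utterance) == expected and new_recognize(utterance) != expected
    ]


# Hypothetical stand-ins: identifiers take the place of digitized recordings,
# and dictionary lookups take the place of two recognizer versions.
DATABASE = [("utt1", "gear up"), ("utt2", "flaps ten")]
old_version = {"utt1": "gear up", "utt2": "flaps ten"}.get
new_version = {"utt1": "gear up", "utt2": "flaps two"}.get

broken = regressions(DATABASE, old_version, new_version)
meets_target = accuracy(DATABASE, new_version) >= TARGET_ACCURACY
```

Because the database is fixed, two runs differ only in the recognizer under test, so any entry returned by `regressions` can be attributed to the modification rather than to a change of speakers.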
Dialects not represented in the database included the Northeast (both New York and Boston) and the Deep South. Although a database comprising a range of speakers covering the area west of the Mississippi and south of the Mason-Dixon line was intended, the database actually obtained was weighted toward California and Texas residents. However, given the highly transient populations in Texas and California--everyone seems to have come from somewhere else--there were relatively few speakers in the database with true Texas and California accents (i.e., people who were born and grew up there).
The collection of data had originally been intended to duplicate as closely as possible the way that users of the recognizer would speak. Hence, many of the speakers in the original database were experienced pilots, thoroughly familiar with the command language used in the cockpit embodiment and accustomed to speaking rapidly. On the other hand, speakers in a demonstration environment may not be familiar with the command language, or may have expectations of permissible speaking rates that were set by other speech recognition technologies. Whatever the cause, it was noted that "demo speakers" spoke much more slowly than pilots, which could lead to straightforward recognizer adjustments to handle the slower speech.
In the aircraft cockpit embodiment described below in detail, the fixed database consisted of 45 commands, but because of an insufficient number of speech samples for some of the commands, 95% recognition accuracy was achieved on only 39 commands. Although memory per word is not in itself an overriding concern, the time required to search increased memory contributes directly to processing delays. It will be understood that the present invention is not limited in language or size of vocabulary, but as the vocabulary grows, the per-word memory requirement becomes increasingly important.