The present invention relates to methods and apparatus for automatic recognition of speech. Using a psycholinguistic approach to solve the problem of automatic speech recognition (ASR), a large vocabulary, speaker-independent, continuous speech recognizer that requires a minimal amount of processing power has been developed.
The standard approach to speech recognition in the prior art has been to maintain large files of speech spectra or "templates." Incoming speech (either words or phonemes) is converted into spectrograms and compared to the templates. If the speech matches a template, the system "recognizes" that bit of speech. That approach requires extensive amounts of computing space and processing power to store and search the templates for a match. Since no two speakers sound exactly alike, their spectrograms do not look alike. In a template matching system, high accuracy, especially for a vocabulary of over 12 words, almost demands speaker-dependence; the system has to be trained by the individual who wishes to use it.
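The prior-art template comparison described above can be illustrated with a minimal sketch. It is not taken from any particular prior-art system: it assumes each utterance has already been reduced to a fixed-length feature vector (a real spectrogram is a time-frequency array and would normally need time alignment), and the names `distance`, `recognize`, `TEMPLATES` and the numeric values are hypothetical.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def recognize(features, templates, threshold):
    """Return the label of the nearest stored template.

    Returns None when no template lies within `threshold`, i.e. the
    input is not "recognized".
    """
    best_label, best_dist = None, float("inf")
    for label, stored in templates.items():
        d = distance(features, stored)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None


# Hypothetical templates, as if recorded from one training speaker.
TEMPLATES = {
    "one": [0.9, 0.1, 0.3],
    "two": [0.2, 0.8, 0.5],
}
```

Every input is compared against every stored template, so storage and search cost grow with vocabulary size, and a speaker whose features differ from the stored vectors falls outside the threshold. Both effects correspond to the drawbacks noted above.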
The present invention demonstrates that it is technically possible to produce a speaker-independent, large vocabulary, continuous speech recognizer using relatively limited processing power. The embodiment described below is significant in two regards: it shifts from off-line phonetic recognition to only four-times real-time recognition of a 39-word vocabulary at a 95% phonetic accuracy rate, and it uses an approach that had heretofore been considered theoretically impossible.
In the case of vocabulary size, the input speech database of 45 words did not contain enough instances of some words to permit accurate recognition analysis. Other important features are:
95%+ phonetic recognition accuracy;
continuous speech;
speaker independent for most dialects of American English;
39 avionic vocabulary words;
command grammar for cockpit functions;
grammar constrained to valid ranges;
incorrect speech noted;
accuracy tested against digitized database.
The present invention is an extremely innovative approach to the problem of speaker-independent, continuous speech recognition. It is embodied in a near real-time system that achieves a 95% phonetic recognition accuracy rate on a 45-word vocabulary while using only the minimal processing power available on a Macintosh IIx. No other speech recognition technology that is currently available or that appears to be on the horizon has these capabilities.
The present invention encompasses the derivation of articulatory parameters from the acoustic signal, the analysis and incorporation of suprasegmental data, and the inclusion of a limited context-free grammar (as contrasted with the finite state grammars typically used in speech recognizers). All elements of this approach are dramatic refutations of limits previously held to be theoretically inviolate. The derivation of articulatory parameters is described in Applicant's U.S. Pat. No. 4,980,917; context-free grammars are described in Applicant's U.S. Pat. No. 4,994,966. Both are hereby incorporated by reference.
In the newest portion of the process, phonetic output from a front end is translated into words that can be handled by the grammar.
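One way such a phonetic-to-word translation could work is sketched below. This passage does not specify the translation method, so the sketch substitutes a simple greedy longest-match lookup in a pronunciation lexicon. The phoneme symbols, the lexicon entries, and the names `LEXICON` and `phonemes_to_words` are hypothetical and serve only as illustration.

```python
# Hypothetical pronunciation lexicon: phoneme tuple -> word.
LEXICON = {
    ("t", "uw"): "two",
    ("th", "r", "iy"): "three",
    ("t", "uw", "n"): "tune",
}


def phonemes_to_words(phonemes, lexicon=LEXICON):
    """Greedily map a phoneme sequence to words, longest match first.

    A phoneme that begins no lexicon entry is emitted as None so that
    a later stage (for example, the grammar) can flag it.
    """
    max_len = max(len(key) for key in lexicon)
    words = []
    i = 0
    while i < len(phonemes):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(phonemes) - i), 0, -1):
            key = tuple(phonemes[i:i + n])
            if key in lexicon:
                words.append(lexicon[key])
                i += n
                break
        else:
            words.append(None)  # no entry starts at this phoneme
            i += 1
    return words
```

A greedy match is the simplest choice and can mis-segment ambiguous sequences; a practical system would more likely score alternative segmentations against the grammar.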
The prescribed level of accuracy has been met on a very large database of speakers from the Midwest, Texas, California, the Pacific Northwest, the Rocky Mountain states, and parts of the South. Dialects not represented in the database include the Northeast (both New York and Boston) and the Deep South.
To measure the progress of recognition accuracy, it was essential to create a fixed database of files against which performance could be accurately and quickly measured. Among other things, the establishment of a fixed database defined the dialect pool on which recognition could be fairly assessed, since random speakers testing the recognizer in real-time could easily have, and frequently did have, dialect backgrounds that the system was not prepared to handle. The database provided a relatively quick means of assuring that algorithm changes did not have a serious negative impact on previously handled words.
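The use of a fixed database as a regression check can be sketched as follows. The scoring method is not defined in this passage; the sketch assumes that phonetic accuracy is computed as one minus the phoneme-level edit distance divided by the number of reference phonemes, and the names `edit_distance` and `phonetic_accuracy` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phonetic_accuracy(database, recognize):
    """Score a recognizer against a fixed database.

    `database` is an iterable of (signal, reference_phonemes) pairs and
    `recognize` maps a signal to a list of phonemes.
    """
    total = errors = 0
    for signal, reference in database:
        errors += edit_distance(reference, recognize(signal))
        total += len(reference)
    if total == 0:
        raise ValueError("database contains no reference phonemes")
    return 1.0 - errors / total
```

Running the same scoring over the same files after each algorithm change gives comparable numbers, so a drop in accuracy on previously handled words is visible immediately.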