This invention relates to speech understanding and in particular to a speech interpreter capable of real-time speech understanding.
1. Description of the Related Art
Real-time computer-based speech interpretation mimics two distinct human cognitive processes. First, a speech recognizer converts an incoming stream of sounds into a likely sequence of words. Next, a natural language processor attempts to make sense of these words, i.e., extract the meaning relevant to a specific user application from the sequence of words. Historically, due to computational limitations of general purpose computers, real-time speech analysis has been accomplished with the aid of elaborate dedicated hardware, such as digital signal processors. However, with recent increases in computational power of microprocessors, real-time speech interpretation of digitized speech can now be accomplished entirely by software running on a general purpose microprocessor-based computer.
FIG. 1 shows a conventional system for speech interpretation including a speech interpreter 100 and a target user application (not shown) which are primarily software programs and are both loaded into the memory (not shown) of a computer. Interpreter 100 includes a conventional speech recognizer 120, a conventional natural language (NL) processor 130 and an application command translator 140 coupled together in series. Speech interpreter 100 further includes an NL description compiler 135 coupled to NL processor 130 for converting an NL description into an NL grammar. A speech recognition (SR) grammar is provided to speech recognizer 120.
Speech interpreter 100 is also coupled to an acoustic input device 110. Acoustic input device 110 converts an input speech stream produced by a human user 105 into a machine readable form. In a digital computer implementation, acoustic input device 110 includes an analog-to-digital (A/D) converter for digitizing the speech. Next, using an internal finite-state machine (defined by the SR grammar) as a template, speech recognizer 120 converts the digitized speech into sequences of recognized words. NL processor 130 compares these sequences of words with internal patterns defined by the NL grammar from NL description compiler 135, recognizes permissible sentences, and produces corresponding summary structures representing the permissible sentences for application command translator 140. Translator 140 then matches the summary structures with a known set of user application commands, returning the appropriate command(s) to the user application (not shown) for further processing.
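The series data flow described above can be sketched as three chained stages. This is only an illustrative sketch, not the conventional system's implementation: the function names, the toy SR grammar, the shape of the summary structure, and the command strings are all hypothetical, and the recognizer stand-in takes already-tokenized text in place of digitized speech.

```python
from typing import Optional

# Hypothetical SR grammar: the only word sequences the recognizer may emit.
SR_GRAMMAR = {("open", "file"), ("close", "file")}

def recognize(tokens: list[str]) -> Optional[tuple[str, ...]]:
    """Stand-in for speech recognizer 120: accept only SR-grammar sequences."""
    seq = tuple(tokens)
    return seq if seq in SR_GRAMMAR else None

def interpret(words: tuple[str, ...]) -> Optional[dict[str, str]]:
    """Stand-in for NL processor 130: build a summary structure for a permissible sentence."""
    if len(words) == 2 and words[0] in {"open", "close"}:
        return {"action": words[0], "object": words[1]}
    return None

def translate(summary: dict[str, str]) -> str:
    """Stand-in for translator 140: match a summary structure to an application command."""
    return f"{summary['action'].upper()}_{summary['object'].upper()}"

def speech_interpreter(tokens: list[str]) -> Optional[str]:
    """Chain the three stages in series; any stage may reject the input."""
    words = recognize(tokens)
    if words is None:
        return None
    summary = interpret(words)
    if summary is None:
        return None
    return translate(summary)
```

In this sketch a sequence rejected by the recognizer stage never reaches the later stages, which mirrors the front-end filtering role that the SR grammar plays.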
Considering how facile the human listener is at understanding speech in real time, programming computers to perform speech interpretation would intuitively seem to be a trivial undertaking. However, in reality, programming computers to perform real-time interpretation of continuous speech from a large variety of speakers using a large vocabulary while maintaining a low error rate has proven to be a very elusive undertaking. Presently, commercially acceptable accuracy (greater than 90% of words correctly understood) has been achieved only by "training" the prior art speech recognizer 120 to work for a small set of human speakers, i.e., a speaker-dependent system, and/or by severely limiting the vocabulary and grammar.
The speech interpretation problem is two-fold. Both SR and NL processing of the complete vocabulary of the English language are extremely formidable undertakings. Testing acoustical input sounds against an unrestricted language with a large vocabulary is not a realistic SR undertaking. The concept of perplexity is used to quantify the degree of difficulty of the SR task, i.e., the problems of which sequences of words are permissible combined with the problems associated with a vocabulary size. Perplexity is defined as the average number of "equally likely" words a speech recognizer has to choose among at an average word boundary. Hence, perplexity, in the absence of probabilities of words or of sequences of words, is equal to the total vocabulary size.
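The relationship between perplexity and vocabulary size can be checked numerically. The sketch below assumes the common information-theoretic formulation, in which perplexity is 2 raised to the entropy (in bits) of the next-word distribution; that formula and the function name are assumptions not stated in the text above, and the sketch considers a single word boundary rather than an average over many.

```python
import math

def perplexity(probs):
    """Perplexity of one next-word distribution: 2 ** (entropy in bits)."""
    entropy = -sum(p * math.log2(p) for p in probs if p > 0)
    return 2 ** entropy

# With no probability information, every word is equally likely,
# so perplexity equals the vocabulary size.
vocabulary_size = 1000
uniform = [1 / vocabulary_size] * vocabulary_size

# A grammar or word statistics that favor some words lower the perplexity.
skewed = [0.5] + [0.5 / (vocabulary_size - 1)] * (vocabulary_size - 1)

print(perplexity(uniform))  # approximately the vocabulary size
print(perplexity(skewed))   # noticeably smaller than the vocabulary size
```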
Fortunately, in many computer-based user applications, the vocabulary, i.e., command set, is fairly limited and the grammar defining permissible word sequences can be quite rigid, typically yielding a perplexity of about 50. The rigidity associated with such a sub-set of the English language makes real time NL understanding a realistic computational task for these user applications running on a microprocessor-based computer. As a result, conventional SRs and NL processors are able to process speaker-independent speech in real time with an acceptable success rate by using SR and NL grammars which significantly constrain the search of possible words and sequences allowed in the language.
Because their functions are so disparate, SR and NL processing present different challenges. SR involves the transformation of the digitized speech into "likely" sequences of words, while NL processing involves extracting meaning for a given application from the (possibly erroneous) input sequences of words. Furthermore, SR involves "shallow" processing of massive amounts of data, whereas NL understanding typically involves "deep" processing of much more modest quantities of data.
Prior art speech recognizers use a separate SR grammar to constrain the search to those sequences of words which likely are permissible in an application so as to improve the recognition rate and accuracy of speech recognizer 120. Typically, an SR grammar defines a simple finite state structure which corresponds to a relatively small number of permissible word sequences without any concern or need for deciphering the meaning or sense of these sequences. Because a typical SR finite state structure is a simple and "shallow" structure, conventional programming tools for expressing the SR grammars are equally primitive. As such, writing an SR grammar for a speech recognizer resembles the task of coding in an assembler programming language: it is very laborious and tedious, and the output is not easily comprehensible. Further, in order to achieve the best space/time tradeoff and to minimize the perplexity, SR grammars require adherence to idiosyncratic restrictions to accommodate the syntactical constraints imposed by a particular speech recognizer.
For example, in a login sequence for a user application, a list of permissible user full names is limited to "Bob Jones", "Jane Smith", "Jane Jones" and "Pete Baker". A typical SR grammar rule for such an application defining the alternative full names can be codified as:
$fullname: (Bob Jones | Jane Smith | Jane Jones | Pete Baker);
(Note: A "$" symbol precedes an SR rule name and an SR rule element name.)
The above SR grammar rule is accurate but is very expansive and tedious to implement for a large set of user names. A sophisticated speech recognizer may accept a slightly more powerful and flexible version of the SR grammar rule such as:
$fullname: $firstname $lastname;
$firstname: (Bob | Jane | Pete);
$lastname: (Jones | Smith | Baker);
The problem with the second version is that it is too permissive, i.e., it does not have any means for constraining and eliminating invalid permutations of first and last names. As such, "Bob Smith" and "Pete Jones" will also be erroneously accepted by speech recognizer 120 and invalid full names will be forwarded to NL processor 130.
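The over-permissiveness of the factored rule can be made concrete by enumerating what each version accepts. The sketch below models the two SR grammar versions as plain Python sets; the variable names are hypothetical and this is not how a speech recognizer stores its grammar.

```python
from itertools import product

# First version: the four permissible full names, listed explicitly.
explicit = {
    ("Bob", "Jones"),
    ("Jane", "Smith"),
    ("Jane", "Jones"),
    ("Pete", "Baker"),
}

# Second version: any first name followed by any last name.
first_names = ["Bob", "Jane", "Pete"]
last_names = ["Jones", "Smith", "Baker"]
factored = set(product(first_names, last_names))

# Names the factored rule accepts although they are not permissible.
invalid = sorted(factored - explicit)

for first, last in invalid:
    print(first, last)
```

The factored rule accepts nine combinations, of which only four are valid; the remaining five, including "Bob Smith" and "Pete Jones", would be passed on to the NL processor.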
In contrast, NL processor 130 receives likely sequences of words (sentences) from speech recognizer 120 and compares each sentence with a plurality of internal NL patterns defined by the NL grammar in order to extract the "meaning" of the word sequence. This is accomplished by computing a phrase pattern for each sentence which corresponds to an NL pattern implicitly expressed by the underlying NL grammar definitions representing permissible phrase structures. For example, upon recognizing a word from a "name" list, an NL grammar rule may cause NL processor 130 to parse subsequent words searching for a word from a "verb" list.
These lists, e.g., "name", "verb", are expressed using a number of programming tools which simplify the generation of a description of an NL grammar. The tools include a lexicon (word library) defining permissible words and their features; these features are used for testing and for encoding meaning by the NL grammar. Another programming tool provides the capability for expressing a set of NL tests in an NL descriptive language, thereby creating a more perspicuous mechanism for expressing a rule restriction. Similarly, "actions" (assignments) associated with the NL grammar support the extraction of "meaning" from the word sequences. Because coding an NL description is accomplished using a broader and more powerful set of programming tools, it more nearly resembles coding in a high level programming language, a task which is easier to accomplish efficiently and accurately.
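The interplay of lexicon, tests, and actions might look roughly like the following sketch. Everything in it is hypothetical: the lexicon entries, the feature names "cat" and "meaning", and the single rule are invented for illustration and do not reproduce any particular NL descriptive language.

```python
# Hypothetical lexicon: each permissible word maps to its features.
LEXICON = {
    "bob": {"cat": "name"},
    "jane": {"cat": "name"},
    "calls": {"cat": "verb", "meaning": "call"},
    "emails": {"cat": "verb", "meaning": "email"},
}

def parse(words):
    """Hypothetical rule: after a 'name' word, search later words for a 'verb' word."""
    for i, word in enumerate(words):
        entry = LEXICON.get(word.lower())
        # Test: is this word on the "name" list?
        if entry is not None and entry["cat"] == "name":
            # Action: record the name in the summary structure.
            summary = {"subject": word}
            for later in words[i + 1:]:
                verb = LEXICON.get(later.lower())
                # Test: is a subsequent word on the "verb" list?
                if verb is not None and verb["cat"] == "verb":
                    # Action: encode the verb's meaning, not its surface form.
                    summary["action"] = verb["meaning"]
                    return summary
            return None  # a name was found but no verb followed it
    return None  # no name was found

print(parse(["Jane", "often", "emails"]))
```

Words that are absent from the lexicon, such as "often" here, are simply skipped, which is one way such a rule can tolerate inconsequential extra words.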
The above described conventional speech interpreter 100 using separately generated SR and NL grammars has several disadvantages. These disadvantages arise primarily from the crucial need to harmonize the SR and NL grammars to ensure that speech recognizer 120 will function "cooperatively" with NL processor 130.
One problem is the critical tradeoff between the "tightness" of the SR grammar (controlling the perplexity encountered by speech recognizer 120) and the performance of NL processor 130. If the SR grammar is too tight, i.e., the word sequence filtering too restrictive, then a substantial number of useful word sequences, such as slightly mispronounced words or sequences with inconsequential grammatical error(s), will be rejected by speech recognizer 120 and not reach NL processor 130. The unintended rejection of otherwise useful word sequences causes NL processor 130 either to receive a less likely word sequence or to lose a command or answer outright. As a result, the user application will have to request a repetition of the command or answer, eventually annoying human user 105.
Conversely, if the SR grammar is too "loose", then speech recognizer 120 is guessing without the constraints necessary for achieving good recognition results, causing NL processor 130 to receive too many incorrect word sequences. In effect, NL processor 130 is presented with an avalanche of "word salads". As a result, the performance of NL processor 130 is degraded since speech recognizer 120 has failed to function as an effective front end filter.
Another difficulty arises whenever a change is made to the vocabulary or grammar of the user application, e.g., adding a new command/user-name or a new allowable formalism/expression of a command. Such changes, no matter how trivial, must be made to both the SR and NL grammars. These simultaneous changes must be harmonized or there will be a breakdown in the internal operation of speech interpreter 100. This harmonization requirement is even more problematic because of the generally weaker conventional formalisms available for expressing the SR grammars.
As discussed above, the SR and NL grammars control differing processes, word sequence matching versus meaning extraction, i.e., speech recognizer 120 is involved with the low level process of word recognition while NL processor 130 is involved in the relatively higher level process of understanding commands and/or answers from human user 105. For example, NL processor 130 focuses on abstracting the meaning of a word or phrase, thereby effectively treating the word pairs "I am", "I is", "I are", "Me am", "Me is" and "Me are" as having the same meaning, even though only the first of these word pairs is grammatically correct.
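The word-pair example can be illustrated with a small sketch in which meaning extraction ignores grammatical agreement. The word sets and the returned tuple are hypothetical and serve only to show that all six pairs collapse to one meaning.

```python
# Hypothetical feature sets: surface forms that share one meaning.
FIRST_PERSON = {"i", "me"}
BE_FORMS = {"am", "is", "are"}

def meaning(pair):
    """Map a two-word phrase to an abstract meaning, ignoring agreement errors."""
    subject, verb = pair.lower().split()
    if subject in FIRST_PERSON and verb in BE_FORMS:
        return ("speaker", "be")
    return None

pairs = ["I am", "I is", "I are", "Me am", "Me is", "Me are"]
meanings = {meaning(p) for p in pairs}
print(meanings)
```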
As such, it is counter-intuitive to attempt to unify the SR and NL processes (and their respective grammars). Using a high level language versus an assembler language analogy, generally a programmer would not write code in a high level language if the syntax and the order of the eventual individual object code instructions were of particular concern. This is because when coding in a high level language, the programmer is primarily concerned with the function of a routine rather than its exact representation, trading off control of the details for ease of expression.
Hence, there is a need to unify and consolidate the two presently distinct processes of implementing the SR and NL grammars by adopting the use of programming tools presently only available for generating NL grammars. Such a unification would drastically reduce the programming effort required to implement and maintain a harmonized set of SR and NL grammars. More importantly, the resulting unified grammar will be more robust and effective.