1. Field of the Invention
This invention relates to spoken language interfaces, and more particularly to a spoken language processor containing a chart parser that incorporates rule and observation probabilities with stochastic unification grammars.
2. Description of the Related Art
It has been the goal of recent research to make machine understanding of spoken language possible through a tight coupling between speech and natural language systems. The difficulty posed by this coupling lies in trying to integrate statistical speech information with natural language grammars. Furthermore, speech systems have come to rely heavily on grammar constraints to accurately recognize connected speech.
Language modeling has become an essential element in high performance, speaker-independent, continuous speech systems. Until recently, speech recognition systems have primarily used Finite State Automatons (FSAs) as the language model. These models offer efficient processing, easily accommodate observation probabilities, and permit simple training techniques to produce transition probabilities. Attempts to model spoken language with FSAs have resulted in stochastic language models such as bigrams and trigrams. These models provide good recognition results when perplexity can be minimized, but preclude any direct support for spoken language systems by eliminating the semantic level.
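As an illustration of the simple training referred to above, the following is a minimal sketch of estimating bigram transition probabilities by relative frequency. The function name `train_bigram`, the boundary markers `<s>` and `</s>`, and the toy corpus are assumptions made for this example and do not come from any particular recognition system.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Estimate P(next | prev) by relative frequency (maximum likelihood)."""
    pair_counts = defaultdict(Counter)
    for words in sentences:
        # Hypothetical sentence-boundary markers, added for illustration.
        padded = ["<s>"] + list(words) + ["</s>"]
        for prev, nxt in zip(padded, padded[1:]):
            pair_counts[prev][nxt] += 1

    model = {}
    for prev, counter in pair_counts.items():
        total = sum(counter.values())
        model[prev] = {nxt: n / total for nxt, n in counter.items()}
    return model


# Toy corpus, invented for the example.
corpus = [["show", "flights"], ["show", "fares"], ["list", "flights"]]
model = train_bigram(corpus)
print(model["show"])
```

Such a model records only which word tends to follow which; it carries no syntactic or semantic structure, which is the limitation noted above.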
Language models have traditionally proven valuable in natural language systems, but only during the past decade have computationally-oriented, declarative grammar formalisms become widely available. These formalisms, generally known as unification grammars, offer great flexibility with respect to processing for both parsing and generation. Unification grammars have allowed the close integration of syntax, semantics, and pragmatics. These grammars are especially significant for spoken language systems because syntactic, semantic, and pragmatic constraints must be applied simultaneously during processing. Discourse and domain constraints can then limit the number of hypotheses to consider at lower levels, thereby greatly improving performance.
Several efforts have been made to combine Context-Free Grammars (CFGs) and unification grammars (UGs) with statistical acoustic information. Loosely coupled systems, such as bottom-up systems or word lattice parsers, have produced nominal results, primarily due to time alignment problems. Top-down constraints from CFGs have been integrated with speech using the Cocke-Younger-Kasami (CYK) algorithm, but this algorithm has poor (cubic) average time complexity. This complexity becomes especially detrimental with frame synchronous parsing, where the number of input hypotheses under consideration is typically large. For example, a person's average speech input is 4-5 seconds long, which corresponds to 400-500 frames. When cubed, those 400-500 frames yield 64,000,000 to 125,000,000 processing steps to recognize an input.
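The arithmetic in the preceding example can be checked with a short sketch. The rate of 100 frames per second is inferred from the figures in the text (4-5 seconds corresponding to 400-500 frames); the constant and function names are assumptions made for this example.

```python
# Inferred from the text: 4-5 s of speech corresponds to 400-500 frames.
FRAMES_PER_SECOND = 100


def cubic_steps(duration_seconds):
    """Number of steps for a cubic-time parser over the frames of one input."""
    frames = round(duration_seconds * FRAMES_PER_SECOND)
    return frames ** 3


for seconds in (4, 5):
    print(seconds, "s ->", cubic_steps(seconds), "steps")
```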
The first algorithm that had N^3 time complexity was the CYK algorithm. With natural language systems, N is traditionally the number of words; with speech systems, however, N is the number of frames, which are the fundamental time units used in speech recognition. The algorithm was significant in that it provided a time synchronous algorithm for speech recognition, which improved accuracy because the processor did not have to be concerned about how the words fit together. The CYK algorithm is very simple in that it resembles matrix multiplication. Unfortunately, it always takes N^3 time, rather than linear time, even when processing a regular grammar. Additionally, the CYK algorithm is exhaustive; it systematically expands everything whether or not it will be needed. Such an algorithm uses a great deal of processing time and memory space.
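The triple loop that gives CYK its matrix-multiplication flavor can be seen in the following minimal sketch of a textbook CYK recognizer over words. It assumes a grammar in Chomsky normal form; the function name, the dictionary representation of the rules, and the toy grammar are assumptions made for this example, and no probabilities or speech frames are modeled.

```python
def cyk_recognize(tokens, unary, binary, start="S"):
    """Return True if `start` derives `tokens` under a CNF grammar.

    unary:  dict mapping a terminal to the set of nonterminals A with A -> terminal
    binary: dict mapping a pair (B, C) to the set of nonterminals A with A -> B C
    """
    n = len(tokens)
    if n == 0:
        return False
    # table[i][j] holds every nonterminal that derives tokens[i..j] inclusive.
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][i] = set(unary.get(tok, ()))

    # Every span, start position, and split point is visited regardless of
    # whether the result is needed: this is the exhaustive cubic behavior.
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span - 1
            for k in range(i, j):
                for b in table[i][k]:
                    for c in table[k + 1][j]:
                        table[i][j] |= binary.get((b, c), set())
    return start in table[0][n - 1]


# Toy grammar, invented for the example:
#   S -> NP VP, VP -> V NP, NP -> she | fish, V -> eats
unary = {"she": {"NP"}, "fish": {"NP"}, "eats": {"V"}}
binary = {("NP", "VP"): {"S"}, ("V", "NP"): {"VP"}}
print(cyk_recognize("she eats fish".split(), unary, binary))
```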
Earley's algorithm (J. Earley, "An Efficient Context-Free Parsing Algorithm", Comm. of the ACM, Vol. 13, No. 2, Feb. 1970, pp. 94-102) is one of the most efficient parsing algorithms for written sentence input and can operate in linear time for regular grammars. It was one of the first parsing methods that used a central data structure, known as a chart, for storing all intermediate results during the parsing process of a sentence. Subsequently chart parsers were widely used in natural language systems for written input.
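The role of the chart can be illustrated with a minimal sketch of a textbook Earley recognizer for written input. It is not the parser of the present invention: it carries no probabilities, it assumes a grammar without empty productions, and the function name, state layout, and toy grammar are assumptions made for this example.

```python
def earley_recognize(tokens, grammar, start="S"):
    """Return True if `start` derives `tokens`.

    grammar: dict mapping a nonterminal to a list of right-hand sides,
             each a tuple of symbols. Symbols that are not keys of
             `grammar` are treated as terminals.
    Assumes no empty productions (a simplification for this sketch).
    """
    # chart[pos] stores states (lhs, rhs, dot, origin) ending at position pos.
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))

    for pos in range(len(tokens) + 1):
        agenda = list(chart[pos])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:
                    # Predictor: expand the nonterminal after the dot.
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, pos)
                        if new not in chart[pos]:
                            chart[pos].add(new)
                            agenda.append(new)
                elif pos < len(tokens) and tokens[pos] == sym:
                    # Scanner: the next input token matches the terminal.
                    chart[pos + 1].add((lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every state that was waiting for `lhs`.
                for plhs, prhs, pdot, porigin in list(chart[origin]):
                    if pdot < len(prhs) and prhs[pdot] == lhs:
                        new = (plhs, prhs, pdot + 1, porigin)
                        if new not in chart[pos]:
                            chart[pos].add(new)
                            agenda.append(new)

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


# Toy grammar, invented for the example.
grammar = {
    "S": [("NP", "VP")],
    "NP": [("she",), ("fish",)],
    "VP": [("V", "NP")],
    "V": [("eats",)],
}
print(earley_recognize("she eats fish".split(), grammar))
```

Unlike the exhaustive CYK table, states enter the chart only when a prediction, a scanned token, or a completed constituent calls for them.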
Due to the variability and ambiguity of a spoken input signal, modified algorithms were created to improve on Earley's algorithm and to adapt it to spoken language recognition. An example of such a modified algorithm is shown in A. Paeseler, "Modification of Earley's Algorithm for Speech Recognition", Proc. of NATO ASI, Bad Windsheim, 1987. Paeseler's algorithm combines probabilities in context-free grammars based on Earley's algorithm, but it does not do so optimally due to certain defects in the algorithm. One such defect involves the calculation of probabilities. For context-free grammars, a nonterminal symbol may rewrite to another nonterminal symbol without having to go through a terminal symbol. Probabilities can therefore arrive from many directions in the grammar. In order to progress in the parsing of the input, those subsequent symbols must also be expanded. But to expand those subsequent symbols under Paeseler's algorithm, the best probability must already be known; otherwise, if a better probability appears later, the parsing must be redone. This means potentially an exponential amount of work, which is highly undesirable.