1. Field of the Invention
The present invention relates to syntax analysis of text documents and, more specifically, to a method and computer program product for building an abstract syntax tree for a faster Earley parser.
2. Background Art
Parsers are computer programs that perform syntax analysis of text documents. Based on such analysis, the parsers are also capable of performing some useful operations on the documents, such as translating a document into a different language, extracting requested information from the documents, etc.
Parsers are typically built from a number of components (i.e., stages) that form a pipeline. The input document is processed by the parser components in a sequential order so that an output of a stage n becomes an input of a stage n+1. A typical set of components used in a general-purpose parser includes a lexical scanner (commonly referred to as a “lexer”), a grammar recognizer, and an abstract syntax tree (AST) builder.
Parsers have many applications in different branches of computing. They are essential parts of compilers and compiler generators, interpreters, data mining and artificial intelligence systems. Modern parsers have found their way into computational biology and genetics as well. What makes parsers so useful in many applications is their ability to detect and recognize an internal structure of a text. The meaning of a particular word can change radically depending on the word position within a sentence. A simple albeit not very practical example demonstrates this structural dependency:
1. Book that flight;
2. Order this book.
The two rather simple sentences contain the same word (book) in different positions. Both sentences have the same grammatical structure that can be described as follows:
Sentence: Verb Preposition Object
A formal description of an acceptable sentence structure, such as the one above, is referred to as a grammar in the literature on formal (or mathematical) linguistics. The point of this simple example is that the meaning of the word “book” changes entirely depending on its position in that grammatical structure (Verb or Object). Assuming that the parser is a part of an automated translator from English to other languages, the ability of the system to translate depends on its ability to match the input sentence (“Book that flight”) with one of the acceptable grammar structures (Verb Preposition Object).
When a sentence is submitted to the parser program, it first enters the lexical scanner (or a lexer). The lexer splits the input sentence into atomic pieces (i.e., words, lexemes) based on a set of rules defined (usually by a programmer) for a particular task. The lexer also determines which part of speech each lexeme belongs to. The output of the lexer is a sequence of pairs (part of speech, value) that preserves the original order of words in the input sentence. In the exemplary case above, the output produced by the lexer can be as follows: (Verb, book), (Preposition, that), (Object, flight).
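The lexer stage described above can be sketched in Python as follows. This is a minimal illustration only: the `LEXICON` table, the `lex` function, and the position-based rule used to decide whether “book” is a Verb or an Object are assumptions made for this example and are not taken from any particular lexer.

```python
# Hypothetical lexicon covering only the two example sentences; in practice
# the lexer rules are defined by a programmer for a particular task.
LEXICON = {
    "Verb": {"book", "order"},
    "Preposition": {"that", "this"},
    "Object": {"flight", "book"},
}


def lex(sentence):
    """Split a sentence into (part of speech, value) pairs, preserving word order."""
    tokens = []
    words = sentence.lower().rstrip(".;").split()
    for position, word in enumerate(words):
        candidates = [pos for pos, entries in LEXICON.items() if word in entries]
        if not candidates:
            raise ValueError(f"unknown word: {word!r}")
        if len(candidates) > 1:
            # Assumed disambiguation rule for this sketch: an ambiguous word is
            # a Verb at the start of the sentence and an Object elsewhere.
            preferred = "Verb" if position == 0 else "Object"
            candidates = [c for c in candidates if c == preferred] or candidates
        tokens.append((candidates[0], word))
    return tokens


print(lex("Book that flight"))
print(lex("Order this book."))
```

The same word “book” receives a different part of speech in each sentence, which is the structural dependency that the later stages rely on.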
The recognizer's task is to read the sequence that it received from the lexer and to determine whether the input sentence (“book that flight”) satisfies the grammar structure for the sentence, that is, whether the input sentence is accepted by the grammar. If the input sentence is found to be grammatically correct, it is passed further to the next stage (a parse tree builder). Otherwise, the recognizer rejects the sentence and reports an error.
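For the single flat rule used in this example, the accept-or-reject decision can be sketched as below. The `GRAMMAR` table and the `recognize` function are hypothetical names introduced for illustration, and the sketch assumes that every rule is a flat sequence of parts of speech; a general context-free grammar requires a real recognition algorithm.

```python
# Hypothetical grammar table: one start symbol with one flat rule,
# mirroring "Sentence: Verb Preposition Object" from the example.
GRAMMAR = {"Sentence": [["Verb", "Preposition", "Object"]]}


def recognize(tokens, start="Sentence"):
    """Return True if the parts of speech in `tokens` match a rule for `start`."""
    categories = [category for category, _value in tokens]
    return any(categories == rule for rule in GRAMMAR[start])


accepted = recognize([("Verb", "book"), ("Preposition", "that"), ("Object", "flight")])
rejected = recognize([("Object", "flight"), ("Verb", "book")])
print(accepted, rejected)
```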
The tree builder produces a data structure that is equivalent to a tree that consists of nodes (grammatical categories) and edges (relations between the nodes). For the above example, the syntax tree is depicted in FIG. 1. Note that the industry standard parsers are unable to handle ambiguity inherent to an arbitrary context-free grammar. The syntax trees preserve both recognized structure and values of the input text. This makes it possible to translate the exemplary sentence to German as shown in FIG. 2. However, the words for “to book” and “a book” are different in German. Knowing the structure of the sentence allows for a correct translation of the second sentence (Order this book), as shown in FIG. 3.
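A tree of this kind can be represented, for instance, as nested nodes that keep both the grammatical category and the original word. The `Node` class and the `build_tree` and `leaves` helpers below are assumptions made for this sketch and do not reproduce the figures; the sketch only covers the flat Sentence rule of the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    category: str                       # grammatical category, e.g. "Verb"
    value: str = ""                     # word from the input text (leaves only)
    children: list = field(default_factory=list)


def build_tree(tokens, root="Sentence"):
    """Attach one leaf per (part of speech, value) pair under a single root node."""
    return Node(root, children=[Node(category, value) for category, value in tokens])


def leaves(node):
    """Return the (category, value) pairs of the leaves in left-to-right order."""
    if not node.children:
        return [(node.category, node.value)]
    return [leaf for child in node.children for leaf in leaves(child)]


first = build_tree([("Verb", "book"), ("Preposition", "that"), ("Object", "flight")])
second = build_tree([("Verb", "order"), ("Preposition", "this"), ("Object", "book")])
print(leaves(first))
print(leaves(second))
```

Because each leaf carries its category, a later stage can tell the verb “book” in the first tree from the noun “book” in the second, which is what a translator needs in order to choose between different target-language words.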
Jay Earley proposed a method in his dissertation in 1970 that allows for recognizing texts that belong to very complex grammars approaching natural languages (as opposed to programming languages, which have a relatively simple grammatical structure). The Earley method combines the power to recognize complex and often ambiguous sentences with processing speed. It is also relatively easy to implement. All this makes the Earley method an ideal candidate for text processing applications that deal with complex and unstructured (often human generated) textual data. The Earley parser is the Earley recognizer combined with an AST builder. Jay Earley himself proposed the original abstract syntax tree (AST) algorithm for this method. Construction of parse trees in the original Earley parser is done after the recognition is completed, based on the information collected and retained by the recognizer.
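For orientation, a textbook-style Earley recognizer (recognition only, with no tree construction) can be sketched as follows. This is a simplified illustration rather than the method of the present invention: the function name `earley_recognize`, the grammar encoding as a dictionary of tuples, and the restriction to grammars without empty right-hand sides are all assumptions of the sketch.

```python
def earley_recognize(grammar, start, tokens):
    """Return True if the sequence `tokens` can be derived from `start`.

    `grammar` maps each nonterminal to a list of right-hand sides (tuples of
    symbols). Any symbol that is not a key of `grammar` is a terminal.
    Assumption of this sketch: no right-hand side is empty.
    """
    # chart[i] holds items (lhs, rhs, dot, origin) valid after reading i tokens.
    chart = [[] for _ in range(len(tokens) + 1)]

    def add(position, item):
        if item not in chart[position]:
            chart[position].append(item)

    for rhs in grammar[start]:
        add(0, (start, rhs, 0, 0))

    for i in range(len(tokens) + 1):
        # chart[i] may grow during the loop; newly added items are visited too.
        for lhs, rhs, dot, origin in chart[i]:
            if dot < len(rhs):
                symbol = rhs[dot]
                if symbol in grammar:
                    # Predictor: expand the nonterminal that follows the dot.
                    for production in grammar[symbol]:
                        add(i, (symbol, production, 0, i))
                elif i < len(tokens) and tokens[i] == symbol:
                    # Scanner: the next token matches the expected terminal.
                    add(i + 1, (lhs, rhs, dot + 1, origin))
            else:
                # Completer: advance every item that was waiting for `lhs`.
                for p_lhs, p_rhs, p_dot, p_origin in chart[origin]:
                    if p_dot < len(p_rhs) and p_rhs[p_dot] == lhs:
                        add(i, (p_lhs, p_rhs, p_dot + 1, p_origin))

    return any(
        lhs == start and dot == len(rhs) and origin == 0
        for lhs, rhs, dot, origin in chart[len(tokens)]
    )


sentence_grammar = {"Sentence": [("Verb", "Preposition", "Object")]}
ambiguous_grammar = {"E": [("E", "+", "E"), ("n",)]}
print(earley_recognize(sentence_grammar, "Sentence", ["Verb", "Preposition", "Object"]))
print(earley_recognize(ambiguous_grammar, "E", ["n", "+", "n", "+", "n"]))
```

The second grammar is ambiguous and left-recursive, yet the same procedure handles it without modification. The items retained in the chart are the information from which a tree builder can work after recognition is completed.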
As mentioned above, the Earley parsing method is powerful enough to successfully recognize texts that belong to any context-free grammar. However, the amount of run-time processing consumed while recognizing an input text is rather large compared to less powerful but faster table-driven methods. Different variations of the Earley method that incorporate table-driven techniques while preserving the parsing power of the original algorithm have been developed. One particular approach, developed by McLean and Horspool, is adopted as a foundation of the present invention. The authors named their method Left Recursive Earley (LRE). The LRE combines state pre-computation, a technique that comes from LR parsers, with the Earley recognizer.
The efficiency of the parser pipeline is determined by the least efficient stage. Therefore, finding a way to improve the efficiency of just one component can significantly increase the efficiency of the entire system. Typically, an AST builder takes longer than the other stages.
Accordingly, there is a need in the art for a more efficient AST builder that uses a faster Earley parser technique.