1. Field of the Invention
The present invention generally relates to methods and devices for language parsing, and particularly relates to a language parsing method and device which extract, from terminal symbol strings, constituents defined by a context-free grammar.
2. Description of the Prior Art
In syntactic parsing of natural languages, syntactic ambiguity arises as a problem. For example, there are two different interpretations of the single English phrase `A of B of C`, i.e., ((A of (B of C))) or (((A of B) of C)). Here, brackets indicate constituents (syntactic groups of words). When a phrase has more than one syntactic structure, as in this example, the alternative structures are called syntactic ambiguities. As the length of a sentence increases, the number of syntactic ambiguities grows rapidly. The problem is that, when syntactic parsing derives all possible structures for a long sentence, the processing time and the memory volume required become enormous because of the syntactic ambiguities.
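The growth in the number of ambiguities can be illustrated with a short sketch. It assumes, purely for illustration, that every `of` joins exactly two constituents, so that the count depends only on the number of nouns in the phrase; the function name `count_bracketings` is hypothetical and does not come from the cited references.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_bracketings(n):
    """Count the binary bracketings of a phrase with n nouns joined by `of`.

    Illustrative assumption: each `of` combines exactly two constituents.
    """
    if n <= 1:
        return 1  # a single noun has exactly one structure
    # Split the phrase into a left part of k nouns and a right part of n - k nouns.
    return sum(count_bracketings(k) * count_bracketings(n - k) for k in range(1, n))


if __name__ == "__main__":
    for n in (3, 4, 10):
        print(n, count_bracketings(n))
```

Under this assumption, the three-noun phrase `A of B of C` has 2 structures, a four-noun phrase has 5, and a ten-noun phrase already has 4862.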
Methods of extracting all phrase structures as efficiently as possible have been proposed for grammars, such as those of natural languages, which involve syntactic ambiguities. One such algorithm is proposed in Efficient Parsing for Natural Language by M. Tomita, Kluwer Academic Publishers, 1985, p. 33. The algorithm of this reference augments LR parsing, which was developed for programming languages having no syntactic ambiguities, so that it can be used for natural languages, which do involve syntactic ambiguities. This augmented LR parsing is called general LR parsing, and can carry out parsing more efficiently than Earley's algorithm or chart parsing, neither of which uses an LR table.
As for LR parsing, reference may be made to Basics of Natural Language Parsing by Hozumi Tanaka, Sangyo-Tosyo Publishers, pp. 83-104, 1989. LR parsing scans a sentence from the left (beginning of the sentence) to the right (end of the sentence) by using a stack, and carries out deterministic parsing by applying shift operations and reduce operations to the stack, while looking up information obtained from the stack state and from k lookahead words. The letter L in LR parsing stands for scanning a sentence from left to right, and the letter R stands for rightmost derivation.
In LR parsing, an LR table (LR parse table) is generated from a given LR grammar. The LR table is divided into two parts. One part is an ACTION part, which defines the state to be selected for each shift operation and the rule to be used for each reduce operation. The other part is a GOTO part, which defines the state to be selected after each reduce operation.
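The roles of the ACTION part and the GOTO part described above can be sketched as follows. This is a minimal illustration of deterministic LR parsing with one lookahead symbol, not the method of the cited references. The toy grammar (rule 1: E -> E + n, rule 2: E -> n), the hand-built table, and the names `RULES`, `ACTION`, `GOTO`, and `parse` are all assumptions made for this example.

```python
# Toy grammar (assumed for illustration): rule 1: E -> E + n, rule 2: E -> n
# Each rule maps to (left-hand side, length of right-hand side).
RULES = {1: ("E", 3), 2: ("E", 1)}

# ACTION part: (state, lookahead) -> shift to a state, reduce by a rule, or accept.
ACTION = {
    (0, "n"): ("shift", 2),
    (1, "+"): ("shift", 3),
    (1, "$"): ("accept", None),
    (2, "+"): ("reduce", 2),
    (2, "$"): ("reduce", 2),
    (3, "n"): ("shift", 4),
    (4, "+"): ("reduce", 1),
    (4, "$"): ("reduce", 1),
}

# GOTO part: (state exposed after a reduce, nonterminal) -> next state.
GOTO = {(0, "E"): 1}


def parse(tokens):
    """Return the list of rules applied, or None if the input is rejected."""
    stack = [0]                    # stack of states, starting in state 0
    symbols = list(tokens) + ["$"]  # "$" marks the end of the sentence
    pos = 0
    applied = []
    while True:
        entry = ACTION.get((stack[-1], symbols[pos]))
        if entry is None:
            return None            # no table entry: syntax error
        kind, arg = entry
        if kind == "shift":
            stack.append(arg)      # push the new state and consume one word
            pos += 1
        elif kind == "reduce":
            lhs, length = RULES[arg]
            del stack[-length:]    # pop one state per right-hand-side symbol
            stack.append(GOTO[(stack[-1], lhs)])
            applied.append(arg)
        else:
            return applied         # accept


if __name__ == "__main__":
    print(parse(["n", "+", "n"]))
    print(parse(["n", "n"]))
```

For the input `n + n`, this sketch applies rule 2 and then rule 1; for `n n`, it finds no table entry and rejects the input. Because every table cell here holds at most one action, a single stack suffices. A grammar with syntactic ambiguities produces cells with several actions, which a single stack of this kind cannot follow.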
In the prior art method, however, a stack called a graph-structured stack is used, which complicates the mechanism of the method. Also, the amount of data stored in this data structure during parsing becomes too large to be ignored. Furthermore, depending on the type of language to be analyzed, data of this data structure must be created and purged frequently. This reduces efficiency in terms of processing time.
Accordingly, there is a need in the field of syntactic parsing for a method and a device which can carry out efficient and speedy language parsing by using an LR table, with a simple mechanism and a small memory volume, irrespective of the type of language.