The present invention relates to apparatus and methods for performing morphological analysis of natural language text using a computer.
Morphological analysis is an important procedure in computer-based natural language processing. It analyzes natural language text such as a sentence (hereinafter referred to as an input sentence or simply a sentence), decomposes it into words, and estimates the part of speech of each word.
Morphological analysis is broadly performed in three steps: retrieving word candidates, retrieving a concatenation rule for the part of speech corresponding to each word candidate, and searching for an optimal solution with respect to the retrieved word candidates and concatenation rules.
Herein, the retrieval of word candidates is a process of segmenting an input sentence into appropriate character strings, collating each character string with a prepared word dictionary (database), and acquiring the words (word candidates) that can make up the input sentence.
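The retrieval of word candidates can be illustrated with a minimal sketch. It assumes a toy in-memory dictionary and a romanized, unsegmented input string. The names `WORD_DICT` and `retrieve_word_candidates` are hypothetical, and the part of speech given for “da” is an assumption made only so that the whole sentence is covered. A practical system would typically use an indexed dictionary rather than testing every substring.

```python
# Toy word dictionary: surface form -> possible parts of speech.
# The entries for "nanpo", "ha" and "natsu" follow the example sentence;
# the part of speech for "da" is an assumption made for illustration.
WORD_DICT = {
    "nanpo": ["noun"],
    "ha": ["particle", "noun"],
    "natsu": ["noun"],
    "da": ["auxiliary"],
}


def retrieve_word_candidates(text, dictionary):
    """Return (start, end, surface, pos) for every dictionary word in text.

    Every substring is collated with the dictionary; this brute-force
    scan is for illustration only.
    """
    candidates = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            surface = text[start:end]
            for pos in dictionary.get(surface, []):
                candidates.append((start, end, surface, pos))
    return candidates


# The input sentence written without word boundaries.
candidates = retrieve_word_candidates("nanpohanatsuda", WORD_DICT)
for candidate in candidates:
    print(candidate)
```

Each candidate keeps its character offsets so that a later step can check which candidates are adjacent in the input sentence.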
The retrieval of the concatenation rule for the part of speech is a process of acquiring the concatenation rule possibly used for the concatenation of each word by referring to a prepared grammar dictionary (database) based on the part of speech of each word. The part of speech may depend on the context in which the word candidate is used. Concatenation rules for all the parts of speech (part of speech candidates) that can be taken by the words obtained by retrieval of word candidates are thus acquired.
The search for an optimal solution is a process of combining the words in accordance with the concatenation rules, and selecting the optimal word string to make up the input sentence.
Usually, a sentence gives rise to a plurality of “word candidates” and “part of speech candidates.” Generally, the process for dealing with candidates can be treated as a state transition, whereby a morphological analysis system using the “Non-deterministic Finite-state Automaton (NFA)” is implemented. Such an approach is described by Hiroshi Maruyama and Shiho Ogino in “Japanese Morphological Analysis Based On Regular Grammar,” Information Processing Society of Japan, Japan Information Processing Society, July, 1994, Vol. 35, No. 7, p. 1293-1299.
FIG. 11 is a diagram showing a morphological analysis process with the conventional morphological analysis system.
As an example, morphological analysis of the input sentence “Nanpo ha natsu da,” which in English reads “It is summer in the south,” will be considered.
First, the retrieval of word candidates is performed, thereby acquiring the word candidates “nanpo (noun)”, “ha (particle)”, and “natsu (noun).”
Then, the concatenation rule for the part of speech is retrieved, whereby the concatenation rule corresponding to the part of speech for each word candidate is acquired. In the example, the concatenation rules “noun: clause head→noun tail” and “noun: prefix→noun tail” for noun, and “particle: particle head→case particle” for particle, are acquired. The concatenation rule “noun: clause head→noun tail,” for example, means that when a predetermined noun exists in the sentence, a concatenation from clause head before the noun to noun tail after the noun is permitted. Herein, the clause head and the noun tail denote the states (nodes) used when the search for an optimal solution is performed with the NFA. That is, under the concatenation rule “noun: clause head→noun tail,” the NFA corresponding to the input sentence transits to the state “noun tail” if a noun is input in the state “clause head.”
Also, the NFA for searching for the optimal solution may make a null transition (ε transition) from a certain state to another state without consuming an input character. For example, when in the state “noun tail,” the NFA can transit to the state “particle head” through the null transition, whereby a particle can be concatenated after the noun. Accordingly, the information regarding this null transition is also stored in the grammar dictionary as a concatenation rule for the part of speech, and is acquired together with the other rules when the concatenation rules for the part of speech are retrieved. In the example of FIG. 11, the concatenation rules for the noun tail “null: noun tail→particle head” and “null: noun tail→numeral head” are acquired.
To search for the optimal solution after the retrieval of words and concatenation rules, the NFA is created as shown in FIG. 11. In the example, the NFA transits from state “clause head” to state “noun tail” in response to input of the noun “nanpo.” From state “noun tail,” the NFA can then transit to “numeral head,” “particle head,” or “English head” through the null transition, and further from state “numeral head” to “Chinese numeral,” whereby all the possible states are generated. Since the particle “ha” follows the word “nanpo” in the actual input sentence, the state “particle head” is selected by input of the particle “ha,” and the NFA transits to the state “case particle.”
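The state generation described for FIG. 11 can be sketched as a small NFA simulation with null transitions. This is only an illustration under stated assumptions: the names `RULES`, `NULL_RULES`, `null_closure` and `step` are hypothetical, and the move from “numeral head” to “Chinese numeral” is assumed here to be a null transition, which the text does not state explicitly.

```python
# Labelled concatenation rules: (part of speech, source state) -> target states.
RULES = {
    ("noun", "clause head"): {"noun tail"},
    ("noun", "prefix"): {"noun tail"},
    ("particle", "particle head"): {"case particle"},
}

# Null (epsilon) transitions: source state -> target states.
NULL_RULES = {
    "noun tail": {"particle head", "numeral head", "English head"},
    "numeral head": {"Chinese numeral"},  # assumed to be a null transition
}


def null_closure(states):
    """Return all states reachable from `states` via null transitions only."""
    closure = set(states)
    stack = list(states)
    while stack:
        state = stack.pop()
        for target in NULL_RULES.get(state, ()):
            if target not in closure:
                closure.add(target)
                stack.append(target)
    return closure


def step(states, pos):
    """Consume one word with part of speech `pos`, then follow null transitions."""
    moved = set()
    for state in states:
        moved |= RULES.get((pos, state), set())
    return null_closure(moved)


start = null_closure({"clause head"})
after_nanpo = step(start, "noun")         # input of the noun "nanpo"
after_ha = step(after_nanpo, "particle")  # input of the particle "ha"

print(sorted(after_nanpo))
print(sorted(after_ha))
```

In this sketch five states are active after “nanpo,” but only “particle head” has a transition for the following particle, so the remaining states contribute nothing once “ha” is input.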
As described above, in the morphological analysis, the non-deterministic finite-state automaton (NFA) can be used to search for the optimal solution from the word candidate obtained and the concatenation rule for the part of speech.
However, in an NFA, the next state is generally not uniquely determined for a given state and a given input. Thus, the number of search paths tends to increase. Accordingly, morphological analysis systems using the NFA are slow.
To improve the processing speed of a morphological analysis system using an NFA, it is common practice to preprocess the concatenation rules for the parts of speech so as to remove the null transitions from the NFA. In the example of FIG. 11, the transition from state “clause head” to state “particle head” can be known in advance simply by referring to the concatenation rules for the parts of speech before processing starts. Thus, the concatenation rule “noun: clause head→noun tail” (a non-null transition) and the concatenation rule “null: noun tail→particle head” (a null transition) are integrated beforehand into the single rule “noun: clause head→particle head,” which enables equivalent processing without handling the null transition. This transformation for removing null transitions from an NFA is widely known and is also employed in fields other than morphological analysis.
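The integration of null and non-null rules can be sketched as a preprocessing pass over the rule tables. This is a minimal illustration of the standard null-transition removal, not the procedure of any particular system: the function name `remove_null_transitions` and the table layout are hypothetical, and the “numeral head”→“Chinese numeral” entry is assumed to be a null transition. The sketch also assumes that the start state itself has no outgoing null transition, as is the case for “clause head” here.

```python
def remove_null_transitions(rules, null_rules):
    """Fold null transitions into the labelled rules.

    Each labelled rule (pos, source) -> target is extended so that it also
    leads to every state reachable from `target` through null transitions.
    """
    def closure(state):
        seen = {state}
        stack = [state]
        while stack:
            current = stack.pop()
            for target in null_rules.get(current, ()):
                if target not in seen:
                    seen.add(target)
                    stack.append(target)
        return seen

    merged = {}
    for (pos, source), targets in rules.items():
        expanded = set()
        for target in targets:
            expanded |= closure(target)
        merged[(pos, source)] = expanded
    return merged


rules = {
    ("noun", "clause head"): {"noun tail"},
    ("noun", "prefix"): {"noun tail"},
    ("particle", "particle head"): {"case particle"},
}
null_rules = {
    "noun tail": {"particle head", "numeral head", "English head"},
    "numeral head": {"Chinese numeral"},  # assumed to be a null transition
}

merged = remove_null_transitions(rules, null_rules)
# The merged table now contains "noun: clause head -> particle head" directly.
print(sorted(merged[("noun", "clause head")]))
```

The merged table removes the run-time cost of following null transitions, but each noun input still fans out to the same set of states, so the number of generated states is unchanged.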
As is apparent from FIG. 11, however, most of the states generated in dealing with the “concatenation rule for the part of speech” are discarded at the point of time when the word is actually input in the morphological analysis. In the example of FIG. 11, most of the states “noun tail,” “numeral head,” “Chinese numeral,” and “English head” are discarded because there is no possibility of continuing transition at the point of time when the word “ha” following the word “nanpo” is input. Thus, it is inefficient to generate many states that may be discarded immediately after generation.
Also, to search for the optimal solution using the NFA, a collation operation is performed successively for each word and each concatenation rule: the concatenation rule between predetermined words is retrieved and actually input into the NFA to check whether or not it is acceptable. For example, when the optimal concatenation rule for the word “ha” following the word “nanpo” is searched for in FIG. 11, the concatenation rules for both particle and noun are input into the NFA for the collation operation, because the word “ha” has concatenation rules not only as a particle but also as a noun. Therefore, when there are many corresponding concatenation rules, the number of collation operations increases, and the search for the optimal solution becomes very burdensome.
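The cost of the collation operation can be made concrete with a small counting sketch. It assumes the rule set of the FIG. 11 example and the set of states that are active after “nanpo” has been input. The names `RULES_BY_POS` and `collate` are hypothetical, and the counter only illustrates how the work grows with the number of part of speech candidates and rules.

```python
# Concatenation rules grouped by part of speech: (source state, target state).
RULES_BY_POS = {
    "noun": [("clause head", "noun tail"), ("prefix", "noun tail")],
    "particle": [("particle head", "case particle")],
}


def collate(current_states, pos_candidates):
    """Try every rule of every part of speech candidate against the NFA states.

    Returns the number of collation attempts and the list of accepted rules.
    """
    attempts = 0
    accepted = []
    for pos in pos_candidates:
        for source, target in RULES_BY_POS.get(pos, []):
            attempts += 1
            if source in current_states:
                accepted.append((pos, source, target))
    return attempts, accepted


# States active after the noun "nanpo" has been input (per the FIG. 11 example).
states_after_nanpo = {
    "noun tail", "particle head", "numeral head", "English head", "Chinese numeral",
}

# The word "ha" can be a particle or a noun, so the rules for both are collated.
attempts, accepted = collate(states_after_nanpo, ["particle", "noun"])
print(attempts, accepted)
```

Here three rules are collated for a single word and only one is accepted; with a realistic grammar dictionary the number of rejected attempts per word would be correspondingly larger.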