1. Field of the Invention
This invention relates to a method and system for morphological analysis and in particular to a morphological look-up. Morphological analysis represents the basic enabling technology for many kinds of text processing. Recognition of word forms is the first step towards part-of-speech tagging, parsing, translation and other high-level applications. The method and system are applicable to all natural languages.
2. Description of Related Art
Many natural language processing tasks require the morphological analysis of all running tokens in an input document. By means of morphological analysis, any information that is encoded in a word can be extracted and output in order to present it to later layers of text processing. This stream of morphological analyses can be extracted from a raw document by using the standard architecture which is depicted in FIG. 4.
Before any text processing can be performed on a raw document 400, the text must be broken up into distinct and meaningful units. This procedure is called tokenizing, and each meaningful unit, or token, is in general delimited from other meaningful units by a particular character or other symbol. A description of a possible method and apparatus for tokenizing text is given, for instance, in U.S. Pat. No. 5,721,939. The tokenizing procedure is symbolized in FIG. 4 by the tool tokenize 402, which can be a linguistic service of XeLDA (Xerox Linguistic Development Architecture) developed by Xerox Corporation. This module uses a first finite state transducer 404 built from declarative specifications of the transformation to be done in the tokenizing procedure. As a result, the raw document 400 is transformed into a token stream 406. In a next step, each token of the token stream 406 is subjected to a morphological look-up procedure performed by the look-up tool 408. The morphological analysis is done by means of a second finite state transducer 410. As a result, a stream of morphological analyses 412 is generated, which can be the basis of a further natural language processing task. The technology underlying the operations shown in FIG. 4 has been developed at Xerox PARC and XRCE and is described in numerous publications and patents.
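The two-stage architecture of FIG. 4 can be illustrated by the following minimal sketch. It is not the XeLDA implementation: a regular expression stands in for the first finite state transducer 404, a small dictionary stands in for the second finite state transducer 410, and the names `TOY_LEXICON`, `tokenize`, `look_up` and `analyze`, as well as the tag format of the analyses, are hypothetical and chosen only for illustration.

```python
import re
from typing import Dict, Iterator, List

# Hypothetical toy lexicon standing in for the morphological transducer 410.
TOY_LEXICON: Dict[str, List[str]] = {
    "dogs": ["dog+Noun+Pl"],
    "bark": ["bark+Verb+Pres", "bark+Noun+Sg"],
}


def tokenize(raw_document: str) -> Iterator[str]:
    """Stand-in for tokenize 402: yield word and punctuation tokens."""
    return (m.group(0) for m in re.finditer(r"\w+|[^\w\s]", raw_document))


def look_up(token: str) -> List[str]:
    """Stand-in for look-up 408: return every analysis of a single token."""
    return TOY_LEXICON.get(token.lower(), [token + "+Unknown"])


def analyze(raw_document: str) -> Iterator[List[str]]:
    """Raw document 400 -> token stream 406 -> stream of analyses 412."""
    for token in tokenize(raw_document):
        yield look_up(token)


analyses = list(analyze("Dogs bark."))
```

In this sketch every token of the stream is passed to `look_up`, whether or not the same token has been analyzed before, which mirrors the standard architecture described above.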
In the ideal case, both tokenize 402 and look-up 408 can be quite fast, requiring essentially only one or a few table look-up operations per input character. However, the analysis of a token against the morphological transducer 410 can be much slower than desirable and may involve a backtracking search for the right path in one or several lexical transducers. In extreme cases, the procedure of morphological look-up can require time that is exponential in the length of the input token. This occurs mainly when an exponential number of intermediate results has to be generated, for example in the re-accentuation of French words from which accents have been removed. For some applications, the speed of morphological analysis is a major concern. In information retrieval systems, it may be required to analyze many gigabytes of text in a limited time frame, especially when documents are frequently updated. Ongoing theoretical and practical work on algorithms for the partial sequence realization of finite state transducers has already led to important progress. However, the urgent need for fast implementations of morphological look-up motivates the search for short-term solutions, which may achieve a partial effect without changes in the underlying look-up machinery and transducers.
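The exponential growth mentioned for re-accentuation can be made concrete with a small sketch. It naively enumerates every accented spelling of an unaccented token; this only illustrates how the number of intermediate candidates multiplies with token length and is not a description of how the lexical transducers actually search. The table `VARIANTS` and the function names are assumptions made for this example.

```python
import math
from itertools import product
from typing import Dict, Iterator

# Assumed table of accented variants for unaccented French letters.
VARIANTS: Dict[str, str] = {
    "a": "aàâ",
    "e": "eéèêë",
    "i": "iîï",
    "o": "oô",
    "u": "uùûü",
    "c": "cç",
}


def count_candidates(token: str) -> int:
    """Number of accented spellings: the product of per-letter choices."""
    return math.prod(len(VARIANTS.get(ch, ch)) for ch in token)


def candidates(token: str) -> Iterator[str]:
    """Enumerate every accented spelling of an unaccented token."""
    choices = [VARIANTS.get(ch, ch) for ch in token]
    return ("".join(combo) for combo in product(*choices))


n_eleve = count_candidates("eleve")       # 5 * 1 * 5 * 1 * 5 = 125
all_eleve = set(candidates("eleve"))      # contains "élève" and "élevé"
```

Under these assumptions a five-letter token with three occurrences of "e" already yields 125 candidate spellings, and each further "e" multiplies that number by five.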
Given these problems with the existing technology, it would be advantageous to provide a method and system that uses a cache memory to avoid repeated morphological look-up of the same token in the processing of large documents and thereby increases the processing speed substantially.
It would be further advantageous to provide a system that is modular and does not require modifications of the existing machinery.
It would also be advantageous to provide a system that allows the parameters of processing speed and memory capacity to be optimized.
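By way of illustration only, the three desiderata above can be sketched as a bounded cache wrapped around an unmodified look-up function. The sketch assumes a least-recently-used replacement policy as provided by Python's `functools.lru_cache`; the text above does not prescribe any particular policy. The names `make_cached_look_up`, `slow_look_up` and `max_entries` are hypothetical.

```python
from functools import lru_cache
from typing import Callable, List, Tuple

Analyses = Tuple[str, ...]


def make_cached_look_up(
    look_up: Callable[[str], Analyses], max_entries: int
) -> Callable[[str], Analyses]:
    """Wrap an existing look-up function without modifying it.

    max_entries bounds the memory used by the cache; a larger value
    trades memory for fewer repeated look-ups.
    """

    @lru_cache(maxsize=max_entries)
    def cached(token: str) -> Analyses:
        return look_up(token)

    return cached


calls: List[str] = []  # records which tokens reach the slow look-up


def slow_look_up(token: str) -> Analyses:
    """Hypothetical stand-in for the transducer-based look-up."""
    calls.append(token)
    return (token + "+Unknown",)


cached_look_up = make_cached_look_up(slow_look_up, max_entries=1024)
results = [cached_look_up(t) for t in ["the", "cat", "the", "the"]]
```

In this example the token "the" reaches the underlying look-up only once, although it occurs three times in the input. The wrapped function is left untouched, and `max_entries` is the single parameter that balances processing speed against memory capacity.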