1. Field of the Invention
The present invention relates to a method and device for morphological analysis of language text in electronic form, in particular by means of a probabilistic technique, without using a dictionary.
2. Description of the Related Art
Morphological analysis processing is extremely important in language processing; in Japanese language processing in particular, it is a prerequisite for further processing such as syntactic analysis. With the spread of text composition using word processors and the spread of the Internet, large amounts of Japanese language text in electronic form can easily be acquired. In order to perform processing such as lookup, composition, comparison, categorisation and summarisation of such texts with word processors or other computer devices, an overriding precondition is the ability to pick out semantic units such as words or phrases in the text, in other words, to perform morphological analysis correctly. If this morphological analysis is incorrect, it is difficult for the error to be corrected in subsequent processing such as syntactic analysis or semantic analysis. Even if such correction is possible, the processing is made more complicated, so it becomes impossible to process a large quantity of text within the expected time. In morphological analysis processing, compared with languages such as English, whose orthography employs a space as a word division symbol, languages such as Japanese that are written with no word division pose a considerable challenge: how to achieve inference of parts of speech and word division with high accuracy and at high speed.
The same problem as described above is found in languages such as Korean, Chinese and Thai that, like Japanese, have an orthography with no word division.
In techniques for morphological analysis of English, in which words are separated by word separators (spaces) and it suffices simply to allocate a tag such as a part of speech to each word, a technique has been established of inferring from a large text a probabilistic model of parts of speech or of the tag sequences representing their arrangement and, further, of correcting errors using examples. For the Japanese language also, there have been several proposals for applying this technique used for English. An example using a probabilistic model is proposed in "Japanese Language Letter Recognition Method and Device", proposed in reference I: "TOKKAIHEI, i.e. Japanese Unexamined Patent Publication No. 8-315078", which was applied for by NTT.
As is already known, in order to find an optimum morphological analysis result with a probabilistic model, a morpheme sequence and tag sequence may be found such as to maximise the joint probability of the morpheme sequence and the tag sequence attached to each morpheme. The joint probability means the probability that a given candidate morpheme sequence and candidate tag sequence occur simultaneously. In English, since the word separators are known, the morpheme sequences are fixed, so an optimum tag sequence can be inferred. However, in languages such as Japanese, Korean, Chinese or Thai, in which no word divisions are made in writing, the word separations are not clear, so there is no alternative to comparing the probabilities of the word sequences obtained at all possible word separations. However, since these word sequences have different lengths depending on how the word division is effected, a condition regarding length is included as an approximation in order to compare word sequences of different length.
A simple description of this point is given below, taking the Japanese language as an example. Morphological analysis consists in finding, for a given input text, the optimum morpheme sequence W and tag sequence T for the input character sequence. This can be achieved by selecting a chain probability model in which the joint probability p(W,T) of the morpheme sequence W and tag sequence T is maximised. In general, the chain probability model of expression (1) below is employed (see reference I). The chain probability means the probability that a given n (where n is an arbitrary number) of items, such as characters or tags, appear consecutively.

p(W,T) ≈ Π(i=1 to length(W)) p(t_i | t_(i-N+1) … t_(i-1)) p(w_i | t_i)   (1)

where i is the position in the morpheme sequence, w_i is the i-th morpheme in the morpheme sequence, t_i is the i-th tag in the tag sequence, and N is the size of the group of preceding items that is referenced: usually, N=1 or 2 or 3. length(W) is the length of the input word sequence, i.e. the number of words constituting the input text.
The chain probability model expressed by this expression (1) is referred to hereinbelow as the part of speech N-gram model. Since expression (1) includes a condition based on the length length(W) of the input morpheme sequence, strictly speaking, an approximation regarding length(W) is included in p(W,T) of expression (1). In the case of English, since the length of a morpheme sequence is fixed, there is no problem in finding the maximum probability p(W,T). However, in the case of Japanese, since the morpheme separators are not known, it is necessary to obtain a morpheme network (constituting a semi-ordered relationship) using the character sequence of the input text and a dictionary, and then to calculate the probabilities of all paths in this semi-ordered structure using the part of speech N-gram model. When this is done, in the case of Japanese, since the morpheme separators are not given, it is necessary to compare the probabilities of morpheme sequences of different length length(W). As a result, expression (1), in which an approximation is effected with a probability conditioned by length, and which causes no problems in the case of English, requires one more stage of approximation in the case of Japanese. That is, in contrast to the case of English, in the case of Japanese the chain probabilities of all possible candidates are not compared under the same conditions.
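As an aid to understanding, the scoring of a single candidate under expression (1) can be sketched as follows. This is a minimal illustration, not the implementation of reference I: the probability tables are hypothetical, a tag bigram (N=2) is assumed, and BOS/EOS are assumed sentence-boundary markers.

```python
# Hypothetical probability tables for illustration only.
P_TAG = {("BOS", "noun"): 0.5, ("noun", "verb"): 0.4, ("verb", "EOS"): 0.6}
P_WORD = {("w1", "noun"): 0.2, ("w2", "verb"): 0.1}

def score(words, tags):
    """Joint probability of a morpheme sequence and its tag sequence
    under a tag-bigram model: product of p(t_i | t_{i-1}) * p(w_i | t_i)."""
    p = 1.0
    prev = "BOS"
    for w, t in zip(words, tags):
        p *= P_TAG.get((prev, t), 0.0) * P_WORD.get((w, t), 0.0)
        prev = t
    return p * P_TAG.get((prev, "EOS"), 0.0)

print(score(["w1", "w2"], ["noun", "verb"]))  # 0.5*0.2*0.4*0.1*0.6
```

For English, the word list is fixed and only the tags vary; for Japanese, every candidate segmentation yields a different word list, so products of different lengths must be compared, which is the difficulty discussed above.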
Under this approximation, morpheme sequences with the fewest divisions (i.e. with longer morphemes) are prioritised. The reason for this is that the number of possible sequences grows with sequence length, so the average chain probability of any single candidate becomes smaller.
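This length bias can be seen with simple arithmetic; the value of q below is an assumed average per-morpheme probability, chosen purely for illustration:

```python
# Every extra morpheme multiplies in one more factor below 1, so a
# segmentation into fewer, longer morphemes receives a larger product
# even if neither segmentation is linguistically better.
q = 0.1                   # assumed average probability per morpheme
two_way_split = q ** 2    # product of 2 factors
three_way_split = q ** 3  # product of 3 factors
assert two_way_split > three_way_split
```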
Furthermore, if the input character sequence is an unknown word, i.e. is not present in the dictionary, a fresh problem arises with the probabilistic model analysis technique. In the case of English, no special improvement of the probabilistic model is required, since it suffices, even for an unknown word, simply to consider all possible tags for that word. Also, since the number of possible tags is comparatively small (a few tens), the part of speech can be deduced with considerable accuracy. However, in the case of an unknown word in Japanese, it is necessary to consider all possible positions of the character sequence constituting the unknown word (i.e. at which position should it be divided?), all possible lengths (i.e. what is the character structure of the word?) and all possible combinations of the respective morphemes, so the amount of calculation is one that cannot be handled with a simple probabilistic model.
Also, if an unknown word is present, the dictionary cannot be used, so a semi-ordered structure cannot be obtained.
With the technique disclosed in reference I, unknown words are dealt with by introducing a word model that uses the chain probability of the characters in an unknown word. However, with this technique, only the chain probability within the word is employed; how probable the word is in the light of the preceding and following context can be represented only indirectly, by the chain probability of parts of speech. That is, unknown character sequences cannot be correctly recognised or divided up without using the chain probability of the entire context (i.e. of a character sequence extending beyond the range of the unknown character sequence).
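The word model of reference I can be sketched roughly as follows. This is a simplified character-bigram version with hypothetical probability tables (the actual model in reference I may differ in detail); note that only character transitions inside the word contribute, which is precisely the limitation described above.

```python
# Hypothetical character-bigram table; <w> and </w> mark word boundaries.
P_CHAR = {("<w>", "a"): 0.3, ("a", "b"): 0.2, ("b", "</w>"): 0.5}

def word_model(word):
    """Probability of an unknown word as the chain probability of its
    characters; context outside the word plays no role here."""
    chars = ["<w>"] + list(word) + ["</w>"]
    p = 1.0
    for prev, cur in zip(chars, chars[1:]):
        p *= P_CHAR.get((prev, cur), 0.0)
    return p

print(word_model("ab"))  # 0.3*0.2*0.5
```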
Also, since this prior art technique is solely a word-based technique, if the morphological analysis system provisionally concludes that there is an unknown word, combinations of candidate words of arbitrary length must be considered at all locations in the text: this therefore increases the amount of computation.
The problems described above will be summarised as follows:
1) Since the morphological analysis technique disclosed in the reference is word-based, in the case of Japanese, a dictionary is indispensable. However, even if a dictionary is provided, if an unknown word is present, the dictionary cannot be used, so word division is adversely affected.
2) In the case of Japanese, owing to the ambiguity of word division, the probabilistic model used in the case of English, in which the number of divided words is fixed, cannot be applied without modification. For example, if two modes of division giving different numbers of divided words are compared, the mode of division that involves fewer divisions, i.e. that produces longer words, will tend to obtain a higher evaluation value.
3) Due to the above problem 1), the following fresh problem is created as regards processing efficiency. This is that, with the prior art method, since it is word-based, a dictionary is indispensable merely in order to divide up the words. The troublesome task of compiling a dictionary is therefore essential and resources to store this dictionary are also required. Furthermore, during execution of processing, there is a large memory requirement and processing time is prolonged by referring to the dictionary.
Accordingly, there has previously been a demand for a method and device for morphological analysis, and in particular for morphological analysis of Japanese, in which, even though a probabilistic technique is employed, no dictionary is needed, the probability calculation does not depend on the number of words into which the text is divided, morphological analysis can be achieved with high accuracy and at high speed, and economies of resources are possible.