1. Field of the Invention
The present invention relates to machine translation systems and, more particularly, to machine translation systems having an idiom processing function which are capable of registering, retrieving and translating idioms.
2. Description of Related Arts
Conventional language processing systems for practical use include word processors for supporting documentation and machine translation systems for translating a document from one language to another language.
These language processing systems have a dictionary for storing therein a multiplicity of unit items each including a header and various kinds of information associated thereto. Headers include not only words of a natural language such as English or Japanese but also sequences of words (such as phrases and correlated words) which are taken as a whole to express a certain meaning, i.e., idioms. Some consist of consecutive words like "high school", and others consist of split words like "so . . . that" (split idioms).
In a translation process, the interrelation between a split idiom of a source language and an equivalent expression of a target language varies case by case and, hence, it is difficult to handle the split idioms. Exemplary split idioms are shown below:
I have "both" A "and" B. (1a)
I have "neither" A "nor" B. (1b)
This is "so" hot "that" children cannot drink it. (2a)
This is "too" hot "to" drink it. (2b)
This is "the same" book "that" you bought. (2c)
A conventional machine translation system introduces representative symbols as shown in FIG. 15 when registering such split idioms into a dictionary, so that idioms having a variable portion consisting of a single word as well as a word sequence can be processed for translation. Japanese Unexamined Patent Publication HEI 6(1994)-139272 discloses a machine translation system which registers idioms by using the representative symbols.
For example, idioms included in the above sentences are registered in the following manner:

Idiom registration for the sentence (1a)
______________________________________
Idiom in English    both *N1 and *N2 (1)
Part of speech      noun phrase
Translation         *N1 *N2
Part of speech      others (2)
______________________________________
1: *N means a noun phrase.
2: "Others" includes noninflectional nouns, adverbs and the like.

Idiom registration for the sentence (1b)
______________________________________
Idiom in English    neither *N1 nor *N2
Part of speech      noun phrase
Translation         *N1 *N2
Part of speech      others
______________________________________

Idiom registration for the sentence (2a)
______________________________________
Idiom in English    so *A that *c
Part of speech      adjective
Translation         *c [:] *A
Part of speech      (3)
______________________________________
3: Since the part of speech of the translated idiom is determined by the part of speech of a word or word sequence *A in the translation, the part of speech thereof is not specified.

Idiom registration for the sentence (2b)
______________________________________
Idiom in English    too *a to *I
Part of speech      adjective
Translation         *I [:] *a
Part of speech      (4)
______________________________________
4: Since the part of speech of the translated idiom is determined by the part of speech of a word or word sequence *a in the translation, the part of speech thereof is not specified.

Idiom registration for the sentence (2c)
______________________________________
Idiom in English    the same *n that *C
Part of speech      noun
Translation         *C [:] *n
Part of speech      (5)
______________________________________
5: Since the part of speech of the translated idiom is determined by the part of speech of a word or word sequence *n in the translation, the part of speech thereof is not specified.
In this prior art, an idiom matchable with an input word sequence is retrieved from the idioms registered as shown above in the dictionary, and the syntax of words or word sequences represented by representative symbols is analyzed. Based on the result of the syntactical analysis, a translation corresponding to the idiom in the input word sequence is generated by using a translated expression registered as shown above.
Where a variable portion separating the constituents of the idiom is a word sequence or a phrase, i.e., where a representative symbol (a non-terminal representative symbol as shown in FIG. 15) representing a word sequence in an idiom header corresponds to the variable portion, the translation system of the prior art cuts out the variable portion to perform a syntactical analysis only for the variable portion.
Then, if the syntactical analysis is successfully completed, the translation system generates a translation of the variable portion, and inserts the translation into a translated expression of the idiom header. After a portion corresponding to the idiom is thus translated, the system translates the entire sentence by a recursive process.
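By way of illustration only, the cutting out of variable portions from an input word sequence can be sketched as follows. The sketch is not the method of the cited publication: the function names `header_to_regex` and `match_idiom` are hypothetical, and the syntactical analysis that the prior art performs on each variable portion is replaced here by a naive assumption that a variable portion is any run of words ending at the next fixed word of the idiom header (or at the end of the sentence for the last symbol).

```python
import re

def header_to_regex(header):
    """Compile an idiom header such as 'both *N1 and *N2' into a regex.

    Tokens starting with '*' are representative symbols (variable portions).
    All other tokens are fixed words that must match literally.
    """
    tokens = header.split()
    parts = []
    for i, token in enumerate(tokens):
        if token.startswith("*"):
            name = token[1:]
            # Assumption: a variable portion is one or more words. It is lazy
            # (stops at the next fixed word) unless it is the last token.
            lazy = "" if i == len(tokens) - 1 else "?"
            parts.append("(?P<" + name + r">\w+(?: \w+)*" + lazy + ")")
        else:
            parts.append(re.escape(token))
    return re.compile(r"\b" + " ".join(parts) + r"\b")

def match_idiom(header, sentence):
    """Return the variable portions as a dict, or None if there is no match."""
    m = header_to_regex(header).search(sentence)
    return m.groupdict() if m else None

print(match_idiom("both *N1 and *N2", "I have both red apples and green oranges."))
print(match_idiom("so *A that *c", "This is so hot that children cannot drink it."))
```

In an actual system each captured portion would still have to pass the syntactical analysis described above before its translation is inserted into the translated expression of the idiom header.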
A prior-art translation system having no such special registration means requires special rules as shown below to process split idioms.
In the sentence (1a), for example, a word sequence "both A and B" which is an object of the verb "have" functions as a noun phrase. Therefore, a rule is required to define the syntax of the noun phrase as follows:
Noun phrase → correlative word 1 + noun phrase + correlative word 2 + noun phrase
where the correlative words 1 and 2 correspond to words "both" and "and", and the first and second noun phrases correspond to "A" and "B", respectively, in the sentence (1a).
Thus, the rule is designed so as to include special parts of speech such as the correlative words 1 and 2. In case of the sentence (1a), a translation ""(no ryouhou) is assigned to the word "both" corresponding to the correlative word 1, and a translation ""(to) is assigned to the word "and" corresponding to the correlative word 2.
Then, a translation "AB"(A to B no ryouhou) is assigned to this idiom.
In case of the sentence (2a), a word sequence "so hot that children cannot drink it" functions as a complement of the word "is". Therefore, another rule is required to define the syntax of the adjective phrase as follows:
Adjective phrase → correlative word 1 + adjective + correlative word 2 + clause
where the correlative words 1 and 2 correspond to words "so" and "that", respectively.
Similarly, still another rule is required to define the syntax of the adjective phrase in the sentence (2b) as follows:
Adjective phrase → correlative word 1 + adjective + correlative word 2 + verb phrase
Yet another rule is required to define the syntax of the noun phrase in the sentence (2c) as follows:
Noun phrase → correlative word 1 + noun + correlative word 2 + clause (no object)
To merely cover the aforesaid five idiomatic expressions, four rules are required.
The correlative word 1 should be compatible with the correlative word 2 in each of the idiomatic expressions, and the combination thereof is predetermined. In the sentences (1a) and (1b), for example, the word sequences "both A and B" and "neither A nor B" are correct, but a word sequence "both A nor B" is incorrect. That is, the words "both" and "neither" are compatible with "and" and "nor", respectively. Therefore, additional information concerning the compatible combination of the correlative words 1 and 2 should be described in the dictionary.
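The additional compatibility information can be pictured, purely as an illustrative sketch, as a lookup table from each correlative word 1 to the correlative words 2 it accepts. The table `COMPATIBLE` and the function `is_compatible` are hypothetical names introduced here; the entries are taken from the sentences (1a) to (2c), and the actual dictionary format of a prior-art system is not specified in this description.

```python
# Hypothetical compatibility table: correlative word 1 -> permitted correlative words 2.
# Entries are drawn from the example sentences (1a) to (2c).
COMPATIBLE = {
    "both": {"and"},
    "neither": {"nor"},
    "so": {"that"},
    "too": {"to"},
    "the same": {"that"},
}

def is_compatible(word1, word2):
    """Return True if correlative word 2 may follow correlative word 1."""
    return word2 in COMPATIBLE.get(word1, set())

print(is_compatible("both", "and"))     # "both A and B" is accepted
print(is_compatible("both", "nor"))     # "both A nor B" is rejected
```

Such a table must be consulted in addition to the grammatical rules themselves, since a rule stated only in terms of "correlative word 1" and "correlative word 2" would otherwise accept "both A nor B".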
In the conventional translation system, idiom headers registered in the dictionary can include a variable portion in the form of a phrase and, therefore, the generality of the idiom headers can be significantly improved. However, the matching of the variable portion is performed only after a phrase corresponding to the variable portion is extracted from a source text and subjected to a syntactical analysis process, a transformation process and a generation process.
If the syntactical analysis of the variable portion fails, the syntactical analysis process for the variable portion is recursively performed many times and, therefore, the overhead for this process is increased. More specifically, as idiom headers having variable phrases are increasingly registered in the dictionary, more time is required for the matching of an idiom header. Therefore, the throughput of the entire translation process may be reduced by the registration of an idiom header including a variable portion of an unexpected syntax.
Further, as previously stated, the provision of numerous special exceptional rules may be disadvantageous in terms of the throughput. Even if all the rules such as the aforesaid grammatical rules possibly required for the translation process are prepared and the information concerning the compatible combination of correlative words is registered for each of the idiom headers in the dictionary, the following problems will arise.
In case of a sentence (3a) as shown in FIG. 16, in which words "is" and "designed" constitute a passive-voice verbal phrase and words "so" and "that" constitute a subordinate conjunction, the syntactical relation of the former crosses the syntactical relation of the latter. Therefore, a grammatical rule for processing such a syntax cannot be described.
More specifically, assuming that there is a sentence "abcd" consisting of four words, in which "a" directly relates to "b" and "c" directly relates to "d", the following three grammatical rules can be applied to this sentence.
X → ab ("a" relates to "b", and "X" is generated by arranging "a" and "b" in this order.)
Y → cd ("c" relates to "d", and "Y" is generated by arranging "c" and "d" in this order.)
Z → XY ("X" relates to "Y", and "Z" is generated by arranging "X" and "Y" in this order.)
Z = [X, Y] = [(a, b), (c, d)] = abcd
However, if "a" relates to "c" and "b" relates to "d", i.e., the syntactical relation of the former crosses the syntactical relation of the latter, the syntax of this sentence cannot be described in accordance with the aforesaid grammatical rules.
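The crossing condition can be stated concretely with a minimal sketch, given here for illustration only. Each relation is represented as a pair of word positions, with "a", "b", "c" and "d" at positions 0 to 3; the function name `relations_cross` is hypothetical. Two relations cross when exactly one end of one relation lies strictly between the two ends of the other, which is the case that rules of the above form, which only join adjacent constituents, cannot describe.

```python
def relations_cross(rel1, rel2):
    """Return True if two word-to-word relations cross each other.

    Each relation is a pair of word positions (i, j). The relations cross
    when one starts strictly inside the other and ends strictly outside it.
    """
    a, b = sorted(rel1)
    c, d = sorted(rel2)
    return a < c < b < d or c < a < d < b

# Sentence "abcd": a=0, b=1, c=2, d=3.
print(relations_cross((0, 1), (2, 3)))  # "a"-"b" and "c"-"d": side by side
print(relations_cross((0, 2), (1, 3)))  # "a"-"c" and "b"-"d": crossing
```

Relations that are side by side, or where one is nested wholly inside the other, do not satisfy the condition and can be bracketed as in Z = [X, Y].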
The conventional machine translation system requires numerous grammatical rules to describe the syntactical relation between separated words in a split idiom. Even if exceptional rules are prepared to process split idioms, a syntax such as that of the sentence (3a) cannot be properly analyzed for correct translation. This is because sentences of a source language may include various kinds of split idioms, and an increased number of exceptional processes is required for the processing of these split idioms.