This invention relates to a device for breaking a Japanese sentence into a succession of morphemes. The device, previously called a "Device for Analyzing Japanese Sentences into Morphemes with Attention Directed to Morpheme Groups", is herein called by the short form name, "Morpheme Analysis Device."
A sentence consists of morphemes. Each morpheme may be either a dictionary word or an allomorph, depending on the circumstances. A sentence or portion of a sentence may properly be called a syntagm, since a syntagm is defined as a phrase, a clause, or a sentence.
The concept of morphemes is very useful for a language such as Japanese. In the English language, morphemes are easily detected since morphemes correspond to words, and spaces are placed around words. This is not true, in contrast, for a language such as Japanese, in which sentences are written without spacing, and thus, there is no pause between successive morphemes.
The morpheme analysis device is useful in a machine translation system which deals with the Japanese language as a source language. The morpheme analysis device is useful also in a speech sound synthesis system for producing speech sound in compliance with a text written in the Japanese language.
For the purposes of illustration, this specification will refer to Japanese syntagms written in an English equivalent as often as possible. The Japanese writing system uses Chinese characters and phonetic characters (called Kanji and Kana, respectively). Chinese characters will not be used herein. If, however, it becomes necessary to phonetically represent a Japanese syntagm, the Japanese syntagm will be written in accordance with International Standard ISO 3602.
One method of separating a morpheme in a syntagm from other morphemes is described in "Pocket Handbook of Colloquial Japanese" cited in U.S. Pat. No. 4,635,199, issued to Kazunori Muraki and assigned to the present assignee and incorporated herein by reference. This method provides that the syntagm is first separated into morphemes with reference to a dictionary and by using a known algorithm for breaking the syntagm into morphemes. Examples of known algorithms are used in U.S. Pat. No. 4,931,936, issued to Shuzo Kugimiya, et al., and U.S. Pat. No. 4,771,385, issued to Kazunari Egami, et al., both incorporated herein by reference. Nevertheless, certain morphemes are ambiguous and may have more than one meaning which cannot be resolved by simply separating the morphemes.
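By way of illustration only, dictionary-based separation of an unspaced syntagm can be sketched as follows. The sketch uses a generic greedy longest-match strategy, which is an assumption made for this example and is not taken from the cited patents. The dictionary entries, the romanized splitting of the syntagm, and the function name `longest_match_segment` are likewise hypothetical.

```python
# Toy dictionary (hypothetical entries, romanized for illustration only).
DICTIONARY = {
    "zyu": "ten",
    "zi": "o'clock",
    "do": "degrees",
    "nizip": "twenty",
    "pun": "minutes",
}


def longest_match_segment(text, dictionary):
    """Split unspaced text into dictionary entries, longest entry first.

    This is a greedy strategy: it never backtracks, so it raises
    ValueError when no entry matches at the current position.
    """
    max_len = max(len(entry) for entry in dictionary)
    morphemes = []
    pos = 0
    while pos < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in dictionary:
                morphemes.append(candidate)
                pos += length
                break
        else:
            raise ValueError(f"no dictionary entry matches at position {pos}")
    return morphemes


print(longest_match_segment("zyuzinizippun", DICTIONARY))
```

Such a separation yields only the succession of morphemes. It does not indicate which of several meanings an ambiguous morpheme carries.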
In the Japanese language, use is occasionally made of a group of at least two morphemes, in which the morphemes customarily appear in a specific order. Examples of such a morpheme group are a section, a department, and a division of a corporation. It is possible to use such morpheme groups in resolving ambiguities which would otherwise be inevitable when breaking a syntagm into morphemes.
The following are examples of morpheme groups in which the morphemes appear in a customary order. In describing the following examples, the symbol "PT" is used in lieu of one Chinese character. As these examples illustrate, in Japanese, the character PT means a part or parts and is used in expressing, inter alia, (1) a time instant, (2) a measure of an angle, and (3) a ratio.
One Japanese syntagm containing the morpheme PT is "zyuzi nizippun," which means "twenty minutes past ten" in English. When translated morpheme by morpheme into English and arranged from beginning to end of the Japanese syntagm, the syntagm is composed of four morphemes: "ten", "o'clock", "twenty", and "minutes" in English. The syntagm is usually written in Japan by a set of two Arabic numerals (or, more correctly, Hindu numerals) 10, another Chinese character represented herein by O'C, another set of Arabic numerals 20, and the character PT. When used to express a time instant in this manner, the character PT customarily appears after the other character O'C in a morpheme group which consists of two morphemes written by the characters O'C and PT. These morphemes may be called a time instant group.
A second Japanese syntagm also containing the morpheme PT is "zyudo nizippun," which means "ten degrees [and] twenty minutes" in English. The syntagm is composed of four morphemes: "ten", "degrees", "twenty", and "minutes" in English. The syntagm is ordinarily written by a set of Arabic numerals 10, still another Chinese character represented herein by DG, a set of Arabic numerals 20, and the character PT. When used to express a measure of an angle, the character PT is found after the other character DG in a group which consists of two morphemes written by the characters DG and PT. These morphemes may be called an angle group.
A third example of a syntagm containing the morpheme PT is a ratio. A ratio may be expressed in the Japanese language either according to a traditional expression, or by using "percent" which is pronounced and written "pasento" in kana letters. In the traditional expression, the character PT means a hundredth or hundredths and is located after yet another Chinese character which means a tenth or tenths. The tenths character will herein be represented by TTH. When used to express a ratio according to the traditional expression, the character PT occurs after the other character, TTH, in a group which consists of two morphemes written by the characters TTH and PT. These morphemes may be called a ratio group.
Another example of a morpheme group with a customary order is the post office address. It is possible to express any address in Japan by using some of about fifteen morphemes of a group; addresses within postal wards, however, are simplified and thus shorter. For example, the full address of the NEC Corporation is "Tokyo-to Minato-ku Siba Gotyome Sitiban [or, Nanaban] Zyugogo" in the Japanese language. Depending on the locality within the twenty-three wards of Tokyo, a proper noun is substituted together with a suffix "mati" or "tyo" for the word "Siba" used in the above address. As the above address illustrates, an address used within a Tokyo ward uses five or six morphemes out of a group of the seven morphemes "to", "ku", "mati", "tyo", "tyome", "ban", and "go" in that order. This morpheme group may be called a Japan address group or simply an address group.
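The customary order within the address group can be illustrated by a short check. This is a minimal sketch under stated assumptions: the list `ADDRESS_GROUP` holds the seven morphemes named above, while the function name `follows_address_order` and the decision to reject any morpheme outside the group are hypothetical choices made for this example.

```python
# The seven address morphemes in their customary order (from the text above).
ADDRESS_GROUP = ["to", "ku", "mati", "tyo", "tyome", "ban", "go"]


def follows_address_order(suffixes):
    """Return True if the suffixes all belong to the address group and
    appear in the customary order (assumption: no suffix is repeated)."""
    try:
        positions = [ADDRESS_GROUP.index(s) for s in suffixes]
    except ValueError:
        # A morpheme outside the group cannot form an address group.
        return False
    # Each suffix must come later in the group than the one before it.
    return all(a < b for a, b in zip(positions, positions[1:]))


# Suffixes of "Tokyo-to Minato-ku Siba Gotyome Sitiban Zyugogo".
print(follows_address_order(["to", "ku", "tyome", "ban", "go"]))
```

A sequence such as "ku" followed by "to" would fail this check, which suggests that the morphemes do not form an address group in that reading.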
Such a group of morphemes is useful, where applicable, in breaking a syntagm into analyzed morphemes and resolving ambiguities. As illustrated by the first three examples, the character PT has several meanings, depending on its context as determined by the other characters or the morpheme group within which it appears. The morpheme PT expresses a time instant if found after the morpheme O'C in the time instant group, a measure of an angle if located after the morpheme DG in the angle group, or a ratio if it appears after the morpheme TTH in the ratio group. Furthermore, morphemes within a Japan address group can be resolved if it is noted that they occur within a Japan address group.
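The resolution of the morpheme PT by its group can be sketched as follows. This is an illustrative sketch only and is not the device described in this specification. The table `PT_SENSE_BY_LEADER`, the sense labels, and the function name `resolve_pt` are hypothetical, and the syntagm is assumed to have already been separated into morphemes written with the symbols O'C, DG, TTH, and PT used above.

```python
# Hypothetical table: the meaning PT takes after each preceding group morpheme.
PT_SENSE_BY_LEADER = {
    "O'C": "minute of time",   # time instant group
    "DG": "minute of angle",   # angle group
    "TTH": "hundredth",        # ratio group
}


def resolve_pt(morphemes):
    """Return (index, sense) for every PT in the morpheme sequence.

    The sense is None when no group morpheme precedes the PT, that is,
    when the ambiguity cannot be resolved from the group alone.
    """
    results = []
    leader = None
    for i, morpheme in enumerate(morphemes):
        if morpheme in PT_SENSE_BY_LEADER:
            leader = morpheme          # remember the most recent group morpheme
        elif morpheme == "PT":
            results.append((i, PT_SENSE_BY_LEADER.get(leader)))
            leader = None              # each group morpheme resolves one PT
    return results


print(resolve_pt(["10", "O'C", "20", "PT"]))  # time instant example
print(resolve_pt(["10", "DG", "20", "PT"]))   # angle example
```

In this sketch the numerals between the two morphemes of a group are simply skipped, on the assumption that they do not affect which group applies.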
Heretofore, no device has been known which has a capability of both analyzing morphemes and resolving these ambiguities. When breaking a syntagm into morphemes, ambiguities have not been removed by taking advantage of the customary order of morphemes within a morpheme group.