1. Field of the Invention
The present invention relates to the field of predicting readings of foreign languages, and more particularly, to the reliable and effective prediction of readings of Japanese ideographs.
2. Brief Description of Prior Developments
The Japanese language is written using a combination of four scripts: hiragana, katakana, romaji, and kanji. Hiragana and katakana are syllabaries, that is, phonetic scripts in which each character represents a syllable of a word. Hiragana and katakana are collectively referred to as kana. Katakana are usually reserved for writing words that have been borrowed from foreign languages (except Chinese) within the last 400 years; they may also be used to provide emphasis or for graphic effect. Romaji are an alphabet: the familiar Roman alphabet used in North America, Western Europe and elsewhere. In the past, romaji have been used to transcribe loan words, for emphasis, and to transcribe Japanese for foreign armies of occupation. Kanji are ideographs, characters that represent specific words or parts of words rather than specific sounds. Kanji are not merely related to free-floating ideas, however. The link between kanji and words is fixed, for the most part. That is, for most words, a writer cannot choose between different kanji. For example, even though all Japanese speakers would agree that both the characters 犬 and 狗 essentially mean “dog”, it would be incomprehensible to write the word 忠犬 (chuuken) “faithful dog” using the character 狗. Likewise, the link between words and their pronunciation is fixed. That is, dialectal variation aside, there is usually only one way to pronounce a word. Thus, there is a firm link between kanji and pronunciation, but it is not a direct one: it is always mediated through the particular word that is being written.
Writers can, however, choose whether or not to use kanji at all. It would not be incorrect to write chuuken using hiragana (ちゅうけん), katakana (チュウケン), romaji (chuuken), or a mixture (, ). It is very common to write words (especially verbs) in a combination of kanji and hiragana. However, any other mixture of scripts within the same word is unusual enough to be considered an error. Because a word that contains kanji can also be written in a phonetic script, it is possible to talk about the phonetic value of the kanji in that word. This is what is meant by the reading of a kanji in a particular word: its pronunciation when the word is read aloud, or its spelling in a phonetic script when the word is written phonetically. For example, the reading of 犬 in 忠犬 is ken. However, because of the particular history of Japanese, most kanji have at least two entirely distinct readings. For example, the reading of 犬 in the word 犬泳ぎ (inuoyogi) is inu; 人 is read as nin in 人間 (ningen), jin in 日本人 (nihonjin), and hito in 人々 (hitobito). Furthermore, many kanji have different readings that are systematically related to each other. For example, 発 is read as hatsu in 開発 (kaihatsu), ha? in 発表 (happyou), and patsu in 活発 (kappatsu).
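The dependence of a reading on the word in which the kanji appears can be illustrated with a minimal Python sketch. The table `READINGS` and the helpers `reading_in_word` and `readings_of` are hypothetical names introduced only for illustration, and the kanji spellings are inferred from the romanized examples above (ningen, nihonjin, hitobito, chuuken, inuoyogi); this is not the method of the invention.

```python
# Hypothetical table: (word, kanji) -> reading of that kanji within that word.
# Kanji spellings are assumed from the romanized examples in the text.
READINGS = {
    ("人間", "人"): "nin",    # ningen
    ("日本人", "人"): "jin",  # nihonjin
    ("人々", "人"): "hito",   # hitobito
    ("忠犬", "犬"): "ken",    # chuuken
    ("犬泳ぎ", "犬"): "inu",  # inuoyogi
}


def reading_in_word(word, kanji):
    """Return the reading of `kanji` inside `word`, or None if not recorded."""
    return READINGS.get((word, kanji))


def readings_of(kanji):
    """Return every distinct reading recorded for `kanji`, sorted."""
    return sorted({r for (_, k), r in READINGS.items() if k == kanji})
```

The lookup key is the pair of word and kanji, not the kanji alone, which mirrors the point that the link between a kanji and its pronunciation is mediated through the word being written.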
A final source of complexity when determining the underlying reading of Japanese written language (e.g. Japanese script) is that there is some variation in how much of a word is represented in kanji. For example, the word kakitsuke is sometimes written as , but at other times as . The reading of the kanji  is ka in the first variant, kaki in the second. Both of these variants are considered acceptable, but to mix the two variants in a single document is considered an error.
Given all of the above-mentioned sources of variation, predicting the correct reading of a kanji in a given word is not a simple task. Educated native speakers of Japanese can usually remember or guess the correct readings of kanji, but software is less successful at performing this task.
Current practices in automating the reading of Japanese script are inefficient and can be unreliable. For example, a brute-force solution to the problem is to create a dictionary of words and link the entry for the phonetic spelling of a word to the entries for all its other dictionary spellings. This type of solution, however, faces several problems. Since Japanese is traditionally written without inserting space between words, it is far from trivial to look words up in a dictionary. It would be necessary to first identify the boundaries between the words, requiring a considerable level of linguistic knowledge and an expenditure of significant resources. Because Japanese is a more highly inflected language than English, it is quite common for word forms to be extensively modified by affixation and compounding; a dictionary that contained every possible form of a word would be astonishingly large and unwieldy. As such, no dictionary could be sufficiently large to adequately predict readings of Japanese script. Further, since new words are always being coined or borrowed, such a dictionary would have to be adaptable and updateable.
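The limits of the brute-force dictionary approach can be seen in a short Python sketch. The lexicon contents, the function name `brute_force_reading`, and the use of greedy longest-match segmentation are all assumptions made for illustration; the text above does not specify how word boundaries would be found, and the kanji spellings are inferred from the romanized examples.

```python
# Hypothetical lexicon linking a written spelling to its phonetic spelling.
LEXICON = {
    "忠犬": "chuuken",
    "人間": "ningen",
    "日本人": "nihonjin",
    "日本": "nihon",
}


def brute_force_reading(text, lexicon=LEXICON):
    """Segment `text` by greedy longest match and look up each segment.

    Returns a list of (segment, reading) pairs. A character that starts
    no known word is emitted alone with a reading of None.
    """
    result = []
    max_len = max(map(len, lexicon))
    i = 0
    while i < len(text):
        # Try the longest candidate first, then progressively shorter ones.
        for n in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + n]
            if chunk in lexicon:
                result.append((chunk, lexicon[chunk]))
                i += n
                break
        else:
            # No entry begins here: the dictionary cannot supply a reading.
            result.append((text[i], None))
            i += 1
    return result
```

Under these assumptions, a word that is absent from the lexicon, such as an inflected form or a newly coined word, yields no reading at all, and every additional form that must be covered enlarges the lexicon.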
From the foregoing it is appreciated that there exists a need for systems and methods that efficiently and reliably predict the reading of Japanese script. By having these systems and methods, the drawbacks of existing practices are overcome.