Japanese text employs four writing systems, each with its own character set: Hiragana, Katakana, Kanji and Romaji. Katakana characters represent syllables, typically consonant and vowel combinations, and are used for writing words borrowed from Western languages, such as English. Hiragana characters also represent syllables and are used most extensively for writing grammatical words, such as adverbs; functional categories, such as verbal inflection; and other markers. Hiragana and Katakana are collectively known as Kana. Depending on the corpus, words written in Hiragana and Katakana have an average length of between three and five characters. Kanji characters were mostly borrowed from Chinese and are ideographic characters that represent meaning. Romaji are Roman characters, such as those of the Roman alphabet used for English.
In natural language processing, the presence of multiple writing systems complicates the task of processing and parsing Japanese text. This task is further complicated by the manner in which words are written in Japanese. In particular, words are written together without separating spaces (i.e., there are no delimiting white spaces between words). It is thus difficult for a computer system to identify individual words within a text string written in Japanese. One conventional approach has been to maximally match Kana and Kanji in the text string with words in a dictionary. Unfortunately, in order to identify a large number of words, this approach requires a dictionary that is too large to store efficiently in primary memory (i.e., RAM). As a result, the dictionary must be stored in secondary memory, and the overhead associated with accessing secondary memory is incurred each time a word is looked up in the dictionary. Moreover, even very large dictionaries cannot guarantee complete coverage of all words. This difficulty is compounded by the dynamic nature of the vocabulary of a natural language: words are added (i.e., new words are coined) and words are removed (i.e., words fall out of use or become antiquated) as time progresses. Thus, a fixed dictionary, by its nature, limits the coverage of words of a given language, and the dictionary will lose coverage over time.
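The conventional maximal matching approach described above may be sketched as a greedy, left-to-right, longest-match scan. The sketch below is illustrative only and is not taken from any particular system: the function name `max_match`, the toy dictionary, and the fallback of emitting an unmatched character as a single-character token are all assumptions. A small in-memory set stands in for the large dictionary that, as noted above, would in practice reside in secondary memory.

```python
def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy longest-match segmentation (hypothetical sketch).

    At each position, try the longest candidate first and shrink until a
    dictionary word is found; an unmatched character becomes its own token.
    """
    # Longest dictionary entry bounds the candidate length; never below 1.
    max_len = max(1, max((len(word) for word in dictionary), default=1))
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            # Each membership test stands in for one dictionary access.
            if length == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
    return tokens

# Toy dictionary (assumption): a real one would hold a very large number of entries.
toy_dictionary = {"私", "は", "コーヒー", "を", "飲む"}

print(max_match("私はコーヒーを飲む", toy_dictionary))
print(max_match("私はラテを飲む", toy_dictionary))  # "ラテ" is not in the dictionary
```

In the second call, the word ラテ is absent from the toy dictionary and is broken into single characters, which illustrates the coverage limitation of a fixed dictionary. The inner loop also shows why lookup cost matters: several dictionary accesses may be needed at every position in the text string.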