Automated language analysis systems embedded in a computer typically include a lexicon module and a processing module. The lexicon module is a table of lexical information, such as a "dictionary" or database containing words native to the language of the input text. The processing module includes a plurality of analysis modules which operate upon the input text in order to process the text and generate a computer-understandable semantic representation of the natural language text. Automated natural language analysis systems designed in this manner provide an efficient language analyzer that is of considerable benefit in tasks such as information retrieval.
Typically, the processing of natural language text begins with the processing module fetching a continuous stream of electronic text from an input module. The processing module then decomposes the stream of natural language text into individual words, sentences, and messages. For instance, individual words in the English language can be identified by joining together the string of adjacent character codes found between two consecutive occurrences of a white space code (i.e., a space, tab, or carriage return).
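The white space rule described above can be sketched in a few lines of Python. This is an illustrative sketch only, not part of any system described here; the function name `split_on_white_space` and the `WHITE_SPACE` set are hypothetical, and the line feed character is included as an added assumption alongside the space, tab, and carriage return named in the text.

```python
# Hypothetical set of white space codes: space, tab, and carriage return
# come from the text; the line feed is an added assumption.
WHITE_SPACE = {" ", "\t", "\r", "\n"}


def split_on_white_space(stream):
    """Join adjacent non-white-space characters into words."""
    words = []
    current = []
    for ch in stream:
        if ch in WHITE_SPACE:
            # A white space code closes the word being built, if any.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush the final word when the stream does not end in white space.
    if current:
        words.append("".join(current))
    return words


print(split_on_white_space("The quick\tbrown\r\nfox"))
```

Runs of consecutive white space codes produce no empty words, because a word is emitted only when at least one character has been collected.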
Japanese language text, and text in other Asian languages such as Chinese and Korean, cannot be separated into individual words as easily as English language text. Asian language text typically consists of a continuous string of individual characters, with no white space marking where one word ends and the next begins. Words in these Asian languages are formed of a single character or of successive groups of characters, but the boundaries between the words are not explicitly identified in the written text. The written text does not clearly indicate whether any particular character forms a complete word or whether the particular character is only part of a word. In addition, the written characters may be drawn from one or more character sets. For example, Japanese words may be formed from any of four character types: Katakana, Hiragana, Kanji, and Romaji characters. Identifying these ambiguous word boundaries between the characters is important in electronically translating or processing Asian language documents.
Some prior art systems attempt to determine these word boundaries with simple pattern matching rules, while other prior art systems resort to using a database of Asian language words to identify word breaks in Asian language text. For instance, U.S. Pat. No. 5,029,084, issued to Morohasi et al., discloses a system that combines various pattern matching approaches to determine word boundaries in the text. The Morohasi system identifies character divisions based on character type definitions (e.g., Katakana, Kanji, Hiragana) and then processes the sentence by comparing the characters to a content word dictionary containing Japanese words. For any character segments remaining after this initial processing, a series of compound word synthesizing rules is used to determine the division of the remaining segments. This system has the drawback of performing a costly up-front comparison of the characters in the stream of text against a content word dictionary of the Japanese language.
Other prior art systems use morpheme analysis to determine the word breaks in a Japanese language sentence. U.S. Pat. No. 5,268,840, issued to Chang et al., describes a method and apparatus for morphologizing text. Chang discloses segmenting the input text of characters into the longest morphemes that can be formed from the input text. This is achieved by forming, from the remaining characters in the sentence, the longest morpheme that is listed in a dictionary of valid morphemes, and then determining whether that morpheme is conjunctive with the previously divided morpheme. The conjunctiveness of successive morphemes can be based upon grammar rules that require two adjacent morphemes to obey certain rules of connection.
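The longest-morpheme approach can be sketched as a greedy longest-match loop with a connection check. The Python below is a simplified illustration under stated assumptions, not the Chang implementation: the morpheme set, the `can_follow` rule, and the function name `segment_longest_match` are all hypothetical, unknown characters are emitted one at a time as a fallback, and no backtracking over earlier choices is attempted.

```python
def segment_longest_match(text, morphemes, can_follow):
    """Greedily take the longest dictionary morpheme that may follow the previous one.

    morphemes  -- set of valid morphemes (hypothetical dictionary)
    can_follow -- function(previous, candidate) -> bool, standing in for
                  the grammar rules of connection (hypothetical)
    """
    result = []
    pos = 0
    max_len = max(map(len, morphemes), default=1)
    while pos < len(text):
        previous = result[-1] if result else None
        match = None
        # Try the longest candidate first, then progressively shorter ones.
        for length in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + length]
            if candidate in morphemes and can_follow(previous, candidate):
                match = candidate
                break
        if match is None:
            # Fallback assumption: emit an unmatched character on its own.
            match = text[pos]
        result.append(match)
        pos += len(match)
    return result


morphemes = {"日本", "日本語", "語", "を", "話", "話す", "す"}
print(segment_longest_match("日本語を話す", morphemes, lambda prev, cur: True))
```

Because the loop commits to each morpheme as soon as it is found, a poor early choice is never revisited. A complete analyzer would have to reconsider earlier divisions when a later segment cannot be connected.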
Morphological analyzers of the type disclosed in Chang have efficiency problems. For example, the identification of morphemes beyond the initially identified morpheme may reveal that the earlier identified morphemes are incorrect and require further analysis. This inherently recursive nature of the system causes inefficiencies in the processing of the input text. In addition, the morphological analysis of Chang requires two separate processing steps. In the first step, the system identifies the morphemes themselves, and in the second step the system applies the morphological rules to the entire document. Thus, a morphological analysis system typically requires considerable computer processing effort and frequent database access, resulting in longer processing times, coupled with the ever-present risk of needing to review and reassess earlier faulty analysis.
Accordingly, an object of the invention is to provide a word breaker that efficiently and accurately identifies word breaks in a stream of Asian language text.
Other general and specific objects of the invention will be apparent and evident from the accompanying drawings and the following description.