1. Technical Field
The invention disclosed broadly relates to data processing systems and methods, and more particularly to a linguistic method for isolating Chinese words from connected Chinese text.
2. Background Art
The Chinese language is written with "logographs," each of which represents one syllable and usually a concept or meaningful unit. Chinese is traditionally written without spaces between these logographs. A Chinese "word" may consist of one or more of these logographs, and a reader of Chinese must identify the boundaries of these words to make sense of the text.
Chinese documents in electronic form are also written without spaces, which makes it difficult for computer applications such as Information Storage and Retrieval (IS/R) to identify terms for use in a mechanized index. The problem for IS/R can be solved by the brute-force approach of indexing every character of the text, so that any combination of characters can be searched, but this is inefficient: it consumes excessive index space and, because a query may match a character sequence that straddles word boundaries, it retrieves many irrelevant results (low precision).
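The brute-force approach described above can be illustrated with a minimal Python sketch. The function names `build_char_index` and `search`, and the three sample documents, are hypothetical and chosen only for illustration; they are not part of the disclosure. The sketch stores one index posting per character of text and answers a query by looking for the query characters at consecutive positions.

```python
from collections import defaultdict

def build_char_index(docs):
    """Map each character to the set of (doc_id, position) pairs where it occurs."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for pos, ch in enumerate(text):
            index[ch].add((doc_id, pos))
    return index

def search(index, query):
    """Return ids of documents that contain query as a contiguous character run."""
    if not query:
        return set()
    hits = set()
    for doc_id, pos in index.get(query[0], ()):
        # Every following query character must sit at the next position.
        if all((doc_id, pos + k) in index.get(query[k], ())
               for k in range(1, len(query))):
            hits.add(doc_id)
    return hits

# Hypothetical sample documents (illustrative only).
docs = ["我爱北京", "北京大学生活", "他是大学生"]
index = build_char_index(docs)
postings = sum(len(p) for p in index.values())  # one posting per character of text
matches = search(index, "大学生")                # matches documents 1 and 2
```

In this example the query 大学生 ("university student") retrieves document 2 correctly, but it also retrieves document 1, where the same three characters arise only because the words 北京大学 ("Peking University") and 生活 ("life") happen to be adjacent. This is the loss of precision noted above, and the index holds as many postings as there are characters in the collection.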
While the IS/R problem can be solved without identifying the words of Chinese text, other applications, such as computer-assisted translation, require accurate identification of the words in order to provide a meaningful translation. It is the object of this invention to define a process for identifying all the words in a Chinese text string, for resolving overlapping words into a set of adjacent words through successively stricter filtering mechanisms that eliminate illogical segmentations, and for resolving ambiguities by the use of frequency criteria and grammatical constraints.
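The general shape of such a process can be sketched in Python under stated assumptions. The lexicon and its frequency counts below are hypothetical, as are the function names `find_candidates`, `overlapping_pairs`, and `segment`. The selection rule used here (fewest words, with ties broken by highest total frequency) is a simple stand-in heuristic and not the filtering mechanism of the invention; grammatical constraints are omitted entirely.

```python
def find_candidates(text, lexicon):
    """Return every (start, end, word) span of text that is a lexicon entry."""
    max_len = max(map(len, lexicon), default=1)
    spans = []
    for start in range(len(text)):
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in lexicon:
                spans.append((start, end, word))
    return spans

def overlapping_pairs(spans):
    """Return pairs of candidate spans that share at least one character."""
    return [(a, b) for i, a in enumerate(spans) for b in spans[i + 1:]
            if a[0] < b[1] and b[0] < a[1]]

def segment(text, lexicon):
    """Choose adjacent words covering text: fewest words, then highest total count."""
    max_len = max(map(len, lexicon), default=1)
    # best[i] = (word_count, -total_frequency, words) for the prefix text[:i]
    best = [(0, 0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word not in lexicon and end - start > 1:
                continue  # unknown strings are accepted only as single characters
            count, neg_freq, words = best[start]
            cand = (count + 1, neg_freq - lexicon.get(word, 0), words + [word])
            if best[end] is None or cand[:2] < best[end][:2]:
                best[end] = cand
    return best[-1][2]

# Hypothetical lexicon with made-up frequency counts (illustrative only).
lexicon = {"北京": 50, "大学": 40, "北京大学": 30,
           "大学生": 20, "学生": 35, "生活": 45}
text = "北京大学生活"
candidates = find_candidates(text, lexicon)
conflicts = overlapping_pairs(candidates)
words = segment(text, lexicon)  # ['北京大学', '生活']
```

For this input, all six lexicon entries are found as candidates, and several of them overlap, for example 大学生 and 生活, which compete for the character 生. The stand-in rule resolves the overlaps into the two adjacent words 北京大学 and 生活.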