Data mining applications routinely rely on an analysis of document-specific features to perform document classification and/or clustering functions. Examples of such document-specific features include, but are not limited to, the presence or absence of particular words in the document text and/or the number of times that particular words or particular sequences of words appear in the document text. Document-specific features are typically used to index documents. Search engines routinely identify the documents that best match a given search query based on the document-specific features associated with each of the individual documents in an indexed document set.
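The kinds of document-specific features described above can be illustrated with a minimal Python sketch. The function name `extract_features`, the tokenizing pattern, and the sample sentence are assumptions made for illustration only; they are not taken from any particular data mining application.

```python
import re
from collections import Counter


def extract_features(text):
    """Count words and consecutive word pairs in space-delimited text.

    Hypothetical helper: the tokenizing pattern is a simplification that
    only handles lowercase ASCII letters, digits, and apostrophes.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {
        # Number of times each word appears (a missing key means absence).
        "word_counts": Counter(words),
        # Number of times each two-word sequence appears.
        "pair_counts": Counter(zip(words, words[1:])),
    }


features = extract_features("The cat sat on the mat. The cat slept.")
print(features["word_counts"]["the"])           # 3
print(features["pair_counts"][("the", "cat")])  # 2
print("dog" in features["word_counts"])         # False
```

An index built from such counts presupposes that the text can first be split into words, which is the step that becomes difficult for Chinese text.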
In most languages, individual words within the written text are easily discernible because the words within the text are delimited by spaces or punctuation. Splitting document text into individual words in such languages is typically a fairly straightforward process. In some languages, however, such as the Chinese language, the written text does not include any indication of breaks between consecutive words. Chinese text typically consists of one or more consecutive sequences of characters, and a single character sequence may include more than one word. The reader typically infers the breaks between the Chinese words in the Chinese character sequence based on the context of the words within the Chinese character sequence.
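The contrast between delimited and undelimited text can be sketched in a few lines of Python. The function name `split_on_delimiters` and the sample strings are illustrative assumptions; the Chinese sample is intended to read roughly "data mining is very useful".

```python
import re


def split_on_delimiters(text):
    """Split text on runs of spaces and punctuation (hypothetical helper)."""
    # In Python 3, \W matches any character that is not a Unicode letter,
    # digit, or underscore, so it covers spaces and punctuation marks.
    return [token for token in re.split(r"\W+", text) if token]


english = split_on_delimiters("Data mining is useful.")
chinese = split_on_delimiters("数据挖掘很有用。")

print(english)  # ['Data', 'mining', 'is', 'useful']
print(chinese)  # ['数据挖掘很有用']
```

The delimiter-based split recovers four English words, but returns the Chinese character sequence as a single unsplit token because no breaks are marked between its words.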
One prior art data mining solution simply ignores the Chinese text present in a document and relies on the non-Chinese text (typically English text) that may be present within the body of the document to extract document specific features. However, documents that include only Chinese text cannot be processed using this particular prior art solution as this prior art solution lacks the ability to identify and process Chinese words.
Another prior art data mining solution treats each individual Chinese character within the Chinese text contained within a document as a separate feature. However, in some cases, a character may be a component of a number of different Chinese words. In many cases, the different Chinese words that share a common character have very little in common with each other. As a result, treating each individual Chinese character as a feature may lead to errors in document indexing.
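The weakness of the character-per-feature approach can be illustrated with a minimal Python sketch. The function name `character_features` and the two sample words are assumptions chosen for illustration: 花生 ("peanut") and 花钱 ("to spend money") share a first character but are unrelated in meaning.

```python
from collections import Counter


def character_features(sequence):
    """Treat every individual character as its own feature (hypothetical helper)."""
    return Counter(sequence)


doc_a = character_features("花生")  # "peanut"
doc_b = character_features("花钱")  # "to spend money"

# Features that the two documents appear to have in common.
shared = set(doc_a) & set(doc_b)
print(shared)  # {'花'}
```

Under this scheme the two documents appear to share a feature even though the words they contain have little in common, which is the kind of indexing error described above.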
Another prior art data mining solution leverages the fact that most Chinese words are two characters long. Each consecutive two-character string within a Chinese character sequence is treated as a word or feature. For example, if a Chinese character sequence includes a sequence of five characters, the first and second characters are treated as a first word, the second and third characters are treated as a second word, the third and fourth characters are treated as a third word, and the fourth and fifth characters are treated as a fourth word.
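The overlapping two-character approach can be sketched as follows. The function name `overlapping_bigrams` and the five-character sample sequence are illustrative assumptions rather than part of any particular prior art implementation.

```python
def overlapping_bigrams(sequence):
    """Return every consecutive two-character string in the sequence."""
    # A sequence of n characters yields n - 1 overlapping pairs.
    return [sequence[i:i + 2] for i in range(len(sequence) - 1)]


# Five characters produce four candidate "words".
candidates = overlapping_bigrams("今天天气好")
print(candidates)  # ['今天', '天天', '天气', '气好']
```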
Another prior art data mining solution also leverages the fact that most Chinese words are two characters long. Each Chinese character sequence is segmented into consecutive two-character words, with each word starting at an odd-numbered character within the Chinese character sequence. For example, if a character sequence includes six characters, the first and second characters are treated as a first word, the third and fourth characters are treated as a second word, and the fifth and sixth characters are treated as a third word.
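A minimal sketch of this non-overlapping segmentation is shown below. The function name `non_overlapping_bigrams` is hypothetical, and the handling of an odd-length sequence (keeping the leftover character as a one-character segment) is an assumption, since the description above only covers an even-length example.

```python
def non_overlapping_bigrams(sequence):
    """Cut the sequence into consecutive two-character segments.

    Segments start at the 1st, 3rd, 5th, ... characters (indices 0, 2, 4).
    Assumption: a leftover final character is kept as its own segment.
    """
    return [sequence[i:i + 2] for i in range(0, len(sequence), 2)]


# Six characters produce three candidate "words".
print(non_overlapping_bigrams("数据挖掘应用"))  # ['数据', '挖掘', '应用']

# Odd-length input leaves a trailing one-character segment (assumed behavior).
print(non_overlapping_bigrams("ABCDE"))  # ['AB', 'CD', 'E']
```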
Since not all Chinese words are two characters long, the presence of a word with more than two characters, such as a three-character word, can introduce errors into the mined data. Furthermore, when every consecutive two-character string is treated as a word, the data mining application may capture a large number of non-words formed by combining the last character of a first two-character Chinese word with the first character of the next two-character Chinese word. The number of non-words captured by the data mining application may dwarf the number of actual Chinese words retrieved, thereby reducing the accuracy of the indexing of the documents.
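Both failure modes can be illustrated with a short Python sketch. The helper names `bigram_candidates` and `spurious`, the sample sequences, and the lists of intended words are assumptions made for illustration; the intended words stand in for what a human reader would recognize.

```python
def bigram_candidates(sequence, step):
    """Return two-character candidates: step=1 overlaps, step=2 does not.

    Only full two-character candidates are returned in this sketch.
    """
    return [sequence[i:i + 2] for i in range(0, len(sequence) - 1, step)]


def spurious(candidates, intended_words):
    """Return the candidates that are not among the intended words."""
    return [c for c in candidates if c not in intended_words]


# Three two-character words: overlapping pairs straddle the word boundaries.
pairs = bigram_candidates("数据挖掘应用", step=1)
print(pairs)                                        # ['数据', '据挖', '挖掘', '掘应', '应用']
print(spurious(pairs, {"数据", "挖掘", "应用"}))     # ['据挖', '掘应']

# A three-character word followed by a two-character word: the fixed
# two-character grid never lines up with either intended word.
grid = bigram_candidates("图书馆开放", step=2)
print(grid)                                         # ['图书', '馆开']
print(spurious(grid, {"图书馆", "开放"}))            # ['图书', '馆开']
```

In the first case two of the five candidates straddle word boundaries; in the second, the three-character word shifts the alignment so that neither intended word is captured.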
Another prior art data mining solution requires that the Chinese words in a document be separated by some form of a word separation character in order to perform Chinese word related data mining functions. For example, the Unicode character set has a zero-width space character (U+200B) that is intended to be used as a word separation character to logically separate words that are not displayed with visible separation. While the use of word separation characters facilitates the splitting of Chinese text into Chinese words, most documents that include Chinese text do not use word separation characters. Most authors of Chinese documents, such as Chinese typists, are typically not trained to use such word separation characters. This prior art data mining solution lacks the capacity to process Chinese text that does not include word separation characters.
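A separator-based split can be sketched as follows. The function name `split_on_separator` is hypothetical, and the choice of U+200B (the Unicode zero-width space) as the default separator is an assumption of this sketch; the separator is left as a parameter so that another code point can be substituted.

```python
ZERO_WIDTH_SPACE = "\u200b"  # assumed word separation character for this sketch


def split_on_separator(text, separator=ZERO_WIDTH_SPACE):
    """Split text at each invisible word separation character."""
    return [word for word in text.split(separator) if word]


# Text in which the author inserted separators between the words.
marked = "数据" + ZERO_WIDTH_SPACE + "挖掘" + ZERO_WIDTH_SPACE + "应用"
print(split_on_separator(marked))          # ['数据', '挖掘', '应用']

# Typical text without separators comes back as one unsplit sequence.
print(split_on_separator("数据挖掘应用"))  # ['数据挖掘应用']
```

The second call shows the limitation noted above: without separators in the source text, this approach has nothing to split on.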
Another prior art data mining solution performs data mining operations by recognizing Chinese words that are contained within a Chinese dictionary or list of known Chinese words. Chinese dictionaries are often both large and incomplete. Their size makes it impractical to fold the dictionary into a tool that can be transmitted over a network or stored on a small appliance. Using a Chinese dictionary often requires either a large amount of RAM to hold the dictionary or a large number of accesses to a storage device storing the dictionary, and multiple accesses to a storage device may slow down the operations of a data mining application. New terms are also continually being added to the Chinese vocabulary, especially in technical areas. The use of an incomplete Chinese dictionary may therefore result in irrelevant words being mined from a document and potentially relevant words being missed.
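One common way to apply a word list is forward maximum matching, sketched below. This is a stand-in chosen for illustration and is not necessarily the matching strategy used by the prior art solution described above. The function name `forward_maximum_match`, the `max_word_len` parameter, and the tiny sample dictionary are all assumptions.

```python
def forward_maximum_match(sequence, dictionary, max_word_len=4):
    """Greedily take the longest dictionary word at each position.

    Assumption: a character that starts no dictionary word is emitted
    as a one-character segment so that the scan always advances.
    """
    words = []
    i = 0
    while i < len(sequence):
        longest = min(max_word_len, len(sequence) - i)
        for length in range(longest, 0, -1):
            candidate = sequence[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words


dictionary = {"数据", "挖掘", "图书馆"}

# All words are in the dictionary, including a three-character word.
print(forward_maximum_match("图书馆数据挖掘", dictionary))
# ['图书馆', '数据', '挖掘']

# 区块链 ("blockchain") is missing from the dictionary, so it is broken
# into single characters and never recognized as a word.
print(forward_maximum_match("区块链数据", dictionary))
# ['区', '块', '链', '数据']
```

The second call illustrates the incompleteness problem: a newer technical term that is absent from the dictionary is lost, while the lookup cost grows with the size of the dictionary that must be held in memory or on storage.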
Thus what is needed is a system and method of splitting a Chinese character sequence into word segments that seeks to overcome one or more of the challenges and/or obstacles described above.