1. Field of the Invention
The present invention relates generally to an improved data processing system and in particular to a computer implemented method, an apparatus, and a computer program product for learning word segmentation from non-white space language corpora.
2. Description of the Related Art
Text tokenization in Asian language scripts is problematic because word boundaries are not marked by spaces. In addition, a non-white space language, such as Chinese, has no morphological markers, and the concept of a word, in a Western language sense, is arguable. The term “non-white space” language simply means a language that does not use white spacing to separate words in the way Western languages typically do. Written Chinese is built from a set of thousands of characters, in which each symbol may represent a morpheme or a syllable. A set refers to a collection of one or more items. For example, a set of characters is one or more characters. In classic Chinese, each character corresponds to one morpheme, which is a meaningful unit of language. Modern Chinese has a tendency to form new words by combining several symbols. Therefore, a Chinese word can consist of more than one character or morpheme, usually two, but sometimes three or more. Each morpheme has a certain meaning, but when combined with other morphemes, the original meaning may be altered, which may even change the sentence structure. Thus, any word segmentation method must resolve the ambiguity caused by the various ways in which characters can combine.
One previous solution attempted to resolve overlapping ambiguities in Chinese word segmentation using adapted classifiers that could be trained on an unlabelled Chinese text corpus. In this example, an attempt was made to identify Chinese words. In another example, a facility selects, from a sequence of natural language characters, combinations of characters that may be words, using indications stored for each character of the sequence. For each of a plurality of contiguous combinations of characters occurring in the sequence, the facility determines whether the character occurring in the second position of the combination is indicated to occur in words that begin with the character occurring in the first position of the combination. Thus, candidate words are identified through analysis of the text.
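The second-position check described above can be illustrated with a minimal sketch. The word list, the function names build_following_index and candidate_pairs, and the choice to derive the per-character indications from a small word list are all assumptions made for illustration; they are not taken from the prior solution itself.

```python
from collections import defaultdict


def build_following_index(words):
    """Map each first character to the characters seen in second position.

    Hypothetical stand-in for the per-character "indications": here they
    are derived from a toy word list.
    """
    index = defaultdict(set)
    for word in words:
        if len(word) >= 2:
            index[word[0]].add(word[1])
    return index


def candidate_pairs(text, index):
    """Return each contiguous two-character combination whose second
    character is indicated to follow its first character in some word."""
    pairs = []
    for i in range(len(text) - 1):
        first, second = text[i], text[i + 1]
        if second in index.get(first, ()):
            pairs.append(first + second)
    return pairs


# Toy word list, assumed for illustration only.
index = build_following_index(["中国", "国家", "人民"])
print(candidate_pairs("中国人民", index))  # ['中国', '人民']
```

In this sketch the combination "国人" is rejected because no word in the toy list begins with "国" and continues with "人". A check of this kind only proposes candidates; it does not by itself decide between overlapping candidates.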
Tokenization, which may include text segmentation, is a process of demarcating and possibly classifying sections of a string of input characters, whether they are words or other text segments. There are several known tokenization techniques. For example, in one method, tokenization is based entirely on lexical resources and linguistic information. This method is only as accurate as the coverage of the lexicon and the tokenization rules, and may lead to partial processing when information is missing. In another example, N-gram tokenization is commonly used to segment text in languages such as Chinese, Japanese, and Thai. This method may not always be accurate enough because it does not take lexical information, such as the entries of a lexicon and the linguistic rules, into consideration.
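The two families of techniques can be contrasted with a minimal sketch. The text does not name a specific lexicon-based algorithm, so greedy forward maximum matching is assumed here as a representative, and overlapping character bigrams are assumed as the N-gram scheme. The lexicon contents, the function names, and the max_len parameter are hypothetical.

```python
def max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching against a lexicon.

    At each position the longest lexicon entry is taken; when nothing
    matches, a single character is emitted so processing can continue.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def char_ngrams(text, n=2):
    """Overlapping character n-grams; no lexicon is consulted."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Toy lexicon, assumed for illustration only.
lexicon = {"研究", "研究生", "生命", "起源"}
print(max_match("研究生命起源", lexicon))  # ['研究生', '命', '起源']
print(char_ngrams("研究生命", 2))          # ['研究', '究生', '生命']
```

In this sketch the lexicon-based pass greedily takes "研究生" and strands "命", even though "研究", "生命", and "起源" are all lexicon entries, which illustrates how the result depends on both lexicon coverage and the matching rules. The bigram pass emits "究生", which is not a word in the toy lexicon, because it uses no lexical information at all.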
Therefore, it would be advantageous to have a method, apparatus, and computer program product for breaking text in a manner that overcomes some or all of the problems discussed above.