Among the world's written languages, one type, such as English and German, has word boundary markers and generally uses a space to separate words. Another type, which includes Chinese, Japanese, and Korean, has no delimiter between the words of a sentence. As computer technology develops, applications such as search engines, text retrieval, and machine translation all involve text processing. Therefore, how to segment a sentence into words and phrases has become a primary concern.
For ease of description, the Chinese language is used as an example below, although the description applies equally to other languages of the latter type. Chinese word segmentation technology has been studied for several decades. As early as the 1980s, researchers began to investigate how to use computers to automatically segment Chinese text. Word segmentation refers to the process of identifying each meaningful term in a sentence.
One example of Chinese word segmentation is word matching. Terms in a given string of Chinese text are matched against corresponding entries in a dictionary (or lexicon). When no match can be found in the dictionary for a given portion of text, that portion is further segmented into individual Chinese characters. In this way, simple word segmentation can be completed, and word sequences can then be formed.
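The dictionary-matching process described above can be sketched as a forward maximum matching pass over the input, falling back to single characters when no dictionary entry applies. The dictionary, the input string, and the function name `segment` below are illustrative assumptions, not part of any particular system, and Latin letters stand in for Chinese characters.

```python
def segment(sentence, dictionary, max_len=4):
    """Greedily match the longest dictionary term at each position;
    fall back to a single character when no term matches."""
    result = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, shrinking one character at a time;
        # j == i + 1 guarantees we always consume at least one character.
        for j in range(min(len(sentence), i + max_len), i, -1):
            if sentence[i:j] in dictionary or j == i + 1:
                result.append(sentence[i:j])
                i = j
                break
    return result

# Each letter plays the role of one Chinese character.
dictionary = {"ab", "abc", "bcd", "d"}
print(segment("abcd", dictionary))  # ['abc', 'd']
```

Because the match is greedy, this sketch commits to the longest term at each step and cannot reconsider earlier choices, which is exactly where the ambiguity problems discussed below arise.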
For example, a sentence such as “” can be processed as described above by looking up a dictionary. In particular, this sentence can be segmented into the terms of “” (English translation of the terms: China-aerospace-officials-are invited-to-U.S.-to meet with-space-agency-officials). However, this approach may fail when ambiguity exists. For example, “” may be mistakenly segmented into “” (English translation of the terms: develop-China-home) while the correct answer is “” (English translation of the terms: developing-country). As another example, “” may be mistakenly segmented into “” (English translation of the terms: Shanghai University-town-bookstore) while the correct answer is “” (English translation of the terms: Shanghai-College Town-bookstore).
In order to resolve such ambiguity, all possible word sequences need to be considered. In one of the examples above, the phrase “” can be segmented into the two word sequences “” and “”. Consequently, some optimization rule for selecting a word sequence is needed in order to select the latter as the optimal word sequence.
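Considering all possible word sequences amounts to enumerating every way the input can be split into dictionary terms. A minimal sketch, using an invented dictionary in which the string "abc" is ambiguous (in the same spirit as the developing-country example):

```python
def all_segmentations(sentence, dictionary):
    """Return every way to split `sentence` into dictionary terms."""
    if not sentence:
        return [[]]  # one way to segment the empty string: no terms
    results = []
    for j in range(1, len(sentence) + 1):
        term = sentence[:j]
        if term in dictionary:
            # Keep this term, then segment the remainder recursively.
            for rest in all_segmentations(sentence[j:], dictionary):
                results.append([term] + rest)
    return results

dictionary = {"a", "ab", "b", "c", "bc"}
for seq in all_segmentations("abc", dictionary):
    print(seq)
```

Here "abc" admits three candidate word sequences (['a', 'b', 'c'], ['a', 'bc'], and ['ab', 'c']), so a selection rule is needed to pick one of them.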
A maximum matching method known as MMSEG is a simple algorithm for selecting an optimal word sequence based on a number of rules, such as largest term matching and maximum average word length. Another, more sophisticated method is the statistical language model proposed by Dr. Jin Guo of Tsinghua University in the 1990s.
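A loose sketch of how rules like the two named above can rank candidate word sequences: prefer candidates that match the most text in total, then those with the largest average word length. This is only an illustration inspired by MMSEG's rule style, not the actual MMSEG algorithm, and the candidate lists are invented.

```python
def rank_key(words):
    """Ranking key: (total matched length, average word length)."""
    total = sum(len(w) for w in words)
    return (total, total / len(words))

def choose(candidates):
    # max() applies the rules in order: total length first,
    # then average word length as a tie-breaker.
    return max(candidates, key=rank_key)

candidates = [["a", "b", "c"], ["ab", "c"], ["a", "bc"]]
print(choose(candidates))  # a two-term candidate wins on average length
```

With all candidates covering the same text, the average-length rule favors the segmentations with fewer, longer terms over the character-by-character split.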
The statistical model computes the probability of occurrence of the sentence under each candidate word segmentation and takes the word sequence having the highest probability as the optimal word sequence. Simply put, the probability of occurrence of a sentence is the product of the probabilities of its terms, each conditioned on the terms that precede it. For the first word sequence in the above example, its probability is the probability of starting with “” multiplied by the probability of “” following “”, and further multiplied by the probability of “” following “” and “”. This method of selecting an optimal word sequence has proven to be accurate and effective.
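The chain-of-products scoring described above can be sketched with a toy bigram model, where each term's probability is conditioned only on the immediately preceding term. The probability table below is entirely made up for illustration; real systems estimate these values from a corpus.

```python
# Hand-set illustrative bigram probabilities; "<s>" marks sentence start.
bigram_prob = {
    ("<s>", "developing"): 0.4,
    ("developing", "country"): 0.5,
    ("<s>", "develop"): 0.3,
    ("develop", "China"): 0.1,
    ("China", "home"): 0.05,
}

def sequence_prob(words):
    """P(w1..wn) ~ product of P(w_i | w_{i-1})."""
    prev = "<s>"
    p = 1.0
    for w in words:
        p *= bigram_prob.get((prev, w), 1e-6)  # tiny floor for unseen pairs
        prev = w
    return p

print(sequence_prob(["developing", "country"]))        # 0.4 * 0.5
print(sequence_prob(["develop", "China", "home"]))     # 0.3 * 0.1 * 0.05
```

Comparing the two scores, the model assigns a higher probability to the developing-country reading, so that word sequence would be selected as optimal.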
However, this simple and effective approach to word segmentation has a relatively serious problem: when a sentence is long, the number of possible word sequences grows rapidly. If all possible word sequences are exhaustively enumerated and the probability of the sentence is computed for each, the computational load may be tremendous. The statistical model is not alone in facing this problem; other methods of selecting an optimal word sequence also encounter excessively large computational loads.
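The scale of this computational load can be made concrete with a worst-case count: if every substring of an n-character sentence were a valid term, each of the n-1 character boundaries could independently be a split point or not, giving 2^(n-1) distinct word sequences. The numbers below simply evaluate that formula.

```python
def count_segmentations(n):
    """Worst-case number of segmentations of an n-character sentence,
    assuming every substring is a valid term: one binary split/no-split
    choice at each of the n-1 internal boundaries."""
    return 2 ** (n - 1)

for n in (5, 10, 20, 40):
    print(n, count_segmentations(n))  # e.g. 10 characters -> 512 sequences
```

Even a 40-character sentence yields on the order of 10^11 candidates in this worst case, which is why exhaustive enumeration quickly becomes impractical.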