The identification of word boundaries in continuous text is used in several areas such as word processing, text processing, machine translation, fact extraction, and information retrieval. Prior art methods for identifying word boundaries have used various approaches including whole words; word-initial and word-final n-grams and their frequencies; or a hidden Markov model of n-grams, word boundaries and their frequencies.
The article J. Guo, "An Efficient and Complete Algorithm for Unambiguous Word Boundary Identification", formerly found at http://sunzi.iss.nus.sg:1996/guojin/papers/acbci/acbci.html and as referenced in J. Guo, "A Comparative Experimental Study on English and Chinese Word Boundary Ambiguity," Proceedings of the International Conference on Chinese Computing 96 (ICC 96), June 4-7, 1996, Singapore (National University of Singapore, Singapore), pp. 50-55, discloses a method which uses whole words implemented by an Aho-Corasick finite-state automaton. Another prior art method which uses a dictionary of whole words is U.S. Pat. No. 5,448,474, "Method for isolation of Chinese words from connected text". The foregoing references are herein incorporated by reference. A disadvantage of methods using whole words or entire vocabularies is the amount of storage space required. In addition, only words included in the dictionary can be identified. Finally, such methods cannot rank or order competing word boundary candidates, or establish the best word boundary among them.
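The limitations of whole-word matching can be illustrated with a minimal sketch. It is not the method of the cited references: a naive substring scan stands in for the Aho-Corasick automaton, and the function name `find_dictionary_words` and the toy dictionary are assumptions made for illustration only.

```python
def find_dictionary_words(text, dictionary):
    """Return every (start, end, word) span of text that is a dictionary word.

    Naive scan used only for illustration; an Aho-Corasick automaton
    would report the same spans in a single pass over the text.
    """
    max_len = max(map(len, dictionary), default=0)
    hits = []
    for start in range(len(text)):
        # Try every candidate length up to the longest dictionary word.
        last_end = min(len(text), start + max_len)
        for end in range(start + 1, last_end + 1):
            if text[start:end] in dictionary:
                hits.append((start, end, text[start:end]))
    return hits


# Hypothetical toy dictionary (illustrative only).
TOY_DICTIONARY = {"no", "now", "here", "where"}

if __name__ == "__main__":
    # Overlapping candidates are returned with no ranking among them.
    print(find_dictionary_words("nowhere", TOY_DICTIONARY))
    # A string containing no dictionary word yields no candidates at all.
    print(find_dictionary_words("xyz", TOY_DICTIONARY))
```

The scan reports both "no" + "where" and "now" + "here" as candidate spans but gives no basis for preferring one over the other, and it finds nothing for text outside the dictionary.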
Several methods have attempted to overcome the problems presented by using a dictionary of whole words. U.S. Pat. No. 5,806,021, "Automatic Segmentation of Continuous Text Using Statistical Approaches," Chen et al., discloses a method which uses two statistical steps. First, forward and backward matching is performed using a vocabulary with unigram frequencies. Then, a score is calculated using statistical language models. Another prior art method uses a combination of rules, statistics and a dictionary. (See U.S. Pat. No. 5,029,084, "Japanese Language Sentence Dividing Method and Apparatus", Morohasi et al.) The foregoing references are herein incorporated by reference.
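The general idea of forward and backward matching followed by statistical scoring can be sketched as follows. This is a simplified illustration, not the procedure of the cited patent: longest-match segmentation is assumed for both directions, a smoothed unigram log-probability stands in for the statistical language models, and all names and counts (`VOCAB`, `forward_match`, `backward_match`, `score`, `segment`) are hypothetical.

```python
import math

# Hypothetical vocabulary with unigram counts (illustrative only).
VOCAB = {"no": 60, "now": 30, "here": 20, "where": 25}
MAX_LEN = max(len(w) for w in VOCAB)


def forward_match(text, vocab=VOCAB, max_len=MAX_LEN):
    """Greedy longest match scanning left to right."""
    words, i = [], 0
    while i < len(text):
        for n in range(min(max_len, len(text) - i), 0, -1):
            # Fall back to a single character when nothing matches.
            if text[i:i + n] in vocab or n == 1:
                words.append(text[i:i + n])
                i += n
                break
    return words


def backward_match(text, vocab=VOCAB, max_len=MAX_LEN):
    """Greedy longest match scanning right to left."""
    words, j = [], len(text)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            if text[j - n:j] in vocab or n == 1:
                words.append(text[j - n:j])
                j -= n
                break
    words.reverse()
    return words


def score(words, vocab=VOCAB):
    """Sum of add-one-smoothed unigram log-probabilities."""
    denom = sum(vocab.values()) + len(vocab) + 1
    return sum(math.log((vocab.get(w, 0) + 1) / denom) for w in words)


def segment(text):
    """Keep whichever of the two candidate segmentations scores higher."""
    return max(forward_match(text), backward_match(text), key=score)


if __name__ == "__main__":
    print(forward_match("nowhere"))
    print(backward_match("nowhere"))
    print(segment("nowhere"))
```

With these toy counts the two scan directions disagree on "nowhere", and the unigram score is what selects between them, which is the kind of ranking a plain whole-word dictionary lookup cannot provide.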