The present invention relates to segmenting text. In particular, the present invention relates to segmenting text that is not delimited by spaces.
In many languages, such as Chinese and Japanese, it is difficult to segment characters into words because the words are not delimited by spaces. Methods used in the past to perform such segmentation can roughly be classified as either dictionary-based methods or statistical-based methods.
In dictionary-based methods, substrings of characters in the input text are compared to entries in a dictionary. If a substring is found in the dictionary, it is kept as a possible segment of the text. Note that in order to achieve the proper segmentation, every word in the text string must be found in the dictionary. Unfortunately, because new words, especially named entities, are continually introduced into the language, it is impossible to keep a dictionary complete at all times.
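As a minimal illustrative sketch of the dictionary-based approach described above (not taken from any particular prior system), the following Python function enumerates every way to split a string into substrings that all appear in a dictionary. The function name `dictionary_segmentations`, the sample string, and the toy dictionaries are assumptions introduced only for illustration.

```python
def dictionary_segmentations(text, dictionary):
    """Return every split of `text` whose pieces are all dictionary entries."""
    results = []

    def extend(start, segments):
        # The whole string has been covered: record this segmentation.
        if start == len(text):
            results.append(list(segments))
            return
        # Try every substring beginning at `start` as the next segment.
        for end in range(start + 1, len(text) + 1):
            candidate = text[start:end]
            if candidate in dictionary:
                segments.append(candidate)
                extend(end, segments)
                segments.pop()

    extend(0, [])
    return results


# Hypothetical toy dictionary; a real dictionary would be far larger.
full_dictionary = {"北京", "大学", "大学生", "北京大学", "生"}
ambiguous = dictionary_segmentations("北京大学生", full_dictionary)
# ambiguous == [['北京', '大学', '生'], ['北京', '大学生'], ['北京大学', '生']]

# If a word is missing from the dictionary, no complete segmentation exists.
incomplete_dictionary = {"北京", "大学"}
failed = dictionary_segmentations("北京大学生", incomplete_dictionary)
# failed == []
```

The second call illustrates the limitation noted above: when even one word of the text is absent from the dictionary, the lookup alone cannot produce a complete segmentation.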
In statistical-based systems, a scoring mechanism is used to segment the text. Under such systems, models are trained based on a corpus of segmented text. These models describe the likelihood of various segments appearing in a text string. The models are applied to possible segmentations of an input text string to produce a score for each segmentation. The segmentation that provides the highest score is generally selected as the segmentation of the text.
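The scoring idea above can be sketched, under simplifying assumptions, with a unigram model: word counts are gathered from a small segmented corpus, each candidate segment is scored by a smoothed log probability, and dynamic programming selects the highest-scoring segmentation. The function names, the toy corpus, the add-one smoothing, and the per-character penalty for unseen segments are all assumptions chosen for illustration; actual statistical systems may use quite different models.

```python
import math
from collections import Counter


def train_unigram(segmented_corpus):
    """Count word occurrences in a corpus given as lists of segmented words."""
    counts = Counter(word for sentence in segmented_corpus for word in sentence)
    return counts, sum(counts.values())


def log_prob(word, counts, total):
    """Add-one smoothed log probability; unseen words pay a per-character cost."""
    denom = total + len(counts) + 1
    if counts[word] > 0:
        return math.log((counts[word] + 1) / denom)
    # Assumed ad hoc penalty: longer unseen strings are less likely.
    return len(word) * math.log(1 / denom)


def best_segmentation(text, counts, total, max_len=4):
    """Return (score, segments) for the highest-scoring split of `text`."""
    # best[i] holds the best (score, segments) found for text[:i].
    best = [(0.0, [])] + [(-math.inf, None)] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(max(0, end - max_len), end):
            prev_score, prev_segments = best[start]
            if prev_segments is None:
                continue
            word = text[start:end]
            score = prev_score + log_prob(word, counts, total)
            if score > best[end][0]:
                best[end] = (score, prev_segments + [word])
    return best[len(text)]


# Hypothetical three-sentence training corpus of already segmented text.
corpus = [
    ["北京大学", "很", "大"],
    ["北京", "大学生", "很", "多"],
    ["大学生", "很", "多"],
]
counts, total = train_unigram(corpus)
score, segments = best_segmentation("北京大学生", counts, total)
# segments == ['北京', '大学生']
```

Because unseen segments still receive a (low) score, this sketch always returns some segmentation, which reflects how statistical systems avoid the need for a complete dictionary.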
Although such systems overcome the requirement of having every word in the input text in a dictionary, they also have limitations. In particular, such systems are not able to identify an entity type, such as name, location, or date, for a word if the word is not in the dictionary. As a result, after the segmentation, additional processing must be done to categorize unknown words into entity types so that further lexical processing can be performed.
One system attempted to solve this problem by using finite state transducers. A separate finite state transducer was provided for each entity type. During segmentation, the text would be applied to the finite state transducers. If a sub-string of characters satisfied a path of states through a finite state transducer, the sub-string would be identified as a possible segment and it would be tagged with the type associated with the finite state transducer.
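The per-type arrangement described above can be illustrated with a simplified sketch. Here each transducer is reduced to a deterministic acceptor that emits a single type tag, and every substring that follows a path to an accepting state is reported as a candidate segment with that tag. The class name `TypedAutomaton`, the state names, the `DATE` and `MONEY` patterns, and the sample text are hypothetical and are not drawn from the system referred to above.

```python
class TypedAutomaton:
    """A deterministic acceptor that tags accepted substrings with one type."""

    def __init__(self, entity_type, transitions, start, accepting):
        self.entity_type = entity_type
        self.transitions = transitions  # state -> list of (predicate, next_state)
        self.start = start
        self.accepting = accepting

    def accepts(self, substring):
        state = self.start
        for ch in substring:
            for predicate, next_state in self.transitions.get(state, []):
                if predicate(ch):
                    state = next_state
                    break
            else:
                return False  # no transition matched this character
        return state in self.accepting


# Assumed toy pattern: digits followed by a date unit, possibly repeated.
date_automaton = TypedAutomaton(
    "DATE",
    {
        "start": [(str.isdigit, "num")],
        "num": [(str.isdigit, "num"), (lambda ch: ch in "年月日", "unit")],
        "unit": [(str.isdigit, "num")],
    },
    start="start",
    accepting={"unit"},
)

# Assumed toy pattern: digits followed by the currency character.
money_automaton = TypedAutomaton(
    "MONEY",
    {
        "start": [(str.isdigit, "num")],
        "num": [(str.isdigit, "num"), (lambda ch: ch == "元", "done")],
    },
    start="start",
    accepting={"done"},
)


def tag_substrings(text, automata):
    """Return (start, end, substring, type) for every accepted substring."""
    candidates = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            for automaton in automata:
                if automaton.accepts(text[start:end]):
                    candidates.append(
                        (start, end, text[start:end], automaton.entity_type)
                    )
    return candidates


candidates = tag_substrings("2001年5月花了300元", [date_automaton, money_automaton])
```

In this sketch the candidate list includes `(0, 7, '2001年5月', 'DATE')` and `(9, 13, '300元', 'MONEY')`, while the characters that no automaton covers produce no typed candidate at all.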
Although such systems are able to segment text and identify word types at the same time, they are limited in a number of ways. In particular, finite state transducers are not able to accommodate many of the morphological changes that occur in Chinese text. As a result, the finite state transducers cannot handle every type of word found in Chinese text, and further processing to categorize unknown words is still required. Thus, a new method of segmenting non-segmented text is needed that can segment the text and identify word types in a single unified system while accommodating all of the different word types that can be found in unsegmented languages.