Since the advent of Internet technology, information growth has been explosive. Information retrieval, information analysis, and machine translation are important technologies for making effective use of this information. Automated Chinese language word segmentation is a fundamental technique for the processing of Chinese language information. One difficulty that influences the effectiveness of automated word segmentation is the recognition of unlisted words, that is, words yet to be recorded in the word segmentation dictionary. Unlisted words can be divided into two types. One type is words that cannot be listed in dictionaries in their entirety, but for which it is possible to summarize patterns (such as personal names, institutional names, etc.); the other type is newly appeared words that have yet to be listed in the dictionary. Among these newly appeared words, some are target words that should be listed in the segmentation dictionary, while others are not actually words; that is to say, they are non-target words that should not be listed in the dictionary.
When recognizing newly appeared words, a determination must first be made as to whether each newly appeared word is actually a word, specifically, whether it is a target word or not. Currently there are three typical approaches for making the determination: a rule-based method, a statistics-based method, and a method that combines rules and statistics. The most popular statistics-based method is generally to collect statistics with regard to one or several characteristic values of the words to be recognized based on large-scale text data, and, based on the statistical results, manually set a threshold value. When the characteristic value of a word to be recognized exceeds the established threshold value, the word is determined to be a target word.
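The conventional statistics-based determination described above can be sketched as follows. This is a minimal illustration only, under stated assumptions: the characteristic value is taken to be the raw occurrence count of a character n-gram, and the function names `count_candidates` and `select_target_words`, the sample texts, and the threshold are hypothetical and are not taken from any particular system.

```python
from collections import Counter


def count_candidates(texts, max_len=3):
    """Count every character n-gram of length 2..max_len as a candidate word.

    Assumption: the occurrence count serves as the single characteristic value.
    """
    counts = Counter()
    for text in texts:
        for n in range(2, max_len + 1):
            for i in range(len(text) - n + 1):
                counts[text[i:i + n]] += 1
    return counts


def select_target_words(counts, threshold):
    """Treat a candidate as a target word when its count exceeds the threshold."""
    return {cand for cand, freq in counts.items() if freq > threshold}


# Hypothetical product-header-like snippets used only for illustration.
texts = ["新款连衣裙", "连衣裙夏季", "夏季新款"]
counts = count_candidates(texts)
targets = select_target_words(counts, threshold=1)
```

In this toy input, fragments such as 连衣 and 衣裙 pass the manually set threshold together with the full word 连衣裙, which hints at how a single threshold on one characteristic value can admit non-target words.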
However, with the widespread use of the Internet, the volume of text data that appears online is very large, and much of this text lacks complete semantic sentence patterns and is merely an accumulation of keywords. For example, on e-commerce websites, and particularly on consumer-to-consumer (C2C) e-commerce websites, there can be a massive number of product headers. A large number of these keywords are newly appeared words, and the statistical distributions of the characteristic values of these newly appeared words tend to be non-linear. When recognition is performed, the results obtained by setting a single threshold value with regard to a characteristic value and then determining whether or not the newly appeared words are target words according to that single threshold value are often inaccurate. Thus, the conventional statistics-based method of deciding whether or not words to be recognized are target words is often not well suited for target word recognition in current network applications.
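The limitation of a single threshold can be illustrated with a small sketch. It assumes, purely for illustration, that "non-linear" means the target words do not all lie on one side of any cut point of the characteristic value. The sample values and labels are synthetic, and the function name `best_single_threshold_accuracy` is hypothetical.

```python
def best_single_threshold_accuracy(samples):
    """Return the best accuracy achievable by any rule 'value > t => target word'.

    samples: list of (characteristic_value, is_target) pairs.
    """
    values = sorted(v for v, _ in samples)
    # Try a threshold below every value, plus each observed value itself.
    candidates = [float("-inf")] + values
    best = 0.0
    for t in candidates:
        correct = sum((v > t) == is_target for v, is_target in samples)
        best = max(best, correct / len(samples))
    return best


# Synthetic data: target words sit at both low and high characteristic values,
# with non-target words in between.
samples = [(0.2, True), (0.5, False), (0.6, False), (0.9, True)]
best = best_single_threshold_accuracy(samples)
```

For this synthetic data, no single threshold classifies all four samples correctly; the best rule still misclassifies one of them.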