With the continuous development of science and technology, the volume of textual data generated by users has increased exponentially, and many neologisms such as “SARS” emerge in association with new articles, events, or experiences. The occurrence of the neologisms usually overwhelms some text information processing models, such as the models for a word segmentation process. A word segmentation process in Chinese natural language processing usually functions as a fundamental step. The quality or precision of a word segmentation result of a piece of text information affects subsequent text information processing tasks such as text classification, text clustering, and/or topic identification. In many applications, a neologism discovery process may be implemented to address the above-identified issues.
Usually, neologism discovery methods may be classified into statistics-based methods and rule-based methods. A statistics-based method is usually implemented based on generating statistic information using a hidden Markov model, maximum entropy, a support vector machine, and/or the like. The statistics information can be used for generating a word segmentation model. Moreover, a rule-based method is implemented based on deriving a set of rules from a template feature library and training textual data that includes labeled contextual features. The set of derived rules may correspond to word formation rules and can be applied to a piece of to-be-processed text information that may include a neologism.
The inventor of this application notes that all of the foregoing solutions include performing a word segmentation process on the textual data. The inventor notes that a neologism discovery method that is based on word segmentation usually includes performing, repetitively in interactions, combination of training textual data and a piece of to-be-processed text information, generation of updated training textual data for generating an updated word segmentation model, and discovery of a neologism. However, such iterative processes are complex and demanding on computational resources. In addition, at a neologism discovery stage, because a to-be-discovered neologism does not have a definite definition, a neologism discovery method may not be able to significantly improve its performance based on determining a boundary of the to-be-discovered neologism and/or relying on a known dictionary or rule.