With the rapid development of the Internet, the amount of information on the Internet is increasing exponentially. Given such a huge number of information resources, obtaining required information effectively and quickly is becoming increasingly important. To improve the quality and efficiency of information retrieval, many powerful information retrieval tools, such as search engines, have appeared one after another. Search engines bring great convenience to people; however, search technologies that use a keyword as the basic indexing unit have many defects. In one aspect, too many results are returned for any keyword submitted by a user, and only a small part of the returned information is what the user actually needs, so the user has to spend a considerable amount of time manually filtering the results. In another aspect, many documents related to a search topic may not include the keyword input by the user, so these documents cannot be found by a search engine.
Automatic classification of information is an effective approach to the foregoing problems. It can significantly alleviate the heterogeneity and disorder of information on the Internet, so as to narrow the search range, increase retrieval speed, and improve query results. Because most information on the Internet currently appears in the form of documents, such as electronic mail (e-mail), electronic magazines, technical reports, news, and online libraries, methods and technologies of document classification have become among the most important research focuses in the field of information retrieval.
A main task of document classification is to automatically determine the category associated with a document according to the document content. Currently, statistics-based and machine learning-based document classification methods are the most widely used. A common model of Bayesian classification, one of the statistics-based document classification methods, is the Latent Dirichlet Allocation model.
The Latent Dirichlet Allocation model is a generative probabilistic model and can be used to identify latent topic information in a large document set or corpus. The model uses a bag-of-words method, in which each document is regarded as a word frequency vector, so that document information is converted into numerical information that can be easily modeled. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words.
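As an illustration of the bag-of-words representation described above, the following minimal sketch converts tokenized documents into word frequency vectors. The function name `bag_of_words`, the whitespace tokenization, and the two sample documents are assumptions made for this example only; word order is discarded and only counts are kept.

```python
from collections import Counter


def bag_of_words(documents):
    """Convert tokenized documents into word frequency vectors.

    `documents` is a list of token lists. The vocabulary is the sorted
    set of all tokens, so every vector has the same length and layout.
    """
    vocabulary = sorted({word for doc in documents for word in doc})
    index = {word: i for i, word in enumerate(vocabulary)}
    vectors = []
    for doc in documents:
        vector = [0] * len(vocabulary)
        # Count each word once per document, ignoring word order.
        for word, count in Counter(doc).items():
            vector[index[word]] = count
        vectors.append(vector)
    return vocabulary, vectors


# Hypothetical sample documents, tokenized by whitespace.
docs = [
    "search engine returns search results".split(),
    "topic model finds latent topic".split(),
]
vocabulary, vectors = bag_of_words(docs)
```

Here `vectors[0]` holds the counts of each vocabulary word in the first document, so the repeated word "search" contributes a count of 2 at its vocabulary position.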
For each document in a corpus, the Latent Dirichlet Allocation model defines the following generation process:
(1) for each word position in the document, extracting a topic from the topic distribution of the document;
(2) extracting a word from the word distribution corresponding to the extracted topic; and
(3) repeating the process until every word in the document is traversed.
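The generation process above can be sketched in a few lines. This is a minimal illustration, assuming the topic distribution of the document and the word distribution of each topic are already given as lists of probabilities; the function name `generate_document`, the sample vocabulary, and the sample probabilities are hypothetical.

```python
import random


def generate_document(topic_dist, word_dists, vocabulary, num_words, rng):
    """Generate one document by repeated topic-then-word sampling.

    topic_dist : probabilities over topics for this document
    word_dists : word_dists[k] holds probabilities over `vocabulary` for topic k
    num_words  : number of words to generate
    rng        : a random.Random instance, passed in for reproducibility
    """
    words = []
    for _ in range(num_words):
        # Step (1): extract a topic from the topic distribution.
        topic = rng.choices(range(len(topic_dist)), weights=topic_dist, k=1)[0]
        # Step (2): extract a word from that topic's word distribution.
        word = rng.choices(vocabulary, weights=word_dists[topic], k=1)[0]
        words.append(word)
    # Step (3): the loop repeats until every word position is filled.
    return words


# Hypothetical example: two topics over a four-word vocabulary.
vocabulary = ["search", "engine", "topic", "model"]
word_dists = [
    [0.5, 0.5, 0.0, 0.0],  # topic 0 only produces retrieval words
    [0.0, 0.0, 0.5, 0.5],  # topic 1 only produces modeling words
]
document = generate_document([0.7, 0.3], word_dists, vocabulary, 10, random.Random(0))
```

The sketch samples from fixed distributions only; it does not draw those distributions from any prior.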
Referring to FIG. 1, in a more formal manner, each document in a corpus corresponds to one multinomial distribution over K topics (K is given in advance by repeated experiments or another method), and this multinomial distribution is denoted as θ. Each topic corresponds to one multinomial distribution over the V words in a vocabulary, and this multinomial distribution is denoted as Φ. θ and Φ have Dirichlet prior distributions with hyperparameters α and β, respectively. For each word in a document d, a topic Z is extracted from the multinomial distribution θ corresponding to the document, and then a word ω is extracted from the multinomial distribution Φ corresponding to the topic Z. The document d is generated by repeating the process Nd times, where Nd is the total number of words in the document d. In FIG. 1, a shaded circle represents an observable variable, an unshaded circle represents a latent variable, an arrow represents a conditional dependency between two variables, and a box represents repeated sampling, where the number of repetitions is shown at the lower right corner of the box.
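The dependencies described above can be summarized as a joint distribution over all variables. The following is a sketch of the standard factorization, under assumed notation that does not appear in the text: M denotes the number of documents in the corpus, θ_d the topic distribution of document d, Φ_k the word distribution of topic k, and Z_{d,n} and ω_{d,n} the topic and the word at position n of document d.

```latex
p(\omega, Z, \theta, \Phi \mid \alpha, \beta)
  = \prod_{k=1}^{K} p(\Phi_k \mid \beta)
    \prod_{d=1}^{M} \Bigl[ p(\theta_d \mid \alpha)
    \prod_{n=1}^{N_d} p(Z_{d,n} \mid \theta_d)\,
                      p(\omega_{d,n} \mid \Phi_{Z_{d,n}}) \Bigr]
```

Each factor corresponds to one arrow in the graphical model: the Dirichlet priors generate Φ_k and θ_d, θ_d generates each topic Z_{d,n}, and the selected topic's distribution generates the observed word ω_{d,n}.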
The Latent Dirichlet Allocation model has two parameters that need to be inferred. One is the “document-topic” distribution θ, and the other is the K “topic-word” distributions Φ. By learning these two parameters, the topics in which a document author is interested and the proportions of topics covered in each document can be known.
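As a rough illustration of what inferring the two parameters involves, the sketch below computes smoothed estimates of θ and Φ from count tables. It assumes that a topic has already been assigned to every word by some inference procedure, which the text does not specify, and that the Dirichlet priors are symmetric with scalar hyperparameters. The function name `estimate_theta_phi` and the sample counts are hypothetical.

```python
def estimate_theta_phi(doc_topic_counts, topic_word_counts, alpha, beta):
    """Estimate theta and phi from topic assignment counts.

    doc_topic_counts[d][k]  : number of words in document d assigned to topic k
    topic_word_counts[k][v] : number of times vocabulary word v is assigned to topic k
    alpha, beta             : symmetric Dirichlet hyperparameters (assumed scalars)
    """
    num_topics = len(topic_word_counts)
    vocab_size = len(topic_word_counts[0])
    # Document-topic distribution: smoothed share of each topic in each document.
    theta = [
        [(n + alpha) / (sum(row) + num_topics * alpha) for n in row]
        for row in doc_topic_counts
    ]
    # Topic-word distributions: smoothed share of each word in each topic.
    phi = [
        [(n + beta) / (sum(row) + vocab_size * beta) for n in row]
        for row in topic_word_counts
    ]
    return theta, phi


# Hypothetical counts: one document, two topics, three vocabulary words.
theta, phi = estimate_theta_phi([[3, 1]], [[2, 1, 0], [0, 0, 1]], alpha=0.5, beta=0.1)
```

Each row of `theta` gives the proportions of topics covered in one document, and each row of `phi` gives the word probabilities of one topic; both rows sum to 1 because the hyperparameters are added to every count before normalizing.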
In practice, it is found that, for large-scale document data, although the Latent Dirichlet Allocation model can model the correlation between topics and words in documents, the optimal number of topics K must be manually specified in advance according to experience, and the input number of topics K has a significant impact on the precision of the resulting model. Therefore, how to choose the optimal number of topics K is still a difficult problem.