This specification relates to document processing.
An electronic document can have one or more topics. A computer can automatically extract the one or more topics from the electronic document using a type of statistical model known as a topic model. An example topic model is latent Dirichlet allocation (LDA). According to LDA, a topic is a probability distribution over words. For example, a topic that assigns a specified probability to each of the words tabby, purr, and kitten can be a topic on “cat.” The computer can analyze the electronic document, including, for example, calculating the probability with which each of the words tabby, purr, and kitten occurs in the document. The calculated probability distribution can indicate a likelihood that the electronic document is associated with the topic “cat.” The topic itself is abstract; the word “cat” is an arbitrary label for it.
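The word-frequency calculation described above can be illustrated with a minimal sketch. The function name topic_word_frequencies, the tokenization rule, and the use of summed frequency as a score are assumptions made for illustration only; this is a crude proxy, not LDA inference.

```python
import re
from collections import Counter


def topic_word_frequencies(text, topic_words):
    """Return the relative frequency of each topic word in the text.

    Hypothetical helper: tokenization is a simple lowercase letter-run
    split, which is an assumption and not part of the specification.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        # An empty document has no word occurrences at all.
        return {w: 0.0 for w in topic_words}
    return {w: counts[w] / total for w in topic_words}


# Words that carry probability mass under the abstract topic labeled "cat".
cat_words = ["tabby", "purr", "kitten"]
freqs = topic_word_frequencies("The tabby kitten began to purr", cat_words)

# Crude association score (an assumption): the fraction of the
# document's words that belong to the topic's word list.
score = sum(freqs.values())
```

A higher score suggests, under these simplifying assumptions, that the document is more likely to be associated with the topic; an actual topic model would estimate this jointly across all topics and documents.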
In LDA, each document is modeled as a mixture of K topics, where each topic k is a multinomial distribution φ_k over a W-word vocabulary. For any document d_j, its topic mixture θ_j is a probability distribution drawn from a Dirichlet prior with parameter α. For the i-th word x_ij in d_j, a topic z_ij = k is drawn from θ_j, and the word x_ij is drawn from φ_k. The generative operations for LDA are thus given by

\[ \theta_j \sim \mathrm{Dir}(\alpha), \quad \phi_k \sim \mathrm{Dir}(\beta), \quad z_{ij} = k \sim \theta_j, \quad x_{ij} \sim \phi_k, \tag{1} \]

where Dir(·) denotes a Dirichlet distribution, and α and β are each the parameter of a Dirichlet prior.
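The generative operations in (1) can be sketched as a short sampling routine. This is a minimal illustration assuming NumPy, symmetric Dirichlet priors (scalar α and β), and a fixed document length; the function name generate_corpus and its parameters are hypothetical and do not come from the specification.

```python
import numpy as np


def generate_corpus(num_docs, doc_length, K, W, alpha, beta, seed=0):
    """Sample a toy corpus following the generative operations in (1).

    Assumes symmetric priors: alpha and beta are positive scalars.
    Words and topics are represented as integer indices.
    """
    rng = np.random.default_rng(seed)

    # phi_k ~ Dir(beta): one word distribution per topic, shape (K, W).
    phi = rng.dirichlet(np.full(W, beta), size=K)

    # theta_j ~ Dir(alpha): one topic mixture per document, shape (num_docs, K).
    theta = rng.dirichlet(np.full(K, alpha), size=num_docs)

    z = np.empty((num_docs, doc_length), dtype=int)
    x = np.empty((num_docs, doc_length), dtype=int)
    for j in range(num_docs):
        # z_ij = k ~ theta_j: choose a topic for each word position.
        z[j] = rng.choice(K, size=doc_length, p=theta[j])
        for i in range(doc_length):
            # x_ij ~ phi_k: choose a word from the selected topic.
            x[j, i] = rng.choice(W, p=phi[z[j, i]])
    return theta, phi, z, x


theta, phi, z, x = generate_corpus(
    num_docs=3, doc_length=20, K=4, W=10, alpha=0.5, beta=0.5
)
```

Each row of theta and phi sums to one, and every entry of x indexes a word in the W-word vocabulary. Topic extraction runs this process in reverse: given only x, it estimates θ, φ, and z.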