The following relates to the automated information management arts, automated document retrieval arts, automated document annotation or labeling arts, and related arts.
Document processing operations such as automated classification or topic labeling, document retrieval based on a query or representative document, or so forth, typically employ a so-called “bag of words” or BOW representation. The BOW representation is typically computed for documents of a set of documents, and is a vector in which each dimension corresponds to a particular term (i.e., word) occurring in the set of documents, and the value stored for each dimension corresponds to the frequency of occurrence of that word in the document. In some cases the frequency is normalized by the total number of words in the document (to reduce the effect of document length) and/or is scaled by a metric indicative of the frequency of occurrence of the word in the set of documents. An example of the latter is the TF-IDF representation, where IDF stands for “inverse document frequency” and is computed by a formula such as |D|/(Nw+1) where |D| is the number of documents in the set of documents, Nw is the number of those documents containing the word w, and the “+1” avoids division by zero in the case of a word occurring in none of the documents (optionally omitted if the BOW vector elements are limited to words that occur in at least one document).
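By way of non-limiting illustration, the BOW and TF-IDF computations described above might be sketched in Python as follows. The function names `bow_vectors` and `tfidf_vectors`, the whitespace tokenization, and the three-document sample set are assumptions introduced only for this example; the IDF factor uses the |D|/(Nw+1) form given above, without the logarithm that some implementations apply.

```python
from collections import Counter


def bow_vectors(docs):
    """Return (vocabulary, raw term-count vectors) for a list of tokenized documents."""
    vocab = sorted({word for doc in docs for word in doc})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = []
    for doc in docs:
        vec = [0] * len(vocab)
        for word, count in Counter(doc).items():
            vec[index[word]] = count
        vectors.append(vec)
    return vocab, vectors


def tfidf_vectors(docs):
    """Scale length-normalized term frequencies by IDF = |D| / (Nw + 1)."""
    vocab, counts = bow_vectors(docs)
    n_docs = len(docs)
    # Nw: number of documents in which each vocabulary word occurs at least once.
    doc_freq = [sum(1 for vec in counts if vec[i] > 0) for i in range(len(vocab))]
    idf = [n_docs / (nw + 1) for nw in doc_freq]
    weighted = []
    for doc, vec in zip(docs, counts):
        length = len(doc) or 1  # guard against an empty document
        weighted.append([(c / length) * idf[i] for i, c in enumerate(vec)])
    return vocab, weighted


# Hypothetical sample set of three short documents.
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a dog barked".split(),
]
vocab, counts = bow_vectors(docs)
_, weighted = tfidf_vectors(docs)
```

In this sketch the word "the" occurs in two of the three documents, so its IDF factor is 3/(2+1), whereas "cat" occurs in one document and receives the larger factor 3/(1+1).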
To translate the BOW representations into topical information, the BOW representations of the set of documents are typically modeled by a topical model. Two commonly used probabilistic topical models are probabilistic latent semantic analysis (PLSA) and Latent Dirichlet Allocation (LDA). In effect, these techniques define topics (sometimes called categories or classes) in which each topic is defined by a representative BOW vector. An input (e.g., query) document is then assigned in a probabilistic sense to various categories based on how closely the BOW vector of the input document matches the representative BOW vectors of the various topics.
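The probabilistic assignment of an input document to topics can be illustrated with a deliberately simplified sketch. The code below is not PLSA or LDA inference; it assumes a mixture-of-unigrams model in which each topic is a fixed word distribution, and computes the posterior probability of each topic given the word counts of the document. The function name `topic_posterior`, the four-word vocabulary, and the topic distributions are hypothetical values chosen for illustration.

```python
import math


def topic_posterior(counts, topic_word_probs, topic_priors):
    """Return P(topic | document) under a mixture-of-unigrams assumption.

    counts           -- BOW count vector of the input document
    topic_word_probs -- one word distribution per topic (assumed smoothed, all > 0)
    topic_priors     -- prior probability of each topic
    """
    log_scores = []
    for word_probs, prior in zip(topic_word_probs, topic_priors):
        log_p = math.log(prior)
        for count, prob in zip(counts, word_probs):
            if count:
                log_p += count * math.log(prob)
        log_scores.append(log_p)
    # Normalize in log space to avoid underflow on long documents.
    top = max(log_scores)
    weights = [math.exp(score - top) for score in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical vocabulary: ["ball", "goal", "vote", "law"]
topics = [
    [0.4, 0.4, 0.1, 0.1],  # a "sports"-like topic
    [0.1, 0.1, 0.4, 0.4],  # a "politics"-like topic
]
priors = [0.5, 0.5]
doc_counts = [2, 1, 0, 0]  # "ball" twice, "goal" once
posterior = topic_posterior(doc_counts, topics, priors)
```

For this input the first topic receives nearly all of the probability mass, since every word in the document is four times more likely under that topic than under the other.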
BOW representations modeled by a probabilistic topic model such as PLSA or LDA are widely employed in applications such as document annotation/archiving and document retrieval. For annotation purposes, an input document can be labeled with the topic or topics that most closely match the BOW vector of the input document. Document retrieval entails retrieving documents having the same or similar topic labels, or alternatively can operate directly on the BOW vectors (i.e., by retrieving documents whose BOW vectors are most similar to the BOW vector of the input document).
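Retrieval operating directly on the BOW vectors can be sketched as a similarity ranking. The example below assumes cosine similarity as the similarity measure, which is one common choice and is not specified in the text above; the names `cosine` and `retrieve` and the sample vectors are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve(query_vec, corpus_vecs, top_k=2):
    """Return indices of the top_k corpus documents most similar to the query."""
    scored = [(cosine(query_vec, vec), i) for i, vec in enumerate(corpus_vecs)]
    # Highest similarity first; ties are broken by the lower document index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for _, i in scored[:top_k]]


# Hypothetical BOW vectors over a shared four-word vocabulary.
corpus = [
    [2, 1, 0, 0],
    [0, 0, 3, 1],
    [1, 1, 0, 1],
]
query = [1, 1, 0, 0]
ranked = retrieve(query, corpus, top_k=2)
```

Here document 0 ranks first and document 2 second, while document 1 shares no words with the query and has zero similarity. The same ranking routine could be applied to topic-assignment vectors instead of raw BOW vectors when retrieval is based on topic labels.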