The following relates to the document or object processing arts, clustering arts, classification arts, retrieval arts, and so forth.
In document processing (or, more generally, object processing, where “object” is a general term intended to encompass text documents, images, audio/video content, and so forth), it is useful to generate a statistical topic model defining a set of topics. For text documents represented using “bag-of-words” representations, a topic of the topic model is suitably represented as a statistical distribution over words that are typical of the topic. A new document can be associated with topics with varying association strengths based on similarities between the topics and the distribution of words in the document. As another application, given an input document selected from a corpus of documents already modeled using the topic model, similar documents can be rapidly identified by comparing their topic association strengths.
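By way of illustrative example only, the association of a document with topics and the identification of similar documents might be sketched in Python as follows. The two topics, their word probabilities, the scoring rule (normalized word mass per topic), and the use of cosine similarity are all assumptions made for illustration; none of them is taken from the foregoing description.

```python
import math
from collections import Counter

# Hypothetical topics: each is a distribution over vocabulary words.
TOPICS = {
    "sports": {"game": 0.4, "team": 0.4, "score": 0.2},
    "finance": {"market": 0.5, "stock": 0.3, "score": 0.2},
}


def bag_of_words(text):
    """Represent a text document as word counts, ignoring word order."""
    return Counter(text.lower().split())


def topic_strengths(text):
    """Association strength per topic, normalized to sum to 1 when nonzero."""
    counts = bag_of_words(text)
    raw = {
        name: sum(dist.get(word, 0.0) * n for word, n in counts.items())
        for name, dist in TOPICS.items()
    }
    total = sum(raw.values())
    if total == 0:
        return {name: 0.0 for name in TOPICS}
    return {name: value / total for name, value in raw.items()}


def cosine(a, b):
    """Cosine similarity between two topic-strength vectors."""
    dot = sum(a[k] * b[k] for k in a)
    norm_a = math.sqrt(sum(x * x for x in a.values()))
    norm_b = math.sqrt(sum(x * x for x in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def most_similar(query_strengths, modeled_corpus):
    """Index of the modeled document whose topic strengths best match the query."""
    scores = [cosine(query_strengths, doc) for doc in modeled_corpus]
    return max(range(len(modeled_corpus)), key=scores.__getitem__)
```

In this sketch the corpus is modeled once, for instance as `modeled = [topic_strengths(d) for d in corpus]`, so that a later query compares only short topic-strength vectors rather than full word distributions.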
The word distributions of a text document can be considered features of the text document, and the topics of the topic model are statistical distributions of these features that are typical of the topics. For other types of objects, features of the objects are derived and topics of the topic model are generated as statistical distributions of features that are typical of the topic. As an example, an image can be characterized by visual features extracted from spatial regions, or “patches”, of the image.
Various approaches can be employed for generating the topic model. Non-negative matrix factorization techniques such as Latent Dirichlet Allocation (LDA) or probabilistic latent semantic analysis (PLSA) are known approaches, and have been used in applications such as text clustering, dimensionality reduction of large sparse arrays, and so forth. Underlying the LDA and PLSA models is the observation that a large matrix containing positive values can be approximated by a sum of rank-one positive matrices. Compared to more classical matrix factorization techniques that are rotationally invariant, such as Singular Value Decomposition, the low-rank matrices obtained by non-negative decompositions are often nearly sparse, i.e., they contain few large positive values and many small values close to zero. The large values correspond in general to clusters of rows and columns of the original matrices, and are identifiable with topics. These topic models can be formalized as generative models for large sparse positive matrices, e.g., large sets of documents. Topic models are typically used to organize these documents according to themes (that is, topics) in an unsupervised way, that is, without reliance upon document topic annotations or other a priori information about the topics. In this framework, topics are defined as discrete distributions over vocabulary words (for text documents; more generally, distributions over features of objects), and topics are associated with each document according to a relative weighting (proportions).
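By way of illustrative example only, the approximation of a non-negative matrix by a sum of rank-one non-negative matrices might be sketched in Python as follows. The sketch uses the well-known multiplicative update rules for minimizing squared reconstruction error, which is a swapped-in technique chosen for brevity and is neither LDA nor PLSA. The function names, the random initialization, the iteration count, and the small constant `eps` are assumptions made for illustration.

```python
import random


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def nmf(v, rank, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative v (n x m) as w (n x rank) times h (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    # Strictly positive starting values keep every later update non-negative.
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        # h <- h * (w^T v) / (w^T w h), element-wise
        wt = transpose(w)
        num = matmul(wt, v)
        den = matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # w <- w * (v h^T) / (w h h^T), element-wise
        ht = transpose(h)
        num = matmul(v, ht)
        den = matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


def frobenius_error(v, w, h):
    """Square root of the summed squared differences between v and w times h."""
    approx = matmul(w, h)
    return sum((a - b) ** 2
               for row_v, row_a in zip(v, approx)
               for a, b in zip(row_v, row_a)) ** 0.5
```

Under these assumptions, each row of `h` plays the role of a topic (a non-negative weighting over words or other features), and each row of `w` gives the weighting of those topics for one document. The product `w` times `h` equals the sum, over the `rank` components, of the rank-one matrix formed from one column of `w` and the corresponding row of `h`.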
The following sets forth improved methods and apparatuses.