The present invention relates to characterizing linked documents.
A fundamental problem in the information retrieval field is to characterize the content of documents. By capturing the essential characteristics of the documents, one gives the documents a new representation, which is often more parsimonious and less noise-sensitive. Such a representation facilitates efficient organizing and searching of the documents and therefore improves the user experience in an information retrieval system. The new representation may also benefit other tasks such as classifying, clustering, and visualizing the documents.
The need for characterizing document content has increased significantly due to the rapid development of the Internet, which has made huge document repositories (such as digital libraries) available online. Documents in many corpora, such as digital libraries and collections of webpages, typically contain both content and link information.
Among the existing methods that extract essential characteristics from documents, topic models play a central role. A topic model bridges the gap between documents and words by characterizing the content of documents in terms of a latent semantic space, enabling capabilities such as clustering, summarizing, and visualizing. A topic model also provides a meaningful interpretation of the documents through a probabilistic generative model, which associates each document with a set of topics through membership probabilities.
Topic models extract a set of latent topics from a corpus and consequently represent the documents in a new latent semantic space. This new semantic space captures the essential latent topics of each document and therefore enables efficient organization of the corpus for tasks such as browsing, clustering, and visualizing. One of the well-known topic models is the Probabilistic Latent Semantic Indexing (PLSI) model. In PLSI, each document is modeled as a probabilistic mixture of a set of topics. Another approach, called PHITS, uses a probabilistic model for the links which assumes a generative process for the links similar to that in PLSI. Thus, PHITS ignores the content of the documents and characterizes them by their links alone.
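The PLSI mixture assumption above, P(w|d) = Σ_z P(w|z) P(z|d), can be fit with the expectation-maximization (EM) algorithm. The following is a minimal illustrative sketch, not any particular prior implementation; the function name `plsi_em` and the use of numpy are assumptions for exposition.

```python
import numpy as np

def plsi_em(counts, n_topics, n_iters=50, seed=0):
    """Fit P(w|d) = sum_z P(w|z) P(z|d) by EM on a term-count matrix.

    counts: (n_docs, n_words) matrix of word counts n(d, w).
    Returns (p_w_z, p_z_d) with shapes (n_topics, n_words) and
    (n_docs, n_topics); rows of each are probability distributions.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random initialization of the two conditional distributions.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iters):
        # E-step: responsibilities P(z | d, w) for every (d, w) pair.
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]      # (D, K, W)
        joint /= joint.sum(axis=1, keepdims=True) + 1e-12
        # M-step: reweight the responsibilities by observed counts.
        weighted = counts[:, None, :] * joint              # n(d,w) P(z|d,w)
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d
```

On a toy corpus whose documents use two disjoint vocabularies, the recovered P(z|d) rows assign documents sharing a vocabulary to the same dominant topic.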
A Latent Dirichlet Allocation (LDA) model has been used which incorporates a prior on the topic distributions of the documents. In these probabilistic topic models, one assumption underpinning the generative process is that the documents are independent. More specifically, it is assumed that the topic distribution of each document is independent of those of other documents. However, this assumption does not always hold in practice, because documents in a corpus are usually related to each other in certain ways. Often, one can explicitly or implicitly observe such relations in a corpus, e.g., through the citations and co-authors of a paper or through content similarity among documents. In such cases, these observations should be incorporated into the topic model in some way in order to derive more accurate latent topics that reflect the relations among the documents well. The LDA model is a parametric empirical Bayes model which introduces a Dirichlet prior for the topic distributions of the documents. One difficulty in LDA is that the posterior distribution is intractable for exact inference, and thus an approximate inference algorithm has to be used. Introduction of the prior makes it possible to generate new documents that are not available in the training stage, but the approximate inference algorithm is slower than PLSI in practice, which can be an issue for large corpora. The author-topic model has been used to extend LDA by including author information. Specifically, the author-topic model treats the topic distribution of a document as a mixture of the topic distributions of its authors. Consequently, the author-topic model implicitly considers the relations among documents through their authors. Similar to the author-topic model, the CT model and the BPT model explicitly consider the relations among the documents by modeling the topic distribution of each document as a mixture of the topic distributions of the related documents.
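For reference, the generative process that LDA assumes for the n-th word of document d can be written, in standard notation with Dirichlet hyperparameter α and per-topic word distributions β_z, as:

```latex
\theta_d \sim \mathrm{Dirichlet}(\alpha), \qquad
z_{d,n} \mid \theta_d \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \mid z_{d,n} \sim \mathrm{Multinomial}(\beta_{z_{d,n}}).
```

The topic proportions θ_d are drawn independently for each document, which is exactly the document-independence assumption discussed above.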
With the development of the Internet, most webpages and documents are linked to each other by hyperlinks. Incorporating link information into the topic model is expected to provide better document modeling. Recent studies have attempted to combine both the contents and the links in a corpus of linked documents. For example, the PLSI model has been applied separately to contents and to links, with the two resulting models combined in a linear fashion. As another example, the contents and links have been fused into a single objective function for optimization. However, these approaches treat links as just another feature, in a similar way to the content features. Such a yet-another-feature treatment of links ignores two important properties of links. First, links are used to represent relations; and it is the relations represented by the links, not the links themselves, that are important to a topic model. The second property of links that is ignored by the above studies is that the relations represented by the links are often transitive.
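As an illustrative sketch of such a linearly combined objective (the mixing weight α ∈ [0,1] and the notation are assumptions for exposition, not the exact formulation of any one prior study), the content and link log-likelihoods can be fused as:

```latex
\mathcal{L} \;=\; \alpha \sum_{d,w} n(d,w)\,\log \sum_{z} P(w \mid z)\,P(z \mid d)
\;+\; (1-\alpha) \sum_{d,c} a(d,c)\,\log \sum_{z} P(c \mid z)\,P(z \mid d),
```

where n(d,w) is the count of word w in document d and a(d,c) indicates a link from d to cited document c. The link term has the same form as the content term, i.e., each link is treated as one more observed feature of d, which is why the relational and transitive character of links is lost in this formulation.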
In another trend, document clustering serves as a fundamental tool for these organizing tasks and is an important application of topic models. K-means clustering is a widely used clustering algorithm which minimizes the sum of squared errors between the documents and the cluster centers. Spectral clustering has emerged as one of the most effective document clustering methods. In spectral clustering, an undirected graph is constructed in which nodes represent the documents and edges between nodes represent the similarity between the documents. The document clustering task is accomplished by finding the best cuts of the graph that optimize certain predefined criterion functions, which usually leads to computing the eigenvectors of certain matrices. Another important class of document clustering methods relies on the non-negative matrix factorization (NMF) technique.