1. Field
The present invention relates to techniques for modeling textual documents. More specifically, the present invention relates to techniques for selecting links in a probabilistic generative model for textual documents, wherein the model characterizes textual documents based on clusters of conceptually related words.
2. Related Art
Processing text in a way that captures its underlying meaning (its semantics) is an often-performed but poorly understood task. This function is most often performed in the context of search engines, which attempt to match documents in some repository to queries by users. It is sometimes also used by other library-like sources of information, for example to find documents with similar content. In general, understanding the semantics of text is an extremely useful capability in such systems. Unfortunately, most systems written in the past have only a rudimentary understanding of text, focusing on the words used in the text rather than the underlying meaning behind them.
One existing system captures some of this underlying meaning by characterizing a document with respect to clusters of conceptually related words. For example, see U.S. Pat. No. 7,231,393 entitled “Method and Apparatus for Learning a Probabilistic Generative Model for Text,” by inventors Georges Harik and Noam Shazeer. This system builds and maintains a model which uses clusters of conceptually related words to capture a significant amount of semantic meaning within text. More specifically, the model contains terminal nodes representing words and cluster nodes representing clusters of conceptually related words. In this model, nodes are coupled together by weighted links, wherein if a node fires, a link from the node to another node causes the other node to fire with a probability proportional to the weight of the link.
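By way of illustration only, the firing behavior of weighted links described above can be sketched as follows. This is a minimal sketch and not the implementation of the referenced patent: the dictionary layout, the function name `fire`, the node names, and the simplifying assumption that each link weight is itself a firing probability in the range 0 to 1 are all hypothetical.

```python
import random

def fire(links, source, rng):
    """Return the nodes that fire when `source` fires.

    `links` maps a source node to a dict of {target node: weight}.
    Assumption: each weight is used directly as the probability that
    the target fires, so it must lie in the range [0, 1].
    """
    fired = []
    for target, weight in links.get(source, {}).items():
        # rng.random() is uniform on [0, 1), so a weight of 1.0 always
        # fires the target and a weight of 0.0 never does.
        if rng.random() < weight:
            fired.append(target)
    return fired

# Hypothetical cluster node linked to three terminal (word) nodes.
links = {"cluster_pets": {"dog": 1.0, "cat": 0.5, "tax": 0.0}}
rng = random.Random(0)  # seeded so that a run is repeatable
fired_words = fire(links, "cluster_pets", rng)
```

In this sketch a cluster node with no outgoing links simply fires nothing, and the same function could be applied to links between cluster nodes.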
In this type of model, it is desirable to keep track of the probability of every word given every cluster. Unfortunately, this can generate too many links to fit into main memory. For example, if there exist millions of words and hundreds of thousands of clusters, the system must keep track of hundreds of billions of links, and it is not practical to store this many links in main memory. It is therefore desirable to be able to selectively retain the most important links and delete the others. Unfortunately, existing systems provide no satisfactory mechanisms for automatically selecting links to include in such a model.
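The scale of the problem can be checked with simple arithmetic. The following sketch assumes, purely for illustration, 2 million words, 200,000 clusters, and 4 bytes of storage per link weight; none of these specific figures comes from the description above, which states only orders of magnitude.

```python
def dense_link_count(num_words, num_clusters):
    """Number of links if every cluster is linked to every word."""
    return num_words * num_clusters

def dense_link_bytes(num_words, num_clusters, bytes_per_link=4):
    """Storage needed for the weights alone, ignoring any index overhead."""
    return dense_link_count(num_words, num_clusters) * bytes_per_link

# Hypothetical sizes: "millions" of words, "hundreds of thousands" of clusters.
num_words = 2_000_000
num_clusters = 200_000

total_links = dense_link_count(num_words, num_clusters)   # 400 billion links
total_bytes = dense_link_bytes(num_words, num_clusters)   # 1.6e12 bytes
```

Under these assumed figures, the weights alone would occupy roughly 1.6 terabytes, which illustrates why retaining every word-cluster link in main memory is impractical.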
Hence, what is needed is a method and an apparatus that facilitate automatically selecting links to be included in a generative model for textual documents.