1. Field of the Invention
The present invention relates to techniques for modeling textual documents. More specifically, the present invention relates to a technique for selectively deleting clusters of conceptually related words from a probabilistic generative model for textual documents, wherein the model characterizes textual documents based on clusters of conceptually related words.
2. Related Art
Processing text in a way that captures its underlying meaning (its semantics) is an often-performed but poorly understood task. It is most often performed in the context of search engines, which attempt to match documents in some repository to user queries. It is sometimes also used by other library-like sources of information, for example to find documents with similar content. In general, understanding the semantics of text is extremely useful in such systems. Unfortunately, most systems written in the past have only a rudimentary understanding of text, focusing only on the words used in the text and not on the meaning behind them.
As an example, consider the actions of a user interested in finding a cooking class in Palo Alto, California. This user might type into a popular search engine the set of words “cooking classes palo alto”. The search engine then typically looks for those words on web pages, and combines that information with other information about such pages to return candidate results to the user. Currently, if a document contains the words “cooking class palo alto”, several of the leading search engines will not find it, because they do not know that the words “class” and “classes” are related, one being a subpart (a stem) of the other.
Prototype systems with stemming components have been attempted, but without any real success. This is because the problem of determining whether a stem can be used in a particular context is difficult. The answer may be determined more by other nearby words in the text than by the word to be stemmed itself. For example, if one were looking for the James Bond movie “For Your Eyes Only”, a result that returned a page with the words “for your eye only” might not look as good.
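The matching problem described above can be illustrated with a minimal sketch. The suffix-stripping rule and the function names `naive_stem`, `exact_match`, and `stemmed_match` are assumptions made up for this illustration; they are not taken from any particular search engine or from the system described herein.

```python
def naive_stem(word):
    """Hypothetical stemming rule for illustration only: strip a plural-looking suffix."""
    w = word.lower()
    if w.endswith("sses"):
        return w[:-2]          # "classes" -> "class"
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]          # "eyes" -> "eye"
    return w


def exact_match(query, document):
    """True if every query word appears verbatim in the document."""
    doc_words = set(document.lower().split())
    return all(q in doc_words for q in query.lower().split())


def stemmed_match(query, document):
    """True if every stemmed query word appears among the stemmed document words."""
    doc_stems = {naive_stem(w) for w in document.split()}
    return all(naive_stem(q) in doc_stems for q in query.split())


# Word-for-word matching misses the related form "class".
print(exact_match("cooking classes palo alto", "cooking class palo alto"))
# Context-free stemming recovers that match ...
print(stemmed_match("cooking classes palo alto", "cooking class palo alto"))
# ... but it also treats the movie title and the altered phrase as equivalent.
print(stemmed_match("for your eyes only", "for your eye only"))
```

The last call shows why a rule applied to each word in isolation is insufficient: nothing in the rule looks at the neighboring words to decide whether stemming is appropriate.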
One existing system characterizes a document with respect to clusters of conceptually related words. For example, see U.S. patent application Ser. No. 10/676,571 entitled, “Method and Apparatus for Characterizing Documents based on Clusters of Related Words,” by inventors George Harik and Noam Shazeer, filed 30 Sep. 2003. This system uses clusters of conceptually related words to capture a significant amount of semantic meaning within text.
These clusters are formed during a training phase, which considers a large number of documents while forming a generative model for the text. However, overfitting commonly occurs during the training phase, which leads to clusters that have just a few words. Such small clusters do not generalize well and hence are not useful for capturing semantic meaning. Consequently, the presence of such clusters in the generative model reduces processing efficiency and consumes memory without providing any benefits. These small clusters can also adversely affect the quality of clusters returned by the model by preventing good clusters from being activated.
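As a rough illustration of the problem, the sketch below filters a toy model by cluster size. The dictionary representation, the function name `prune_small_clusters`, and the `min_words` threshold are all assumptions made for this example; a simple word count is used only as a stand-in and is not presented as the selection criterion of the technique described herein.

```python
def prune_small_clusters(clusters, min_words=3):
    """Split clusters into kept and deleted sets using a word-count threshold.

    `clusters` maps a cluster name to the set of words it contains.
    The threshold is a hypothetical criterion chosen for illustration.
    """
    kept = {}
    deleted = {}
    for name, words in clusters.items():
        if len(words) >= min_words:
            kept[name] = words
        else:
            deleted[name] = words
    return kept, deleted


# Toy model: two broad clusters and one tiny cluster of the kind
# that overfitting during training can produce.
model = {
    "cooking": {"cooking", "class", "classes", "recipe", "chef"},
    "movies": {"movie", "film", "actor", "director"},
    "overfit": {"xqz", "foo"},
}

kept, deleted = prune_small_clusters(model, min_words=3)
print(sorted(kept))     # names of clusters retained in the model
print(sorted(deleted))  # names of clusters removed from the model
```

Removing the tiny cluster shrinks the model, which reflects the memory and efficiency concern raised above.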
Hence, what is needed is a method and an apparatus that facilitate selectively deleting less-useful clusters from such a generative model for text.