Field of the Invention
The present invention relates to techniques for modeling textual documents. More specifically, the present invention relates to a technique for selectively merging clusters of conceptually related words in a probabilistic generative model for textual documents, wherein the model characterizes textual documents based on clusters of conceptually related words.
Related Art
Processing text in a way that captures its underlying meaning, its semantics, is an often-performed but poorly understood task. This function is most often performed in the context of search engines, which attempt to match documents in some repository to queries by users. It is sometimes also used by other library-like sources of information, for example to find documents with similar content. In general, understanding the semantics of text is extremely useful in such systems. Unfortunately, most systems written in the past have only a rudimentary understanding of text, focusing only on the words used in the text and not on the meaning behind them.
As an example, consider the actions of a user interested in finding a cooking class in Palo Alto, California. This user might type into a popular search engine the set of words “cooking classes palo alto”. The search engine then typically looks for those words on web pages, and combines that information with other information about such pages to return candidate results to the user. Currently, if a document contains the words “cooking class palo alto”, several of the leading search engines will not find it, because they do not recognize that the words “class” and “classes” are related, one being a subpart (a stem) of the other.
Prototype systems with stemming components have been attempted, but without any real success. This is because the problem of determining whether a stem can be used in a particular context is difficult. The answer may be determined more by other nearby words in the text than by the word to be stemmed itself. For example, if one were looking for the James Bond movie “For Your Eyes Only”, a result that returned a page with the words “for your eye only” might not be as good a match.
One existing system characterizes a document with respect to clusters of conceptually related words. For example, see U.S. patent application Ser. No. 10/676,571 entitled, “Method and Apparatus for Characterizing Documents based on Clusters of Related Words,” by inventors George Harik and Noam Shazeer, filed 30 Sep. 2003. This system uses clusters of conceptually related words to capture a significant amount of semantic meaning within text.
These clusters are formed during a training phase, which uses a large number of documents to form a generative model for the text in these documents. During this training process, it is common for separate clusters to form for similar topics. For example, separate clusters may form for “George Bush jokes” and “George Bush memorabilia.” It is desirable to merge such similar clusters into a combined cluster that generalizes better. Unfortunately, existing systems provide no automated mechanism for merging such similar clusters.
Hence, what is needed is a method and an apparatus that facilitate automatically merging similar clusters of conceptually related words in a generative model for textual documents.