Rapid growth of the World Wide Web has caused an explosion of research aimed at facilitating retrieval, browsing and organization of on-line text documents. Much of this work was directed towards clustering documents into meaningful groups. Often, given a set or hierarchy of document clusters, a user would prefer to quickly browse through the collection to identify clusters without examining particular documents in detail.
The World Wide Web contains a large number of communities of related documents, such as the biology community, or the community of ISP homepages. The present invention is a method for automatically inferring useful hierarchical information about any single community in isolation.
Starting with a set of documents, it is desirable to automatically infer various useful pieces of information about the set. The information might include a descriptive name or a related concept (sometimes not explicitly contained in the documents). Such information has utility for searching or analysis purposes.
Clustering may be defined as the process of organizing objects into groups whose members are similar in some way. There are two major styles of clustering: “partitioning” (often called k-clustering), in which every object is assigned to exactly one group, and “hierarchical clustering”, in which each group of size greater than one may in turn be composed of smaller groups. The advent of World Wide Web search engines, and specifically the problem of organizing the large amount of data available, together with the concept of “data mining” massive databases, has led to renewed interest in clustering algorithms.
The present invention provides a method that identifies meaningful classes of features in order to promote understanding of a set or cluster of documents. Preferably, there are three classes of features. “Self” features or terms describe the cluster as a whole. “Parent” features or terms describe more general concepts. “Child” features or terms describe specializations of the cluster. For example, given a set of biology documents, a parent term may be science, a self term may be biology, and a child term may be genetics.
The self features can be used as a recommended name for a cluster, while parents and children can be used to place the clusters in the space of a larger collection. Parent features suggest a more general concept, while child features suggest concepts that describe a specialization of the self feature(s).
Automatic discovery of parent, self and child features can be useful for several purposes including automatic labeling of web directories or improving information retrieval. Another important use is automatically naming generated clusters, as well as recommending both more general and more specific concepts contained in the clusters, using only the summary statistics of a single cluster and background collection statistics.
Currently, popular web directories such as Yahoo (http://www.yahoo.com/) or the Open Directory (http://www.dmoz.org/) are human generated and human maintained. Even when categories are defined by humans, automatic hierarchical descriptions can be useful to recommend new parent or child links, or alternative names. The same technology can be useful to improve information retrieval by recommending alternative queries (both more general and more specific queries) based on a retrieved set of documents or pages.
There is a body of previous work related to automatic summarization. For example, Radev and Fan in “Automatic summarization of search engine hit lists”, in Proceedings of ACL'2000 Workshop on Recent Advances in Natural Language Processing and Information Retrieval, Hong Kong, P. R. China, 2000, describe a technique for summarization of a cluster of web documents. Their technique parses the documents into individual sentences and identifies themes or “the most salient passages from the selected documents.” This technique uses “centroid-based summarization” and does not produce sets of hierarchically related features or discover words or phrases not in the cluster.
Lexical techniques have been applied to infer various concept relations from text; see, for example, Marti A. Hearst, “Automatic acquisition of hyponyms from large text corpora”, in Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France (1992); Marti A. Hearst, “Automated discovery of WordNet relations”, in Christiane Fellbaum (ed.), WordNet: An Electronic Lexical Database, MIT Press (1998); and Sharon A. Caraballo, “Automatic construction of a hypernym-labeled noun hierarchy from text”, in Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics (1999). Hearst describes a method for finding lexical relations by identifying a set of lexico-syntactic patterns, such as a comma-separated list of noun phrases, e.g. “bruises, wounds, broken bones or other injuries.” These patterns are used to suggest types of lexical relationships; for example, bruises, wounds and broken bones are all types of injuries. Caraballo describes a technique for automatically constructing a hypernym-labeled noun hierarchy. Word A is a hypernym of word B if “native speakers of English accept the sentence ‘B is a (kind of) A’.” Linguistic relationships such as those described by Hearst and Caraballo are useful for generating thesauri, but do not necessarily describe the relationship of a cluster of documents to the rest of a collection. Knowing that “baseball is a sport” may be useful if a given cluster is known to be focused on sports. However, the extracted relationships do not necessarily relate to the actual frequency of the concepts in the set. If a cluster of sports documents primarily discusses basketball and hockey, the fact that baseball is also a sport is not as important for describing that set.
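The kind of lexico-syntactic pattern attributed to Hearst above can be illustrated with a short Python sketch. This is not Hearst's implementation: the regular expression, the name `PATTERN` and the function `hyponyms` are assumptions made for illustration, and a practical system would use noun-phrase chunking rather than a bare regular expression.

```python
import re

# Hypothetical pattern for "NP, NP, ... NP or|and other NP".
# Noun phrases are approximated as runs of word characters and spaces,
# which is an assumption; real systems would use a noun-phrase chunker.
PATTERN = re.compile(r"((?:[\w ]+, )*[\w ]+) (?:or|and) other ([\w ]+)")

def hyponyms(text):
    """Return (hypernym, [hyponyms]) for the first match, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    items = [item.strip() for item in match.group(1).split(",")]
    return match.group(2).strip(), items

result = hyponyms("bruises, wounds, broken bones or other injuries")
print(result)  # ('injuries', ['bruises', 'wounds', 'broken bones'])
```

Note that the extracted relation says nothing about how often each term occurs in a given cluster, which is the limitation discussed above.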
Sanderson and Croft in “Deriving concept hierarchies from text”, in Research and Development in Information Retrieval, pages 206–213 (1999) presented a statistical technique based on subsumption relations. In their model, for two terms x and y, x is said to subsume y if the probability of x given y is 1 and the probability of y given x is less than 1. In the actual model, the threshold on the probability of x given y was relaxed from 1 to 0.8 to reduce noise. A subsumption relationship is suggestive of a parent-child relationship (in the present invention a self-child relationship). This allows a hierarchy to be created in the context of a given cluster. In contrast, the present invention focuses on specific general regions of features identified as “parents” (more general than the common theme), “selfs” (features that define or describe the cluster as a whole) and “children” (features that describe the common sub-concepts).
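The subsumption test described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: probabilities are estimated from document co-occurrence counts, each document is represented as a set of terms, and the function name `subsumes` and the sample documents are hypothetical.

```python
def subsumes(x, y, docs, threshold=0.8):
    """Return True if term x subsumes term y.

    docs is a list of term sets. P(x|y) and P(y|x) are estimated from
    document counts (an assumption about how the probabilities are obtained).
    """
    n_x = sum(1 for d in docs if x in d)
    n_y = sum(1 for d in docs if y in d)
    n_xy = sum(1 for d in docs if x in d and y in d)
    if n_x == 0 or n_y == 0:
        return False
    p_x_given_y = n_xy / n_y
    p_y_given_x = n_xy / n_x
    # x subsumes y: x almost always appears where y does, but not vice versa.
    return p_x_given_y >= threshold and p_y_given_x < 1.0

docs = [
    {"biology", "genetics"},
    {"biology", "genetics"},
    {"biology", "botany"},
    {"biology", "evolution"},
]
print(subsumes("biology", "genetics", docs))  # True
print(subsumes("genetics", "biology", docs))  # False
```

Here “biology” appears in every document that mentions “genetics”, but not the reverse, so “biology” is suggested as the more general term.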
Popescul and Ungar in “Automatic labeling of document clusters”, an unpublished manuscript available at http://citeseer.nj.nec.com/popsecu100automatic.html, describe a simple statistical technique using χ² for automatically labeling document clusters. Each (stemmed) feature was assigned a score based on the product of local frequency and predictiveness. The concept of a good cluster label is similar to the present notion of “self features”. A good self feature is one that is both common in the positive set and rare in the negative set, which corresponds to high local frequency and high predictiveness. In contrast to their work, the present invention considers features that may not be good names, but which promote understanding of a cluster (the parent and child features).
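A score of the form "local frequency times predictiveness" can be sketched as below. The text above does not define predictiveness, so this sketch assumes it is the ratio of a feature's frequency in the cluster to its frequency in the whole collection; that definition, the function name `label_score` and the sample counts are all illustrative assumptions rather than the cited authors' formulation.

```python
def label_score(pos_with, pos_total, all_with, all_total):
    """Score a candidate label as local frequency times predictiveness.

    pos_with / pos_total: documents in the cluster containing the feature.
    all_with / all_total: documents in the whole collection containing it.
    Predictiveness is ASSUMED here to be local frequency / global frequency.
    """
    local_freq = pos_with / pos_total
    global_freq = all_with / all_total
    if global_freq == 0:
        return 0.0
    predictiveness = local_freq / global_freq
    return local_freq * predictiveness

# A feature common in the cluster and rare elsewhere scores highly.
rare_elsewhere = label_score(18, 20, 100, 10000)
# A feature equally common everywhere scores low.
common_everywhere = label_score(10, 20, 5000, 10000)
print(rare_elsewhere > common_everywhere)  # True
```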
Eric J. Glover et al. in “Using web structure for classifying and describing web pages” in Proceedings of the 11th WWW Conference, Hawaii (2002) describe how ranking features by expected entropy loss can be used to identify good candidates for self names or parent or child concepts. Features that are common in the positive set, and rare in the negative set make good selfs and children, and also demonstrate high expected entropy loss. Parents are also relatively rare in the negative set, and common in the positive set, and are also likely to have high expected entropy loss. The present invention focuses on separating out the different classes of features by considering the specific positive and negative frequencies, as opposed to ranking by a single entropy-based measure.
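For reference, a ranking by expected entropy loss can be sketched as follows. This assumes the usual definition, in which expected entropy loss is the prior entropy of the positive/negative class split minus the expected entropy after observing whether a binary feature is present (equivalent to information gain). The function names and counts are hypothetical.

```python
from math import log2

def entropy(p):
    """Binary entropy in bits of a class with probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def expected_entropy_loss(pos_with, pos_total, neg_with, neg_total):
    """Prior class entropy minus expected entropy given feature presence."""
    total = pos_total + neg_total
    with_f = pos_with + neg_with
    without_f = total - with_f
    prior = entropy(pos_total / total)
    h_with = entropy(pos_with / with_f) if with_f else 0.0
    h_without = entropy((pos_total - pos_with) / without_f) if without_f else 0.0
    posterior = (with_f / total) * h_with + (without_f / total) * h_without
    return prior - posterior

# A feature present in every positive and no negative document.
print(expected_entropy_loss(10, 10, 0, 10))  # 1.0
# A feature equally frequent in both sets carries no information.
print(expected_entropy_loss(5, 10, 5, 10))   # 0.0
```

Because the result is a single number, a high score alone does not indicate whether a feature is a parent, self or child, which is why the positive and negative frequencies are considered separately.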
Another approach to analyzing a single cluster is to divide the cluster into sub-clusters to form a hierarchy of clusters. D. Fasulo in “An analysis of recent work on clustering algorithms”, Technical Report, University of Washington, (1999) available at http://citeseer.nj.nec.com/fasulo99analysi.html provides a summary of a variety of techniques for clustering (and hierarchical clustering) of documents. Kumar et al. in “Trawling the web for emerging cyber-communities” WWW8/Computer Networks, 31 (11–16): 1481–1493 (1999) describe specifically analyzing the web for communities, using the link structure of the web to determine the clusters. Hofmann and Puzicha in “Statistical models for co-occurrence data” Technical Report AIM-1625 (1998) describe several statistical models for co-occurrence data and relevant hierarchical clustering algorithms. They specifically address the Information Retrieval issues and term relationships.
The following example will clarify the difference between the present invention and prior hierarchical clustering work. Suppose a user performs a web search for “biology” and retrieves 20 documents, all of them general biology “hub” pages. The pages are similar in that none of them focuses on a specific aspect of biology. Hierarchical clustering would divide the 20 documents into sub-clusters, where each sub-cluster would represent the “children” concepts. The topmost cluster could arguably be considered the “self” cluster. However, given the sub-clusters, there is no easy way to discern which features (words or phrases) are meaningful names. Is “botany” a better name for a sub-cluster than “university”? In addition, given a group of similar documents, the clustering may not be meaningful. The sub-clusters could focus on irrelevant aspects, such as the fact that half of the documents contain the phrase “copyright 2002” while the other half do not. This is especially difficult for web pages that are lacking in textual content, e.g. a “welcome page”, or if some of the pages are of mixed topic (even though the cluster as a whole is primarily about biology).
In accordance with the teachings of the present invention, the set of 20 documents would be analyzed (considering the web structure to handle non-descriptive pages) and a histogram summarizing the occurrence of each feature would be generated (the word frequencies within individual documents would be discarded). As used herein, a feature refers to any term or n-gram (single word or phrase). A feature can also be structural information, general properties of a document, or other meaningful descriptions. Structural information may include a word or phrase in the title of a document, a word or phrase in the metatags of a document, and the like. General properties of a document may include factors such as “this is a recent document” or document classifications, such as “news” or “home page”. Such features are typically binary. Comparing the features in the generated histogram with the features in a histogram of all documents (or some larger reference collection) identifies “biology” as the “best” name for the cluster and “science” as a term that describes a more general concept. Likewise, several different “types” of biology would be identified, even though there may be no documents in the set that would form a cluster about the different types. Examples are “botany”, “cell biology”, “evolution”, and the like. Phrases such as “copyright 2002” would be known to be unimportant because of their frequency in the larger collection. In addition, the use of web structure (extended anchortext, which is described below) can significantly improve the ability to name small sets of documents compared to using only the document full text, thereby addressing the problem of non-descriptive pages, for example, “welcome pages”. The histogram of the collection set of documents, once created, can be used in conjunction with any positive set of documents, so long as the collection set is unchanged. That is, the histogram of the collection set may be reused for many different positive sets of documents, rather than being regenerated for each positive set.
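The histogram comparison described above can be sketched in Python. This is a minimal illustration, not the claimed method: it assumes each document is given as a list of features, counts only whether a feature occurs in a document, and uses hypothetical names (`doc_frequency`, `compare`) and toy data.

```python
from collections import Counter

def doc_frequency(docs):
    """Count the number of documents in which each feature occurs.

    Repeated occurrences within one document are collapsed, so
    within-document word frequencies are discarded.
    """
    hist = Counter()
    for features in docs:
        hist.update(set(features))
    return hist

def compare(positive_docs, collection_hist, collection_size):
    """Map each feature to (fraction of positive docs, fraction of collection)."""
    pos_hist = doc_frequency(positive_docs)
    n_pos = len(positive_docs)
    return {
        f: (count / n_pos, collection_hist[f] / collection_size)
        for f, count in pos_hist.items()
    }

# The collection histogram is built once and reused for any positive set.
collection = [
    ["biology", "science", "copyright 2002"],
    ["science", "copyright 2002"],
    ["sports", "copyright 2002"],
    ["news"],
]
collection_hist = doc_frequency(collection)

positive = [
    ["biology", "biology", "science", "copyright 2002"],
    ["biology", "botany"],
]
freqs = compare(positive, collection_hist, len(collection))
print(freqs["biology"])         # (1.0, 0.25)
print(freqs["copyright 2002"])  # (0.5, 0.75)
```

In this toy data, “biology” is far more frequent in the positive set than in the collection, while “copyright 2002” is common everywhere and can therefore be discounted.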
The present invention provides a method of obtaining a statistical model for predicting parent, child and self features for a relatively small cluster of documents.