The present invention relates generally to generating graph taxonomies and to making content-based recommendations. In particular, related information is classified using a directed acyclic graph. Furthermore, the present invention relates to an automated system and method for generating a graph taxonomy and for recommending to a user a group of documents in a subject area which is related to a document given by the user.
The increased capability to store vast amounts of information has led to a need for efficient techniques for searching for and retrieving information. For example, much information may be found in various databases and on the World Wide Web. Often information may be preprocessed and organized in order to provide users quicker access to relevant documents or data records. In particular, searching for and retrieving information may be facilitated by grouping similar data objects into clusters. Further, groups of similar data objects or clusters may be arranged in a hierarchy. Thus, a hierarchy of clusters may form an abstract representation of stored information.
Electronic documents, for example, may be represented by a tree hierarchy. Each node of the tree hierarchy may represent a cluster of electronic documents, such as, for example, a group of Web pages. Edges connecting nodes of the tree hierarchy may represent a relationship between nodes. Each node in the tree may be labeled with a subject category. Edges of the tree connect higher level nodes or parent nodes with lower level nodes or child nodes. A special node in a tree hierarchy is designated as the root node or null node. The root node has only outgoing edges (no incoming edges) and corresponds to the 0th or highest level of the tree. The level of a node is determined by the number of edges along a path connecting the node with the root node. The lowest level nodes of a tree are referred to as leaf nodes. Thus, a tree hierarchy may be used as a classification of information with the root node being the coarsest (all inclusive) classification and the leaf nodes being the finest classification.
FIG. 1 shows an exemplary tree hierarchy for data objects. In FIG. 1 the root node represents a cluster containing all the available information. Available information may be stored in data objects. Data objects may be, for example, Web pages or links. All data objects belong to the cluster represented by the root node (i.e. level 0). Data objects containing information relevant to the category "business" belong to a cluster represented by a level 1 node. Data objects containing information relevant to the category "recreation" also belong to a cluster represented by a level 1 node. Further, data objects containing information relevant to the category "education" belong to a cluster represented by a level 1 node. The nodes labeled "business", "recreation", and "education" are all child nodes of the root node. The category "business" may be further subdivided into the leaf categories of "large business" and "small business", as indicated by two level 2 nodes. Nodes labeled "large business" and "small business" are both child nodes of the node labeled "business". The category "recreation" may be further subdivided into the leaf categories of "movies", "games", and "travel", as indicated by three level 2 nodes. Nodes labeled "movies", "games", and "travel" are all child nodes of the node labeled "recreation". The category "Education" may be further subdivided into the leaf categories of "High-Schools", "colleges", "Universities", and "institutes", as indicated by four level 2 nodes. Nodes labeled "High-Schools", "colleges", "Universities", and "institutes" are all child nodes of the node labeled "Education".
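The tree hierarchy described for FIG. 1 may be sketched in code as follows. This is only an illustrative sketch: the dictionary layout and the helper names node_levels and leaf_nodes are assumptions made for the example and do not appear in the original description. The level of each node is computed as the number of edges on the path from the root node.

```python
from collections import deque

# Parent -> children mapping for the exemplary hierarchy of FIG. 1.
# The dictionary representation is an assumption made for illustration.
TREE = {
    "root": ["business", "recreation", "education"],
    "business": ["large business", "small business"],
    "recreation": ["movies", "games", "travel"],
    "education": ["High-Schools", "colleges", "Universities", "institutes"],
}


def node_levels(tree, root="root"):
    """Return the level of every node: the number of edges from the root."""
    levels = {root: 0}
    queue = deque([root])
    while queue:
        parent = queue.popleft()
        for child in tree.get(parent, []):
            levels[child] = levels[parent] + 1
            queue.append(child)
    return levels


def leaf_nodes(tree, root="root"):
    """Return the nodes that have no children (the finest classification)."""
    return [node for node in node_levels(tree, root) if not tree.get(node)]


levels = node_levels(TREE)
leaves = leaf_nodes(TREE)
```

In this sketch the root node is at level 0, the three subject categories are at level 1, and the nine leaf categories are at level 2.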
A tree hierarchy may serve as a guide for searching for a subject category of data objects in which a user may be interested. For example, a test document, containing keywords which indicate an area of interest, may be given by a user. Based on the test document, a tree hierarchy of subject categories may be searched for a node which matches the subject area sought by the user. Once a matching subject area is found, information associated with the matching subject area may be retrieved by the user.
Typically, a tree hierarchy may be searched in a top down fashion beginning with the root node and descending towards the leaf nodes. At each stage of a search, edges or branches are assigned a score. The branch with the highest score indicates the search (descent) direction of the tree. As higher levels of the tree are searched first, and as higher levels are often associated with broader subjects, errors in matching subject areas may lead to erroneous recommendation. In other words, as attaching a descriptive label to higher level nodes may be difficult, an error in matching a subject area to nodes at the beginning of a top down search may lead to a search through irrelevant branches of the tree.
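The top down search described above, and the way an early mismatch may propagate, may be sketched as follows. This is a minimal sketch under stated assumptions: the scoring function (a count of shared keywords), the keyword sets, and the names score and top_down_search are all hypothetical and are chosen only to illustrate the behavior, not to reproduce any particular prior method.

```python
# Hypothetical two-level hierarchy and per-node keyword sets (assumptions).
TREE = {
    "root": ["business", "recreation"],
    "business": ["large business", "small business"],
    "recreation": ["movies", "games", "travel"],
}
KEYWORDS = {
    "business": {"company", "market", "finance"},
    "recreation": {"leisure", "fun", "hobby"},
    "large business": {"corporation", "enterprise"},
    "small business": {"startup", "shop"},
    "movies": {"film", "cinema"},
    "games": {"play", "board"},
    "travel": {"flight", "hotel", "trip"},
}


def score(doc_words, node_keywords):
    """Assumed branch score: number of keywords shared with the document."""
    return len(doc_words & node_keywords)


def top_down_search(tree, keywords, doc_words, root="root"):
    """Descend from the root, always following the highest-scoring branch."""
    node = root
    path = [node]
    while tree.get(node):
        node = max(tree[node], key=lambda child: score(doc_words, keywords[child]))
        path.append(node)
    return path


# A test document about a business trip.
doc = {"company", "trip", "hotel", "flight"}
path = top_down_search(TREE, KEYWORDS, doc)
travel_score = score(doc, KEYWORDS["travel"])
```

In this example the single word "company" steers the search into the "business" branch at level 1, so the leaf "travel", which shares three keywords with the test document, is never examined.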
Forming a classification of data is referred to as generating a taxonomy (e.g. a tree hierarchy). The data which is used in order to generate a taxonomy is referred to as training data. The process of finding the closest matching subject area to a given test document is referred to as 'making content-based recommendations'. Methods for taxonomy generation and applications to document browsing and to performing recommendations have been previously proposed in the technical literature. For example, Douglas R. Cutting, David R. Karger, and Jan O. Pedersen, "Constant interaction-time scatter/gather browsing of large document collections," Proceedings of the ACM SIGIR, 1993; Douglas R. Cutting, David R. Karger, Jan O. Pedersen, and John W. Tukey, "Scatter/Gather: A cluster-based Approach to Browsing Large Document Collections," Proceedings of the ACM SIGIR, 1992, pp. 318-329; Hearst Marti A., and Pedersen J. O., "Re-examining the cluster hypothesis: Scatter/Gather on Retrieval Results," Proceedings of the ACM SIGIR, 1996, pp. 76-84; Anick P. G., and Vaithyanathan S., "Exploiting clustering and phrases for Context-Based Information Retrieval," Proceedings of the ACM SIGIR, 1997, pp. 314-322; and Schutze H., and Silverstein C., "Projections for efficient document clustering," Proceedings of the ACM SIGIR, 1997, pp. 74-81.
Exemplary applications of content-based recommendation methods are in facilitating a search by a user for information posted on the World Wide Web. The content of Web Pages may be analyzed in order to classify links to Web Pages in the appropriate category. Such a method is employed, for example, by WiseWire Corporation (recently acquired by Lycos Inc., http://www.lycos.com). Lycos builds a directory index for Web Pages using a combination of user feedback and so-called intelligent agents. Links to Web Pages may be organized in a hierarchical directory structure which is intended to deliver accurate search results. At the highest level of the hierarchy subject categories may be few and generic, while at the lower levels subjects may be more specific. A similar directory structure may be found in other search engines such as that employed by Yahoo Inc. (http://www.yahoo.com).
A graph taxonomy of information which is represented by a plurality of vectors is generated. The graph taxonomy includes a plurality of nodes and a plurality of edges. The plurality of nodes is generated, and each node of the plurality of nodes is associated with ones of the plurality of vectors. A tree hierarchy is established based on the plurality of nodes. A plurality of distances between ones of the plurality of nodes is calculated. Ones of the plurality of nodes are connected with other ones of the plurality of nodes by ones of the plurality of edges based on the plurality of distances.
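The steps summarized above may be illustrated by the following sketch. It rests on several assumptions that are not stated in the summary: each node is represented by the centroid of its associated vectors, distances are Euclidean, and an additional edge is added from a node to a non-child node one level below it whenever their distance does not exceed a threshold. Adding edges only from one level to the next lower level keeps the result a directed acyclic graph. The names centroid, levels_of, and graph_taxonomy, and the sample data, are hypothetical.

```python
import math
from itertools import product


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return tuple(sum(column) / len(vectors) for column in zip(*vectors))


def levels_of(tree, root):
    """Level of each node: number of tree edges on the path from the root."""
    levels, stack = {root: 0}, [root]
    while stack:
        parent = stack.pop()
        for child in tree.get(parent, []):
            levels[child] = levels[parent] + 1
            stack.append(child)
    return levels


def graph_taxonomy(tree, node_vectors, root, threshold):
    """Return tree edges plus distance-based edges between adjacent levels."""
    levels = levels_of(tree, root)
    centers = {node: centroid(vecs) for node, vecs in node_vectors.items()}
    edges = [(parent, child) for parent, kids in tree.items() for child in kids]
    tree_edges = set(edges)
    for u, v in product(levels, repeat=2):
        # Assumed rule: only connect a node to a node exactly one level below.
        if levels[v] == levels[u] + 1 and (u, v) not in tree_edges:
            if math.dist(centers[u], centers[v]) <= threshold:
                edges.append((u, v))
    return edges


# Hypothetical two-dimensional training vectors for the leaf nodes.
a1 = [(0, 0), (0, 2)]
b1 = [(2, 0), (2, 2)]
b2 = [(10, 0), (10, 2)]
tree = {"root": ["A", "B"], "A": ["a1"], "B": ["b1", "b2"]}
node_vectors = {
    "root": a1 + b1 + b2,
    "A": a1,
    "B": b1 + b2,
    "a1": a1,
    "b1": b1,
    "b2": b2,
}
edges = graph_taxonomy(tree, node_vectors, "root", threshold=3.0)
```

In this example the node b1 lies close to node A, so it receives a second incoming edge from A in addition to its tree edge from B. The result is therefore a graph rather than a tree, and a search that descends through A may still reach b1.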