1. Field of the Invention
The invention relates to the field of data processing. More specifically, the invention relates to the automatic analysis of the content of electronic data objects and the categorization of the electronic data objects into one or more discrete categories.
2. Background Information
The Internet consists of billions of discrete pages, which can be accessed from any browser-equipped computer or appliance connected to the World Wide Web (hereinafter “Web”). The availability of so many pages simultaneously represents both a boon and a bane to the user. Information, opinion, and news are available about a vast array of topics, but the challenge is to find those pages of the Web which are most relevant to the particular needs or desires of the user at any given moment.
A number of search engines are available on the Web for free use. These search engines typically index some fraction of the pages available on the Web, and provide users with the ability to search for information on the Web using keywords. A user, however, may not know how to correctly formulate a search query to find the most appropriate page(s).
Another method of organizing the Web is the use of categorical hierarchies. Certain companies have analyzed the contents of tens or hundreds of thousands of web pages, placing each page into one or more of the categories in their particular subject hierarchy. Users can then browse such subject hierarchies, or search through them based upon keywords. Such searches provide results annotated with the subject area of the target page, which can assist the user in determining whether the page might be relevant to the actual topic being searched.
FIG. 10 illustrates an exemplary prior art subject hierarchy 1002 in which multiple decision nodes (hereinafter “nodes”) 1030-1036 are hierarchically arranged into multiple parent and/or child nodes, each of which is associated with a unique subject category. For example, node 1030 is a parent node to nodes 1031 and 1032, while nodes 1031 and 1032 are child nodes to node 1030. Because nodes 1031 and 1032 are both child nodes of the same node (i.e., node 1030), nodes 1031 and 1032 are said to be siblings of one another. Additional sibling pairs in subject hierarchy 1002 include nodes 1033 and 1034, as well as nodes 1035 and 1036. It can be seen from FIG. 10 that node 1030 forms a first level 1037 of subject hierarchy 1002, while nodes 1031-1032 form a second level 1038 of subject hierarchy 1002, and nodes 1033-1036 form a third level 1039 of subject hierarchy 1002. Additionally, node 1030 is referred to as a root node of subject hierarchy 1002 in that it is not a child of any other node.
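By way of illustration only, the parent, child, sibling, level, and root relationships described for subject hierarchy 1002 may be sketched as a simple tree structure. The class name Node, its methods, and the function build_example_hierarchy are hypothetical names chosen for this sketch and do not appear in FIG. 10. The text identifies the sibling pairs 1033/1034 and 1035/1036 but does not state their parents, so the attachment of the first pair to node 1031 and the second pair to node 1032 is an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass(eq=False)
class Node:
    """One decision node; each node stands for a single subject category."""
    node_id: int
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Node") -> "Node":
        """Attach child beneath this node and record the back-reference."""
        child.parent = self
        self.children.append(child)
        return child

    def is_root(self) -> bool:
        """A root node is not a child of any other node."""
        return self.parent is None

    def siblings(self) -> List["Node"]:
        """Other nodes that share this node's parent."""
        if self.parent is None:
            return []
        return [n for n in self.parent.children if n is not self]

    def level(self) -> int:
        """1 for the root, 2 for its children, and so on."""
        depth = 1
        node = self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


def build_example_hierarchy() -> Dict[int, Node]:
    """Build a tree shaped like subject hierarchy 1002, keyed by reference numeral."""
    nodes = {i: Node(i) for i in range(1030, 1037)}
    # Stated in the text: 1030 is the parent of 1031 and 1032.
    nodes[1030].add_child(nodes[1031])
    nodes[1030].add_child(nodes[1032])
    # Assumption: which second-level node owns each third-level sibling pair.
    nodes[1031].add_child(nodes[1033])
    nodes[1031].add_child(nodes[1034])
    nodes[1032].add_child(nodes[1035])
    nodes[1032].add_child(nodes[1036])
    return nodes


if __name__ == "__main__":
    hierarchy = build_example_hierarchy()
    for node_id, node in sorted(hierarchy.items()):
        sibling_ids = [s.node_id for s in node.siblings()]
        print(node_id, "level", node.level(), "siblings", sibling_ids)
```

In this sketch the level of a node is computed by walking parent references up to the root rather than being stored, so the three levels 1037-1039 follow directly from the shape of the tree.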
In general, search hierarchies are filled with pages by manual classification of individual web pages using the talents of experts in particular subject fields. This method has several problems, including the cost of finding experts to perform the classification, and the unavoidable delay between the time a site is placed on the Web and the time (if ever) it enters the classification hierarchy. Moreover, a grader who is expert in one subject area may misclassify a page belonging to another subject, which can make the page more difficult for the casual browser to find.
Although the automatic classification of documents is an active area of research, existing systems typically work with only a limited number of subject fields and often display poor performance. Therefore, what is desired is an automatic system for classifying a large number of documents quickly and effectively into a large subject hierarchy.