1. Field of the Invention
The invention relates to the field of data processing. More specifically, the invention relates to the automatic selection of features of objects for use in classifying the objects into groups.
2. Background Information
The World Wide Web provides an important information resource, with estimates of billions of pages of information available for online viewing and downloading. In order to make efficient use of this information, however, a sensible method for navigating this huge expanse of data is necessary.
In the early days of Internet surfing, two basic methods were developed for assisting in Web searches. In the first approach, an indexed database is created based upon the contents of Web pages gathered by automated search engines which “crawl” the Web looking for new and unique pages. This database can then be searched using various query techniques, with the results often ranked on the basis of their similarity to the form of the query. In the second approach, Web pages are grouped into a categorical hierarchy, typically presented in a tree form. The user then makes a series of selections while descending the hierarchy, with two or more choices at each level representing salient differences between the subtrees below the decision point, ultimately reaching leaf nodes which contain pages of text and/or multimedia content.
For example, FIG. 1 illustrates an exemplary prior art subject hierarchy 102 in which multiple decision nodes (hereinafter “nodes”) 130-136 are hierarchically arranged into multiple parent and/or child nodes, each of which is associated with a unique subject category. For example, node 130 is a parent node to nodes 131 and 132, while nodes 131 and 132 are child nodes to node 130. Because nodes 131 and 132 are both child nodes of the same node (i.e., node 130), nodes 131 and 132 are said to be siblings of one another. Additional sibling pairs in subject hierarchy 102 include nodes 133 and 134, as well as nodes 135 and 136. It can be seen from FIG. 1 that node 130 forms a first level 137 of subject hierarchy 102, while nodes 131-132 form a second level 138 of subject hierarchy 102, and nodes 133-136 form a third level 139 of subject hierarchy 102. Additionally, node 130 is referred to as a root node of subject hierarchy 102 in that it is not a child of any other node.
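By way of illustration only, the parent, child, sibling, and level relationships described for subject hierarchy 102 can be sketched as a small tree data structure. The class name `Node` and its methods are hypothetical and are not part of the described prior art. The attachment of nodes 133-134 under node 131 and of nodes 135-136 under node 132 is an assumption, since the text above identifies the sibling pairs but not their parents.

```python
class Node:
    """A decision node in a subject hierarchy (hypothetical helper class)."""

    def __init__(self, label):
        self.label = label
        self.parent = None
        self.children = []

    def add_child(self, child):
        # Link the child to this node and return it for convenience.
        child.parent = self
        self.children.append(child)
        return child

    def siblings(self):
        # Siblings are the other children of the same parent.
        if self.parent is None:
            return []
        return [c for c in self.parent.children if c is not self]

    def level(self):
        # The root node is at level 1; each step down adds one level.
        depth, node = 1, self
        while node.parent is not None:
            depth += 1
            node = node.parent
        return depth


# Build the hierarchy; the parents of nodes 133-136 are assumed.
root = Node(130)
n131 = root.add_child(Node(131))
n132 = root.add_child(Node(132))
n133 = n131.add_child(Node(133))
n134 = n131.add_child(Node(134))
n135 = n132.add_child(Node(135))
n136 = n132.add_child(Node(136))
```

In this sketch, `n131.siblings()` yields only node 132, `root.level()` is 1, and `n135.level()` is 3, matching the first, second, and third levels described above.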
The process of creating a hierarchical categorization for Web pages presents multiple challenges. First, the nature of the hierarchy must be defined. This is typically done manually by experts within a particular subject area, in a manner similar to the creation of categories in the Dewey Decimal System for libraries. These categories are then provided with descriptive labels so that users and categorizers can make appropriate decisions while navigating the hierarchy. Content, for example in the form of individual electronic documents, is then placed into the categories by means of a manual search through the hierarchy.
In recent years, attention has been directed towards automating the various stages of this process. Systems exist for the automatic categorization of documents from a corpus of documents. For example, some systems utilize key words associated with documents to automatically cluster or group similar documents. Such clusters can be iteratively grouped into super-clusters, thus creating a hierarchical structure. However, these systems require manual insertion of key words, and produce a hierarchy with no systematic structure. If the hierarchy is to be used for manual search, labels must be affixed to the nodes of the hierarchy by manual examination of the subnodes or leaf documents to identify common feature(s).
Many classification systems utilize lists of salient words for classifying documents. Typically, salient words are either predefined or selected from the documents being processed to more accurately characterize the documents. Commonly these salient word lists are created by counting the frequency of occurrence of all words for each of a set of documents. Words are then removed from the word lists according to one or more criteria. Often, words that occur too few times within the corpus are eliminated, since such words are too rare to reliably distinguish among categories, whereas words that occur too frequently are eliminated, because such words are assumed to occur commonly in all documents across categories.
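The frequency-based pruning described above can be sketched, for illustration only, as follows. The function name `salient_words`, the whitespace tokenization, and the two thresholds (`min_count` and `max_doc_fraction`) are assumptions chosen for the example; the text above does not specify particular criteria or values.

```python
from collections import Counter


def salient_words(documents, min_count=2, max_doc_fraction=0.8):
    """Return words that are neither too rare nor too common (illustrative)."""
    total = Counter()      # occurrences of each word across the corpus
    doc_freq = Counter()   # number of documents containing each word
    for text in documents:
        words = text.lower().split()
        total.update(words)
        doc_freq.update(set(words))

    n_docs = len(documents)
    return sorted(
        w for w in total
        # Drop words that occur too few times within the corpus.
        if total[w] >= min_count
        # Drop words that appear in too large a share of the documents.
        and doc_freq[w] / n_docs <= max_doc_fraction
    )


docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the stock market fell",
    "the market rallied",
]
print(salient_words(docs))  # ['cat', 'market']
```

Here “the” is removed because it appears in every document, and words such as “dog” and “fell” are removed because they occur only once in the corpus.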
Further, “stop words” and word stems are often eliminated from feature lists to facilitate salient feature determination. Stop words comprise words which are common in the language such as “a”, “the”, “his”, and “and”, which are felt to carry no semantic content, whereas word stems represent suffixes such as “-ing”, “-end”, “-is”, and “-able”. Unfortunately, the creation of stop word and word stem lists is a language-specific task, requiring expert knowledge of syntax, grammar, and usage, which may change with time. A more flexible way of determining salient features is therefore desirable.
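To illustrate why this approach is language-specific, the following minimal sketch filters features using hand-written lists. The names `STOP_WORDS`, `SUFFIXES`, and `normalize` are hypothetical. The stop words are those quoted above, the suffix list is limited to two of the quoted examples, and the minimum-length guard is an assumption added so that short words such as “ring” are left intact.

```python
# Hand-maintained, English-only lists (illustrative, not exhaustive).
STOP_WORDS = {"a", "the", "his", "and"}
SUFFIXES = ("ing", "able")


def normalize(text):
    """Drop stop words and strip listed suffixes from the remaining words."""
    features = []
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue
        for suffix in SUFFIXES:
            # Assumed guard: keep at least three characters after stripping.
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                word = word[: -len(suffix)]
                break
        features.append(word)
    return features


print(normalize("the readable and his walking dog"))  # ['read', 'walk', 'dog']
```

Both lists would have to be rewritten by hand for each additional language and revised as usage changes.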