Anyone who has searched the World Wide Web using a search site, such as Google or Yahoo!, is familiar with at least one of two ways of finding information: providing a textual query to the search engine describing the information sought (e.g., “Siamese cats”), or browsing through a hierarchical list of categories provided by the site. In the latter case, for example, one might select the category “Animals,” followed by “Mammals,” “Felines,” and “Domestic Cats” to arrive at a list of documents about Siamese cats available on the World Wide Web.
The hierarchical list of categories provided by a search site is one example of a taxonomy. More generally, a taxonomy is a tree structure of hierarchically ordered categories used to classify objects and/or data. Taxonomies are often used to facilitate the systematic retrieval of relevant information from large amounts of stored data, as the example of the Internet search engine demonstrates.
For a taxonomy to be useful for these purposes, the data must first be classified according to the taxonomy by associating each datum (e.g., a document) with one or more nodes in the taxonomy. For example, documents that relate to Siamese cats must be tagged in some way as being associated with the “Domestic Cats” node in the taxonomy if the taxonomy-browsing technique described above is to successfully retrieve web pages relating to Siamese cats.
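The relationship between a taxonomy and the data classified under it can be illustrated with a minimal sketch. The Python code below is only one possible representation, offered under stated assumptions: the class name TaxonomyNode, its methods, the "root" node, and the document identifier "siamese-cats.html" are all hypothetical and are not drawn from any particular system. It shows a category tree in which each node records the documents associated with it, so that browsing a category path retrieves those documents.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One category in the tree (hypothetical illustration)."""
    name: str
    children: dict[str, TaxonomyNode] = field(default_factory=dict)
    documents: set[str] = field(default_factory=set)  # IDs tagged with this node

    def add_child(self, name: str) -> TaxonomyNode:
        # Reuse an existing child so repeated calls do not duplicate a category.
        return self.children.setdefault(name, TaxonomyNode(name))

    def find(self, path: list[str]) -> TaxonomyNode:
        # Walk down from this node, one category name per level.
        node = self
        for name in path:
            node = node.children[name]
        return node


# Build the category path from the browsing example.
root = TaxonomyNode("root")
path = ["Animals", "Mammals", "Felines", "Domestic Cats"]
node = root
for name in path:
    node = node.add_child(name)

# Classify a document by associating it with the "Domestic Cats" node.
node.documents.add("siamese-cats.html")  # hypothetical document ID

# Browsing the same path later retrieves the tagged document.
found = root.find(path)
print(sorted(found.documents))
```

In this sketch a document may be added to the documents set of more than one node, which corresponds to associating a datum with one or more nodes in the taxonomy.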
Classifying data according to a taxonomy is a difficult problem, particularly if a large amount of data must be classified. Even classifying a single document may be tedious, time-consuming, and error-prone due to the need to: (1) analyze the content of the document, (2) identify any relationships between the document content and the classes defined by nodes in the taxonomy, and (3) identify one or more such nodes with which to associate the document. In many environments, such as corporate or academic intranets, it may be necessary or desirable to perform such classification on millions of documents, to re-classify documents as they change, and to continually classify new documents as they are added to the system. It is particularly desirable to perform such classification as efficiently, reliably, and automatically as possible.