Information has traditionally been classified manually to help potential readers locate works of interest. For example, books are typically associated with categorization information (e.g., one or more Library of Congress classifications), and academic articles sometimes bear a list of keywords selected by their authors or editors. Unfortunately, while manual classification may be routinely performed for certain types of information, such as books and academic papers, it is not performed (and may not be feasible) for other types of information, such as the predominantly unstructured data found on the World Wide Web.
Attempts to automatically classify documents can also be problematic. For example, one technique for document classification is to designate as the topic of a given document the term occurring most frequently in that document. One problem with this approach is that, in some cases, the most frequently occurring term in a document is not a meaningful description of the document itself (a common function word such as “the” is one example). Another problem is that terms can have ambiguous meanings. For example, documents about Panthera onca (the big cat), the British luxury car manufacturer, and the operating system might all be automatically classified using the term “Jaguar.” As a result, a reader interested in one meaning of the term may have to sift through documents pertaining to all of the other meanings as well.
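The frequency-based technique described above, and both of its shortcomings, can be illustrated with a minimal sketch. The function name most_frequent_term, the simple word-splitting rule, and the sample texts are all hypothetical and chosen only for illustration; they are not drawn from any particular classification system.

```python
import re
from collections import Counter


def most_frequent_term(text):
    """Return the most frequent lowercase word in text, or None if there are no words.

    Illustrative only: a real system would likely tokenize more carefully.
    """
    terms = re.findall(r"[a-z]+", text.lower())
    if not terms:
        return None
    # most_common(1) yields a one-element list of (term, count) pairs.
    return Counter(terms).most_common(1)[0][0]


# Hypothetical sample documents about three unrelated subjects.
documents = {
    "animal": "Jaguar cubs stay with the mother jaguar until a young jaguar can hunt.",
    "car": "Jaguar built sports cars in Coventry, and Jaguar sedans made Jaguar famous.",
    "software": "Install Jaguar from the disc, then restart so Jaguar can finish; "
                "Jaguar needs a reboot.",
}

# Ambiguity: all three documents receive the same topic label.
labels = {name: most_frequent_term(text) for name, text in documents.items()}

# Uninformative label: with no filtering, a function word wins.
uninformative = most_frequent_term("The cat sat on the mat near the door.")

print(labels)
print(uninformative)
```

In this sketch, every sample document is labeled with the same term even though the subjects differ, and the last sentence is labeled with a word that says nothing about its content.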