In information searching and retrieval, particularly on the internet, a number of techniques are known that "discover" features in sets of pre-categorized documents, such that similar documents can be found. While such techniques are capable of classifying documents with high accuracy, they are not necessarily useful in locating all types of similar documents.
For example, at the moment, most internet users are given two options for finding more documents similar to ones they already have. The first is to go to a search engine provider that has collected thousands of documents into pre-categorized topics. With this option, one simply searches the topic hierarchy for similar documents. The other option is to go to an ISP that is essentially a keyword search engine. Such search-engine ISPs collect millions of documents and allow users to construct keyword searches for similar documents.
However, both of these approaches have their flaws. The pre-categorized providers cannot provide a personal, idiosyncratic view of the web. Users will likely want to search document categories that are not provided. On the other hand, the word-search engines have the problem of not being able to capture the "meaning" of a set of user documents so that an appropriate query can be created.
As a result, a technique is needed for extracting word clusters from the raw document features. Such a clustering technique would be successful in discovering word groups which can be used to find similar information. In such a fashion, an internet user would be able to "personalize the web". More particularly, users would assemble a group of documents as they choose, then each group would be examined to determine what is unique about those documents, and a set of words would be provided to identify those documents. These word sets would be very useful in finding similar documents, much more so than word sets generated by users.
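The idea of examining a user-assembled group of documents to determine what is unique about it can be illustrated with a minimal Python sketch. This is not the clustering technique itself, only a simple stand-in chosen for illustration: it scores each word by how much more often it appears in the group's documents than in a background collection, using a smoothed log ratio of document frequencies. The function names (`tokenize`, `document_frequency`, `distinctive_words`), the smoothing constants, and the sample documents are all assumptions introduced here.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def document_frequency(docs):
    """Count, for each word, how many documents contain it at least once."""
    df = Counter()
    for doc in docs:
        df.update(set(tokenize(doc)))
    return df


def distinctive_words(group_docs, background_docs, top_k=5):
    """Return the top_k words most characteristic of group_docs.

    Hypothetical scoring: log ratio of smoothed document frequencies
    (add-one smoothing), group versus background. Ties are broken
    alphabetically so the result is deterministic.
    """
    group_df = document_frequency(group_docs)
    background_df = document_frequency(background_docs)
    n_group = len(group_docs)
    n_background = len(background_docs)

    scores = {}
    for word, count in group_df.items():
        p_group = (count + 1) / (n_group + 2)
        # Counter returns 0 for words never seen in the background.
        p_background = (background_df[word] + 1) / (n_background + 2)
        scores[word] = math.log(p_group / p_background)

    ranked = sorted(scores, key=lambda w: (-scores[w], w))
    return ranked[:top_k]


if __name__ == "__main__":
    # Illustrative documents only; a real collection would be far larger.
    group = [
        "jazz saxophone solo recordings",
        "jazz trumpet and saxophone quartet",
    ]
    background = [
        "stock market report",
        "weather report for the weekend",
        "the market for used cars",
    ]
    print(distinctive_words(group, background, top_k=2))
```

Under these assumptions, the returned word set could then be submitted as a keyword query to find further documents resembling the group. A word-clustering technique of the kind called for above would go further than this per-word score by grouping related words together.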