The present invention relates to the field of electronic document collections, and in particular to forming clusters of electronic documents based on some measure of similarity between the electronic documents.
There is an ever-increasing volume of electronic information available to a user, for example on the World-Wide Web (hereinafter referred to as the Web). Large collections of Web documents are common in many Web-based activities, including searches and analysis of trends and use patterns. Also, as automated commerce develops, manipulating such collections becomes an important activity for software agents. Both consumers and producers of information would therefore like to understand what kinds of information are available, how desirable the information is, and how its content and use change through time.
Since these collections of data are often unmanageably large, they are often grouped into clusters based on some measure of similarity between documents. Such measures can be based on similar key words, link structures, use patterns or recommendations. Given a choice of similarity measurements, a clustering algorithm is used to form clusters or clusterings containing documents with some form of relatedness. Numerous clustering algorithms have been developed and are in use. For example, clustering algorithms are used by search engines such as
Northern Light (url: http://www.northernlight.com) or Lycos (url: http://www.lycos.com).
A specific approach to the clustering of documents involves computing inter-document similarities based on content-word frequency statistics. However, this technique has two drawbacks: it is often expensive and, more importantly, it was developed and tuned on human-readable texts. It appears, though, that the proportion of human-readable source files for web pages is decreasing with the infusion of dynamic and programmed pages.
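As an illustration of the content-word frequency approach described above, the following minimal sketch computes a cosine similarity between word-count vectors. The stop-word list, the tokenization rule, and the function names `word_frequencies` and `cosine_similarity` are assumptions made for this example; the text does not specify which statistic any particular system uses.

```python
import math
import re
from collections import Counter

# Hypothetical stop-word list; a practical system would use a much larger one.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def word_frequencies(text):
    """Count content words: lowercase alphabetic tokens minus stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(freq_a, freq_b):
    """Cosine of the angle between two word-frequency vectors (0.0 to 1.0)."""
    shared = freq_a.keys() & freq_b.keys()
    dot = sum(freq_a[w] * freq_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in freq_a.values()))
    norm_b = math.sqrt(sum(c * c for c in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [
        "Clustering of web documents by word frequency",
        "Word frequency statistics for clustering documents",
        "Self-assembly of molecular structures",
    ]
    freqs = [word_frequencies(d) for d in docs]
    # Print the pairwise similarity for every pair of documents.
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            print(i, j, round(cosine_similarity(freqs[i], freqs[j]), 3))
```

Every pair of documents must be compared, so the cost grows quadratically with collection size, which is one reason the technique is described as expensive. A page consisting mostly of script or generated markup yields few meaningful content words, which illustrates the second drawback.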
Another option for clustering documents is to look at usage patterns. Unfortunately, any clustering based on usage patterns requires access to data that is not usually recorded in any easily accessible format.
Other attempts at clustering hypertext, for example, typically utilize the hypertext link topology of the collection. Such a basis for clustering makes intuitive sense since the links of a particular document represent what the author felt was of interest to the reader of the document. However, such systems are not particularly suited to scale gracefully to large heterogeneous collections like the web.
While the foregoing described clustering techniques are only a representative sampling of those in existence, a common thread among these and others is the use of similarity measures between documents to form a cluster. Unfortunately, similarity measures that are easy to compute do not necessarily capture the relevant aspects of similarity for particular applications. For instance, measures based on matching words can be misled by multiple meanings for words or synonyms. Measures based on link structure or use patterns may be confused by the use of out-of-date web crawlers and the many different types of users and reasons links are included in pages. Such errors in similarity measures can reduce the usefulness of the clusters.
Another problem arises when web documents are returned over a long time interval. In some situations, it may be desirable to begin clustering when only a few documents are available to present some immediate feedback to a user. Once such initial clusters are presented, additional documents that arrive may be forced into inappropriate clusters or force significant changes in the clustering. An example is the on-line hierarchical clustering of news stories. In this case, it may be desirable to have a clustering algorithm that is insensitive to the order in which the stories are received.
In considering the shortcomings of existing clustering algorithms, the inventors have investigated the field of manufacturing self-assembly. Manufacturing often builds objects from their components by directly placing them in the necessary arrangements. Common examples include buildings, cars and electronic circuits. This technique requires knowledge of the precise structure needed to serve a desired function, the ability to create the components with the necessary tolerances and the ability to place each component in its proper location in the final structure.
When these requirements cannot be met, self-assembly offers another approach to building structures from components. This process involves a statistical exploration of many possible structures before settling into the final one. The particular structure produced from given components is determined by biases in the exploration, given by component interactions. These biases arise when the strength of interaction between components depends on their relative locations in the structure. Interactions can reflect constraints on the desirability of a component being near its neighbors in the final structure. For each possible structure, the interactions combine to give a measure of the extent to which the constraints are violated, which can be viewed as a cost or energy for that structure. Through the biased statistical exploration of structures, each set of components tends to assemble into the structure with the minimum energy for that set. In these terms, self-assembly can be viewed as a process using a local specification, in terms of components and their interactions, to produce a resulting global structure. The local specification is, in effect, a set of instructions that implicitly describes the resulting structure.
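The idea of a biased statistical exploration that settles into a minimum-energy structure can be sketched as follows. This is only an illustrative analogy, not a method taken from the text: the components are items to be grouped, a structure is an assignment of items to groups, the pairwise energy terms are an assumed form, and the Metropolis acceptance rule is a standard choice swapped in for the unspecified exploration process. The names `energy` and `self_assemble` and all parameter values are hypothetical.

```python
import math
import random


def energy(assignment, similarity):
    """Total constraint violation for one candidate structure."""
    total = 0.0
    n = len(assignment)
    for i in range(n):
        for j in range(i + 1, n):
            s = similarity[i][j]
            # Dissimilar items placed together, or similar items kept apart,
            # both count as violations (assumed form of the interaction).
            total += (1.0 - s) if assignment[i] == assignment[j] else s
    return total


def self_assemble(similarity, n_groups, steps=2000, temperature=0.2, seed=0):
    """Randomly explore assignments, biased toward lower energy."""
    rng = random.Random(seed)
    n = len(similarity)
    current = [rng.randrange(n_groups) for _ in range(n)]
    current_e = energy(current, similarity)
    best, best_e = list(current), current_e
    for _ in range(steps):
        i = rng.randrange(n)
        old = current[i]
        current[i] = rng.randrange(n_groups)  # propose moving one component
        new_e = energy(current, similarity)
        delta = new_e - current_e
        # Metropolis rule: always accept downhill moves, sometimes uphill ones.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current_e = new_e
            if current_e < best_e:
                best, best_e = list(current), current_e
        else:
            current[i] = old  # reject the move and restore the old position
    return best, best_e


if __name__ == "__main__":
    # Items 0 and 1 interact strongly, as do items 2 and 3.
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.9],
        [0.1, 0.1, 0.9, 1.0],
    ]
    print(self_assemble(sim, n_groups=2))
```

Only the pairwise interactions are specified; the grouping itself emerges from the exploration, which mirrors the description of a local specification implicitly determining a global structure.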
Self-assembly can be very precise in spite of the inherently statistical nature of the process. Examples include chemical reactions driven by diffusive mixing of the reactants, such as the creation of polymers, proteins, and molecular assemblies, patterned mesoscale objects and structures consisting of tiny robots. This technique can also automatically reconfigure structures when their environments or task requirements change, or when a few components break.
The inventors have determined that concepts from the field of self-assembly can be applied to the clustering of web-based documents in order to generate more useful and robust clusters.
A method and apparatus for clustering documents from a set of documents includes issuing a request from an electronic device for documents which are relevant to the request. The documents considered to meet the request requirements are identified and are thereafter supplied to a document clustering system. The document clustering system operates to form clusterings of the documents. A self-assembly process is then instituted to obtain robust clusterings, and the robust clusterings are displayed for the user's view. Robust clusterings have a higher designability than the other clusterings which are generated, where designability refers to the number of ways a particular clustering can be formed.
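One way to picture designability is to count how many different similarity measures lead to the same clustering. The sketch below is an assumption-laden illustration, not the procedure of the invention: it perturbs a similarity matrix with random noise, clusters each perturbed copy with a simple threshold rule, and tallies how often each clustering appears. The function names `threshold_clusters` and `designability`, the noise model, and the threshold value are all hypothetical.

```python
import random
from collections import Counter


def threshold_clusters(similarity, threshold):
    """Group items connected by similarity >= threshold (connected components)."""
    n = len(similarity)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, shortening the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i][j] >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    # Canonical form so that identical clusterings compare equal.
    return tuple(sorted(tuple(g) for g in groups.values()))


def designability(similarity, trials=200, noise=0.3, threshold=0.5, seed=0):
    """Count how many perturbed similarity measures yield each clustering."""
    rng = random.Random(seed)
    n = len(similarity)
    counts = Counter()
    for _ in range(trials):
        noisy = [[0.0] * n for _ in range(n)]
        for i in range(n):
            for j in range(i + 1, n):
                value = similarity[i][j] + rng.uniform(-noise, noise)
                noisy[i][j] = noisy[j][i] = value
        counts[threshold_clusters(noisy, threshold)] += 1
    return counts


if __name__ == "__main__":
    sim = [
        [1.0, 0.9, 0.1, 0.1],
        [0.9, 1.0, 0.1, 0.1],
        [0.1, 0.1, 1.0, 0.9],
        [0.1, 0.1, 0.9, 1.0],
    ]
    counts = designability(sim, noise=0.6)
    # The clustering reached in the largest number of ways is the most robust.
    print(counts.most_common(3))
```

Under this reading, a clustering that survives many perturbations of the similarity measure is less sensitive to errors in that measure, which is the sense in which a highly designable clustering is robust.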