Applicants claim the foreign priority benefits under 35 U.S.C. 119 of U.K. Patent Application No. 9517988.3 filed Sep. 4, 1995, which is incorporated by reference into this application.
This invention relates to an interactive, tree structured, graphical visualization aid for use with digitally stored collections of data elements, such as documents, programs and other data files.
Due to the increased availability and use of CD-ROM storage media, the use of digitally stored textual matter or other information such as sound, image and video files has become more prevalent among computer users. Using modern CD-ROM technology, a small number of CD-ROMs can be used to store a large collection of information. Typically, the information is organized by topic for ease of access. Thus, when producing a collection of this type, information elements must be manually gathered within clusters that deal with the same or related topics. In the case of documents or books, such clusters are sometimes referred to as bookshelves. This tedious organization task can be automated using cluster analysis techniques. Unfortunately, as described below, standard numerical clustering techniques generate clustering hierarchies that are difficult for non-expert users to interpret.
A wide range of cluster analysis techniques have been developed for identifying underlying structures in large sets of objects and revealing links between objects or classes of objects. In the following, the objects to which the clustering process is applied will be referred to as information elements or data elements. There is no strict definition of a cluster, but in general terms a cluster is a group of objects whose members are more similar to each other than to the members of any other group. Typically, the goal of cluster analysis is to determine a set of clusters, such that inter-cluster similarity is low and intra-cluster similarity is high.
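By way of illustration only, the goal stated above can be expressed as a comparison of two averages. The following minimal sketch assumes a hypothetical similarity measure (the Jaccard coefficient over keyword sets) and hypothetical function names; none of these are taken from any particular clustering technique discussed here.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Hypothetical similarity measure: shared keywords / all keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def _mean(values):
    values = list(values)
    return sum(values) / len(values) if values else 0.0


def intra_cluster_similarity(clusters, similarity):
    """Average similarity over pairs of elements in the same cluster."""
    return _mean(
        similarity(a, b)
        for cluster in clusters
        for a, b in combinations(cluster, 2)
    )


def inter_cluster_similarity(clusters, similarity):
    """Average similarity over pairs of elements in different clusters."""
    return _mean(
        similarity(a, b)
        for c1, c2 in combinations(clusters, 2)
        for a, b in product(c1, c2)
    )
```

Under these assumptions, a clustering would be considered good when `intra_cluster_similarity` is high and `inter_cluster_similarity` is low for the same set of clusters.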
One well-known clustering technique is Hierarchical Agglomerative Clustering (HAC). This method takes as input a collection of objects and organizes them into a binary cluster hierarchy, or dendrogram. The key characteristic of a dendrogram is that each node represents a cluster formed by merging the clusters which are its direct descendants in the tree. A leaf is a singleton cluster containing a single information element. Each level of the dendrogram, from the leaves to the root, forms a partition of the original set of elements.
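The agglomerative procedure described above may be sketched as follows. This is an illustrative sketch only: it assumes a non-empty input, a caller-supplied distance function, and single linkage (cluster distance taken as the smallest distance between members) as the merge criterion, which is one of several possible choices. The function name and the nested-tuple representation of the dendrogram are hypothetical.

```python
from itertools import combinations


def hac(points, distance):
    """Build a binary dendrogram as nested tuples; leaves are indices."""
    # Each active cluster is a pair (subtree, list of member indices).
    # Initially every element is a singleton cluster, i.e. a leaf.
    clusters = [(i, [i]) for i in range(len(points))]

    def linkage(a, b):
        # Single linkage (an assumption): the closest pair of members.
        return min(
            distance(points[i], points[j]) for i in a[1] for j in b[1]
        )

    while len(clusters) > 1:
        # Find the two closest clusters and merge them into a new node
        # whose direct descendants are the two merged clusters.
        a, b = min(combinations(clusters, 2), key=lambda p: linkage(*p))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(((a[0], b[0]), a[1] + b[1]))

    return clusters[0][0]
```

For example, calling `hac([0, 1, 10, 12, 30], lambda x, y: abs(x - y))` first merges elements 0 and 1, then elements 2 and 3, then those two clusters, and finally joins the outlying element 4 at the root.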
However, making use of dendrograms to enable a user to understand the underlying structure of a collection of information elements has certain drawbacks.
First, dendrograms are difficult for users to visualize since they are laid out as trees and it is often difficult for novice users to understand that each node represents a cluster of information elements.
Secondly, dendrograms are difficult to interpret. One major weakness of numerical clustering algorithms is that clusters are defined extensionally, i.e., by enumeration of their members, rather than intensionally, i.e., by membership rules. In other words, the mere fact that a number of information elements have been grouped together in a cluster tells the user nothing in itself about the characteristics of the elements that have led to them being grouped in such a manner.
The problem of displaying clusters of information elements to the user has already been addressed. It has often been proposed to represent the cluster information defined by the dendrogram not as a tree diagram, but in a completely different manner. One typical example is the layout proposed by R. A. Botafogo in "Cluster Analysis for Hypertext Systems," Proceedings of ACM SIGIR'93 (1993) (see in particular FIG. 7, p. 122), which represents the pairwise similarity between documents, as well as clusters, as levels in a 2-dimensional space.