A common problem in computerized data analysis is forming groups, or clusters, of similar items based on a number of variables describing the items. For example, in a business environment it is often important to form customer groups for precision marketing. The overall goal of clustering is to divide the data into a number of classes, using the variables that describe the data, such that each class contains members that are similar to each other and dissimilar to members of other classes. There are many known techniques for performing clustering. One of the most common techniques is called hierarchical clustering.
Hierarchical clustering does not require, as some other prior art techniques do, that the number of resulting clusters be pre-defined. Instead, the hierarchical clustering technique builds a binary tree in which the original data items are the leaves and the interior nodes represent clusters of items. Each interior node also stores a measure of the dissimilarity between the node's two child clusters. Once the binary tree is created, a user analyzing the data can cut the tree at a given level of dissimilarity to create clusterings with different numbers of groups without the need to re-run the clustering algorithm. This ability is very important in the study of large data sets because it allows a user to run a potentially very slow algorithm on a large data set one time, and then examine the resulting structure in various ways without recreating the tree structure. Methods for performing hierarchical clustering are well known in the art and will therefore not be described in detail herein. Such methods are described in Everitt, B. S., Cluster Analysis, 3d ed., Halsted Press, N.Y. (1993), which is incorporated by reference herein. The particular method used to perform the hierarchical clustering is not critical to the present invention.
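By way of illustration only, the tree structure and the cutting operation described above may be sketched as follows. This is a minimal sketch and not a description of any particular prior art implementation. It assumes one-dimensional data and a single-linkage dissimilarity measure, and the names Node, build_tree and cut are hypothetical and chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Node:
    items: List[int]                 # indices of the original data items under this node
    dissimilarity: float = 0.0       # dissimilarity between the two child clusters (0 for a leaf)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def single_linkage(a: Node, b: Node, points: Sequence[float]) -> float:
    """Assumed measure: smallest distance between any item of a and any item of b."""
    return min(abs(points[i] - points[j]) for i in a.items for j in b.items)


def build_tree(points: Sequence[float]) -> Node:
    """Repeatedly merge the two least dissimilar clusters until one root remains."""
    clusters = [Node([i]) for i in range(len(points))]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_linkage(clusters[i], clusters[j], points)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = Node(clusters[i].items + clusters[j].items, d, clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


def cut(node: Node, threshold: float) -> List[List[int]]:
    """Cut the tree at a dissimilarity level; each returned list is one group."""
    if node.left is None or node.dissimilarity <= threshold:
        return [sorted(node.items)]
    return cut(node.left, threshold) + cut(node.right, threshold)


# The (slow) tree construction runs once; the tree is then cut at several levels.
points = [1.0, 2.0, 10.0, 12.0, 30.0]
tree = build_tree(points)
for level in (5.0, 10.0, 20.0):
    print(level, sorted(cut(tree, level)))
```

In this sketch the tree is built a single time, and each call to cut merely walks the existing tree, which is why different numbers of groups can be examined without re-running the clustering step.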
Once the tree structure has been created using an appropriate hierarchical clustering method, the tree must be visualized, i.e., a representation of the tree must be generated and displayed on the computer screen for a user. One technique for visualizing the results of a hierarchical clustering algorithm is simply to generate and display a view of the tree structure. However, this technique becomes too cumbersome with even moderately sized data sets.
A better technique is to generate and display a tree-map, which is a technique for visualizing a tree that makes maximal use of screen space. The basic version takes a specified rectangular area and recursively subdivides it based on the tree structure. The method looks at the first level of the tree and splits the viewing area horizontally into n rectangles, where n is the number of children of the first node. Each rectangle is allocated an area proportional to the size of the subtree beneath the corresponding child node. The method then looks at the next level of the tree and performs the same algorithm for each node, except that it divides the area vertically. The algorithm continues this subdivision in alternating directions until either the maximum specified depth or a leaf node is reached. In either case, the rectangular area for that node is then drawn with user-specified characteristics such as color, shading and labeling. The algorithm for generating a tree-map is well known in the art and is described in Shneiderman, Tree Visualization with Tree Maps: a 2D Space-Filling Approach, ACM Transactions on Graphics, January 1992, which is incorporated herein by reference.
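By way of illustration only, the alternating subdivision described above may be sketched as follows. This is a minimal sketch under stated assumptions and not a reproduction of the referenced algorithm. It assumes that the size of a subtree is its number of leaves and that a horizontal split places the child rectangles side by side along the x axis. The names TreeNode, leaf_count and tree_map are hypothetical, and drawing attributes such as color and shading are omitted.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TreeNode:
    label: str
    children: List["TreeNode"] = field(default_factory=list)


# One output rectangle: (label, x, y, width, height)
Rect = Tuple[str, float, float, float, float]


def leaf_count(node: TreeNode) -> int:
    """Assumed size measure: the number of leaves beneath the node."""
    if not node.children:
        return 1
    return sum(leaf_count(child) for child in node.children)


def tree_map(node: TreeNode, x: float, y: float, w: float, h: float,
             max_depth: int, depth: int = 0, horizontal: bool = True) -> List[Rect]:
    """Return the rectangles to draw for the subtree rooted at node."""
    # Stop at a leaf or at the maximum specified depth and emit this node's area.
    if not node.children or depth == max_depth:
        return [(node.label, x, y, w, h)]
    total = leaf_count(node)
    rects: List[Rect] = []
    offset = 0.0  # fraction of the area already handed to earlier children
    for child in node.children:
        share = leaf_count(child) / total  # area proportional to subtree size
        if horizontal:
            rects += tree_map(child, x + offset * w, y, share * w, h,
                              max_depth, depth + 1, False)
        else:
            rects += tree_map(child, x, y + offset * h, w, share * h,
                              max_depth, depth + 1, True)
        offset += share
    return rects


# Example tree: the root has children A (two leaves), B and C.
root = TreeNode("root", [
    TreeNode("A", [TreeNode("a1"), TreeNode("a2")]),
    TreeNode("B"),
    TreeNode("C"),
])
for rect in tree_map(root, 0, 0, 8, 4, max_depth=10):
    print(rect)
```

In this example the 8 by 4 area is first split side by side, with A receiving half of the width because it holds two of the four leaves, and the area for A is then split top to bottom between a1 and a2.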