1. Field of the Invention
The subject of this invention is information cartography. The new process transforms relational data for display as a map, projection, or three-dimensional shape bearing characteristics like those of a typical topographic diagram.
2. Description of the Related Art
The rendering of multidimensional data is currently performed by the statistical techniques of cluster analysis and multidimensional scaling (MDS). These aspects of the related art are discussed herein, along with an overview of the relational database technology that serves as a source for the data to be transformed.
Cluster analysis is an established method for the attribute-based classification of objects. Its purpose is to organize large volumes of data into meaningful groups. An example of the data that might be used in cluster analysis is shown in FIG. 1. This Figure depicts a number of large, publicly held companies along with some financial data from 1996. The decimal numbers have been truncated for convenience. While all of the data needed to compare the companies is shown in the table, time-consuming study of the numbers is required to answer questions such as which companies have similar financial structures, which are very different from each other, or which are larger or smaller. Cluster analysis seeks to answer these questions, concerning the relationships between the objects, by converting the printed numbers into a more meaningful graphical representation.
In cluster analysis, each column of data is considered to be a dimension, in the mathematical sense. Thus, for example, we could take the first three columns, Sales for the trailing 12 months, Gross Income for the Year, and Available Cash, and assign them to the X, Y, and Z axes of a three-dimensional plot. Each company would be represented by a single point on this plot. DuPont, for example, would be located at 43810, 18666, 1319. Even though we cannot visualize them, we can mathematically treat all of the other columns as additional dimensions. Each company, each object, is located at one particular point in this multidimensional attribute space.
Cluster analysis is concerned with the locations of the objects in attribute space. It creates clustering diagrams, called dendrograms, that show which objects are close together and which objects are farther from one another.
The process of cluster analysis begins by standardizing the data in each column. This step is optional, but in practice is almost always employed. As well described in the literature, each value is scaled relative to the average value in the column and divided by the standard deviation of the entire column. This reduces all of the data in all columns to a standard range of values, so no column of data gets an unfair influence on the final analysis. In the example above, the % SG&A, i.e., Selling, General, and Administrative Costs, to Sales data can have a maximum value of 99.9, whereas the Sales data can range into the millions. If the information were not standardized, the trivial magnitude of % SG&A to Sales would be swamped by the larger numbers.
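To illustrate, the standardization just described can be sketched as follows. The sample Sales figures and the function name `standardize` are hypothetical, not taken from FIG. 1; the sketch uses the population standard deviation, though the sample standard deviation is an equally common choice in the literature.

```python
import statistics

def standardize(column):
    """Scale each value relative to the column average and divide by
    the column's standard deviation, as described above."""
    mean = statistics.mean(column)
    sd = statistics.pstdev(column)  # population standard deviation
    return [(value - mean) / sd for value in column]

# Hypothetical Sales figures; after standardization the column has a
# mean of zero and a standard deviation of one, so it can no longer
# swamp a small-magnitude column such as % SG&A to Sales.
sales = [43810.0, 18666.0, 72000.0, 9100.0]
z = standardize(sales)
```

Applied to every column in turn, this places all attributes on a comparable scale before any distances are calculated.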
The current art includes other methods for standardization of data, each applicable to specific situations. The literature describes methods for dealing with proportional data, Boolean data, and outliers. The invention described herein works with all of these methods.
Next, if the user of the technique has so chosen, the data columns are given weights to deliberately control their influence on the final analysis. To a stock analyst, the % SG&A to Sales number is an important indicator of efficiency, but it is not as valuable in assessing a company as the Sales for the prior 12 months. Thus, the investigator defines a set of weights for the columns. Each piece of data is then multiplied by the weighting factor for its column. Some example weighting factors for evaluation of a publicly held company are shown on the bottom row of FIG. 1.
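The weighting step can be sketched in the same vein. The values and weights below are hypothetical, not the factors shown in FIG. 1.

```python
def apply_weights(row, weights):
    """Multiply each standardized attribute by its column weight so
    that more important attributes exert more influence on the
    distance calculations that follow."""
    return [value * weight for value, weight in zip(row, weights)]

# Hypothetical example: standardized Sales weighted three times as
# heavily as standardized % SG&A to Sales, reflecting the analyst's
# priorities.
weighted = apply_weights([1.5, -0.5], [3.0, 1.0])  # [4.5, -0.5]
```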
As described in the literature, the clustering of objects then begins with a calculation of the distances between each object and every other object. In this respect, the term distance often refers to the Euclidean distance, calculated as the square root of the sum of the squared differences between the values. This is the Pythagorean Theorem applied to multidimensional data. Other measures of distance can be applied equally well. The `city-block` distance, also known as the graph-theoretical distance, for example, is the sum of the distances that must be traveled along each dimensional axis to travel from one object to the other. Another measure is the angular separation between the objects, as viewed from the zero point.
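The three distance measures named above can be sketched directly; the function names are illustrative only.

```python
import math

def euclidean(a, b):
    """Square root of the sum of the squared differences: the
    Pythagorean Theorem applied to multidimensional data."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def city_block(a, b):
    """Sum of the distances traveled along each dimensional axis,
    also known as the graph-theoretical distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def angular_separation(a, b):
    """Angle between the objects as viewed from the zero point,
    in radians."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return math.acos(dot / (norm_a * norm_b))
```

For the two-dimensional points (0, 0) and (3, 4), for instance, the Euclidean distance is 5 while the city-block distance is 7.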
Once distances have been calculated, the first cluster is defined as the combination of the two nearest objects or neighbors. These two objects are then replaced by the group they form and combined or averaged attributes are calculated for the group.
There are several ways that this combination can be performed, the most typical being a weighted averaging of the values. Other typical ways are discussed in the literature and include unweighted averaging of the data. Another method does not combine the attributes at all, but simply keeps track of the ranges of their individual values. In this case the subsequent groupings are based upon the nearest, or sometimes furthest, objects in each group rather than average values.
The cluster analysis continues by combining the next closest objects, with the provision that the group that was just created is also available for clustering. Unlike the initial objects, which are represented as points in space as we described above, groups can be represented as an average point or a range of values. This leads to various linking rules for groups. These rules are discussed in the literature and include nearest neighbors, furthest neighbors, and others.
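The merging procedure of the preceding paragraphs can be sketched as a simple agglomeration loop. This sketch combines groups by unweighted averaging of their representative points; the weighted-averaging and range-keeping variants mentioned above would substitute different combination and linking rules. All names are illustrative.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points):
    """Repeatedly merge the two nearest groups until a single,
    universal group remains, recording each merge and the
    separation distance at which it occurred."""
    # Each group is (member indices, representative point).
    groups = [({i}, list(p)) for i, p in enumerate(points)]
    merges = []
    while len(groups) > 1:
        # Find the pair of groups with the smallest separation.
        best = None
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                d = euclidean(groups[i][1], groups[j][1])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        (members_i, point_i), (members_j, point_j) = groups[i], groups[j]
        merges.append((frozenset(members_i), frozenset(members_j), d))
        # Unweighted average of the two representative points.
        merged = (members_i | members_j,
                  [(x + y) / 2 for x, y in zip(point_i, point_j)])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)]
        groups.append(merged)
    return merges

# Two nearby points merge first; the distant third point joins last.
merges = agglomerate([(0, 0), (0, 1), (10, 10)])
```

The list of merges and their distances is exactly the information a dendrogram displays graphically.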
The analysis continues until all of the objects and groups are clustered into a single, universal group. Unlike many statistical processes, however, the appeal of cluster analysis lies not in the final result, but rather in the process used to arrive at it. The complete clustering process is shown on a special figure, known as a cluster diagram, or dendrogram. FIG. 2 illustrates an example dendrogram for the companies based upon the data in FIG. 1.
The dendrogram makes use of nested brackets to graphically illustrate the relationships between objects. The horizontal lines on the far left represent the objects in the analysis, in this case the publicly held companies. Vertical lines that attach two object lines are located at the distance, indicated by the scale at the top, that separates the objects. The other horizontal lines point to groups that are being incorporated into a new group, and their vertical attachment lines show the separation distance between the incorporated objects or groups. The object separation distance is a somewhat arbitrary scale, completely dependent upon the number of attributes analyzed and the weighting factors applied to each. This varies considerably from one model to another. Even with this limitation however, the human eye easily adapts to the interpretation of dendrograms. Small groups of closely related objects are easily discerned, as are specious groupings that fall out in the process of completing the analysis.
A dendrogram of a complex set of data is often full of revelations. In FIG. 2, for example, one can quickly discern financial relationships that can only be discovered through painstaking study of the data in FIG. 1. Mobil and British Petroleum have very similar financial data, while Ford and General Motors stand apart from the pack. Note that the dendrogram itself offers no information as to why these relationships exist. It shows no actual data.
The dendrogram is often used to define categories. To visualize this, imagine cutting the dendrogram in FIG. 2 with a vertical line corresponding to the distance of approximately 11. Any complete group falling just to the left of this line becomes a category; the objects in the group presumably have similar characteristics. If one performed this operation on FIG. 2, it would create several small groups, some individual companies, and one large group containing Texaco, Philip Morris, Chrysler, Sony, British Petroleum, Mobil, and others.
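Cutting the dendrogram at a chosen distance can be sketched by running the same kind of agglomeration but stopping once the smallest remaining separation exceeds the cutoff; every surviving group then becomes a category. The names and the cutoff value below are illustrative, and groups are again combined by unweighted averaging.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cut_clusters(points, cutoff):
    """Merge nearest groups, as in cluster analysis, but stop when
    the smallest separation distance exceeds the cutoff -- the
    equivalent of drawing a vertical line across the dendrogram."""
    clusters = [[i] for i in range(len(points))]
    centers = [list(p) for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = euclidean(centers[i], centers[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > cutoff:  # every remaining group becomes a category
            break
        clusters[i] = clusters[i] + clusters[j]
        centers[i] = [(x + y) / 2 for x, y in zip(centers[i], centers[j])]
        del clusters[j], centers[j]
    return clusters

# With a cutoff of 2, the two nearby points form one category and
# the distant point remains a category of its own.
categories = cut_clusters([(0, 0), (0, 1), (10, 10)], cutoff=2.0)
```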
Two other graphical representations are used in the current art to interpret a cluster analysis. The first is a dendrogram of the attributes used in the analysis. This dendrogram can also provide important insights into the data. For example, if we were clustering people based upon their appearance, such as hair color, height, and eye color, we might find a group with blue eyes whose ancestors came from Northern Europe. By looking at the dendrogram of characteristics we might note a close association between blonde hair and blue eyes and conclude that we had found a group representing the stereotypical `Aryan` race. However, the explanatory power of the attribute cluster is often limited by the use of weighting factors in the cluster analysis. When weights are not used, the relationships are easy to see; otherwise, a significant amount of study and interpretation is required to derive meaning from attribute dendrograms.
The final graphical representation used in the current art is called a discrete contour plot. This plot is a rectangular array of the initial data, reordered from top to bottom to correspond with the order of the object dendrogram, and reordered left to right to correspond to the order of the attribute dendrogram. The initial object data is usually replaced by a color scale, or this plot can be generated as a three dimensional surface representation where the heights indicate the magnitude of the data values. Along with the dendrograms, this plot supports exploratory browsing of the data and clusters. Meaningful interpretations can sometimes be made by looking for islands or bands of color. FIG. 3 shows a complete analysis using a gray scale for the data in FIG. 1.
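The reordering that produces a discrete contour plot can be sketched as a pair of index permutations. In practice the leaf orders would come from the object and attribute dendrograms; here they are supplied directly, and all names and values are hypothetical.

```python
def reorder(matrix, row_order, col_order):
    """Rearrange the raw data so that rows follow the object
    dendrogram's leaf order and columns follow the attribute
    dendrogram's leaf order."""
    return [[matrix[r][c] for c in col_order] for r in row_order]

data = [[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]]
# Hypothetical leaf orders taken from the two dendrograms.
plot = reorder(data, row_order=[2, 0, 1], col_order=[1, 0, 2])
```

Each value in the reordered array would then be replaced by a color or gray-scale level, or rendered as a height on a three-dimensional surface.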
The gray scale contour plot of FIG. 3 helps to explain the object and attribute clustering that appears on its borders. The grouping of Ford and General Motors is readily understood, for example, by the large band of white space that they both share, covering the attributes of Gross Income, Cash, and Total Assets.
The invention described herein provides a method of displaying multidimensional data in a manner that depicts the actual spatial relationships between objects. The current art achieves a similar goal through a process called multidimensional scaling (MDS). This practice uses mathematical techniques to arrange points on a two or three dimensional plot, such that the points represent objects in space and their arrangement approximates the spatial relationships of the actual objects. The graphical representation scheme, method of creation, and usability of MDS and this invention are very different, however.
Cluster analysis as practiced in the current art is a mathematically robust technique for arranging objects and characteristics. It is very good at revealing complex multidimensional relationships to the human eye. It is less successful at its primary statistical purpose, which is the classification of the objects into a set of distinct groups. The classification problems are well documented in the literature, and often involve arbitrary divisions, empirically derived weighting factors, overlapping categories, and ad hoc abandonment of outlying data.
To illustrate these stubborn cluster analysis problems, imagine for a moment the Milky Way as viewed from the Earth. Cluster analysis could be applied to this view, where the objects are stars and their characteristics are X, Y, and Z dimensions in space or azimuth, altitude, and distance, to use Earth-centric measurements. In this case, however, we don't need cluster analysis because we can see the Milky Way. The relationships between stars are visually apparent. Yet we encounter difficult challenges if we want to group the stars into clusters. There are pairs of stars close together, triplets nearby, fuzzy clouds of stars embedded in constellations, pairs that only appear close together, long streams of stars containing smaller groups, et cetera. In short, there are groups at all sizes and scales, forming a continuum. Human perception recognizes groups, but these groups are defined based upon the purpose of the moment. Cluster analysis, lacking a purpose, cannot define a single all-encompassing grouping scheme.
These grouping problems only grow worse with increasing volumes of data. Today's relational database technology does not yet approach the millions of stars in the Milky Way, but it does support tables with hundreds of thousands of objects, each with dozens to hundreds of attributes. Information of this kind forms a continuum of objects and characteristics; it cannot be classified via a single model. Further, classification may obscure important relationships that would pique an investigator's interest. The very fact that a continuum, or spectrum, of values relates one group to another may be an important insight.
In FIG. 2 we elected to cut the cluster at a distance of 11, yet the selection of that particular distance is highly problematic. The literature describes a number of heuristic methods for selecting a cutting distance, but none are satisfactory in all, or even most, cases.
Thus, the visualization techniques of cluster analysis are its most important contribution, and the deterministic categorization techniques are less valuable. The current invention extends these visualization capabilities of cluster analysis, while eschewing its problematic categorization methods.
Multidimensional scaling (MDS) provides a method of locating objects on a flat or three-dimensional plot, where the arrangement of the objects shows the associations, proximity, and geometric clouds or manifolds formed by the objects in multidimensional space. This technique, like the invention described herein, suffers from the distortion generated when higher dimensions are reduced to two or three. In practice, MDS has little value when more than 50 objects of six or more dimensions are displayed, because the plot becomes a confusing mass of points and object labels. Exploration of such a plot through computer interactive techniques--including rotation of the plot, projection onto a plane, and slicing at different levels and angles--helps comprehension, but only to a very limited extent. The methods of MDS cannot cope with hundreds of objects and dimensions, nor do they simultaneously display the attributes of each object as the invention described herein does.
The raw data for a cluster analysis is often provided by a relational database. The term relational has two separate meanings in this context: 1) It describes the relationship between the objects in a table and the characteristics of the objects. For example, as with FIG. 1, a table may contain companies as its objects, and financial characteristics of the companies as its fields, i.e., gross income, investment in R&D, inventory, long term debt, etc. The relationship here is that each object `owns` the characteristics in its record. 2) The term relational also defines the connections between the tables in a database. For example, one table may contain industry sectors, such as energy, technology, and medical instruments, along with the companies in each sector, while another may contain companies and their financial characteristics. These tables have a parent (sector) to child (company) relationship.
As just described, the fields or columns of a table are related to the object defined in the first column, and the tables of the database are related to one another. Significantly missing from the relational scheme is a way to relate the records or objects in a table to one another. In the current art, records or rows are typically inserted or returned via query in arbitrary or random order, or ordered by the time they were initially created. A meaningful order is often applied to the records at the time of the query, usually involving alphabetic or numeric sorting of the objects or the values in one or more fields. The information in FIG. 1 is a good example. It is sorted in ascending order based upon the Sales in the prior 12 months. No other relationships are reflected in the order of the companies or their financial data.
While sorting is informative and useful for very specific purposes, this method of relating records to one another is primitive compared to a complete analysis, such as cluster analysis, that simultaneously considers all of the object characteristics.
A general purpose relationship between the records can only be achieved with a multivariate approach. Such methods--including cluster analysis, discriminant analysis, analysis of variance, and multidimensional scaling--are regularly used in statistical fields, but are not yet a standard part of relational database technology.
The problem of understanding the content of large tables of data grows more significant as technology provides methods to quickly insert and update thousands of records. Large quantities of untapped knowledge are hidden in these tables simply because querying, sorting, reporting, and browsing techniques do not satisfy current needs. The current art obscures the meaningful content through a variety of practices, including the use of abbreviated attribute names and little or no documentation of the legal range of values of the characteristics. Often, only the database designer has sufficient knowledge of the structure and content of a relational database to construct a proper query. Those with the most to gain from study of the information in a database, e.g., administrators, scientists, and managers, are often unable to explore it in a meaningful way.