1. Field of the Invention
The present invention relates generally to data analysis and, more particularly, to a computer-implemented method for analyzing multivariate data comprising a plurality of samples each having a plurality of measurement variables.
2. Description of the Background
Many technical fields require complex analyses of large datasets, including multivariate datasets (those involving a large number of measured variables). Often the goal of such analyses is to identify hidden structures or relationships among the measured samples of the measurement variables. Where the datasets are extremely large, finding hidden structures and/or relationships may take excessive time on existing computer hardware, or may not be possible at all due to the limited hardware resources of conventional computers.
There are different approaches for performing analysis and computation on numbers and other datasets. Arguably the largest and most pervasive approach is that of the axis-based virtual coordinate assignment protocol. This comprises a data storage table linked to a means of interrelating the data for visualization and computation (e.g., a scatter plot), even if the coordinate framework is implicit. Coordinate-based systems use tables to store data, and axis-based constructs defined by scales are the representations of the data tables that show relationships within the data. The axis is thus the intermediary that interrelates data, and this permits data analysis and computation on the data. Every datum is related indirectly to other data via a relationship established with an axis that has an established distance metric. As such, the axis is a device, and axes and dimensions do not necessarily represent any physical or natural manifestation of distance when used with variables that lack distance values, for instance temperature. The relationship between an axis and the axial delineations representing different lengths or values can be chosen to be linear or non-linear, and the numbers themselves can be integer, real or complex. The simplest case is a single column of data quantifying measurements of a single variable, which is then displayed as a diagram with a single axis and a scale, a one-dimensional representation like a timeline. Two-dimensional orthogonal axes were developed to apply to geometry, and were broadened with the representation of space as a three-dimensional manifold described with a coordinate system using x, y, z notation or polar coordinates. The geometrical system has been adapted so that any variable can be represented by an axis representing a dimension, whether or not it represents spatial information.
This system has been expanded by using more than three dimensions to encompass and interrelate larger numbers of variables that are usually considered orthogonal but that may be correlated to varying degrees. The data table consisting of columns of variables and rows of values can be represented, for instance, as a scatter plot. In practice, this plot reinforces the notion that data occur on a continuous manifold where each datum is positioned with respect to each of the coordinate axes, so that the data are related to each other indirectly, via the axes, by a distance metric. There are major advantages to this. The basis for storage is the most compact because n data instances can be stored in a table whose size is on the order of n. The coordinate system joins data by proximity based on metrics. However, there are also limitations. The human ability to visualize is limited to three dimensions, but additional dimensions beyond three may be necessary to accommodate the number of variables involved in, for instance, many dynamical processes (e.g., fluid flow). Visualization beyond three dimensions is not intuitive. Compression of dimensions is the process of reducing the number of dimensions by taking advantage of redundant or correlated variables that add no significant information content. Unfortunately, compression based on statistics and functions often loses or distorts information.
The second major limitation of the axis-based virtual coordinate assignment protocol is the use of an axis as an intermediary to relate data. This enables relative position and distance measurements to be made relative to the axis. Usually this involves a geometric functional relationship such as the Pythagorean theorem, in which x² + y² = z². For path-dependent calculations, this can be computationally problematic. Uncertainty in relating data must be accounted for in terms of accuracy and precision in relation to the axes. Heteroscedasticity, in which the variability of the data is not uniform across the data space, is another issue, often accompanied by non-linear behavior, especially in high-dimensional data sets. High-dimensional data sets are inherently sparse, but smooth axis-based systems require dense data and often impractical levels of data collection to achieve statistically valid or useful interpolation or prediction. Each datum must contain information related to each axis to provide a position on the manifold. Missing or erroneous data attributes are not tolerated well by these constructs. For instance, if a datum involves three attributes (e.g., values of x, y and z), and the value of the z attribute is erroneously missing or differs from the true value, the spatial position of the point in a scatter plot could be at significant variance with the true position.
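By way of illustration only, the sensitivity of an axis-based distance to a single erroneous attribute can be sketched as follows. The function name and the sample values are hypothetical and are not taken from any cited reference; the sketch simply applies the Pythagorean relationship in three dimensions.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points given as equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical datum with three attributes (x, y, z), measured from an origin.
origin = (0.0, 0.0, 0.0)
true_point = (3.0, 4.0, 12.0)

# The same datum with the z attribute erroneously recorded as zero.
corrupted_point = (3.0, 4.0, 0.0)

true_distance = euclidean_distance(origin, true_point)
corrupted_distance = euclidean_distance(origin, corrupted_point)

print(true_distance, corrupted_distance)
```

In this hypothetical case a single faulty attribute moves the datum from a distance of 13 units to a distance of 5 units from the origin, even though the x and y attributes are correct.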
Stemming conceptually from the use of axes is the application of regression-based statistical processes to relate data for analysis and prediction. At its simplest, this maps the data to a line, curve or surface in the data space. Large amounts of data are often necessary for statistical validity, but large sets are usually accompanied by noise and errors, generally ascribed to accuracy and precision with respect to the measurement axis. Because of this distortion, statistical performance can be negatively impacted by the uncertainty introduced between the statistical model and the data. Data cleansing (removing undesirable data) and appending data can be challenging because the approaches used by regression require significant re-calculation. This is because regression usually involves evaluating every datum with respect to an aggregate of the whole data set (e.g., a mean value).
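The dependence of regression on an aggregate of the whole data set can be sketched, under stated assumptions, with the textbook least-squares fit of a straight line. The function name and data below are hypothetical. In this form every datum is evaluated against the mean values, so appending one sample changes the means and every deviation is computed again.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Every datum is evaluated relative to the means of the whole set.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data lying exactly on the line y = x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.0, 1.0, 2.0, 3.0]
slope_before, intercept_before = fit_line(xs, ys)

# Appending a single noisy sample forces a full re-fit over all data.
slope_after, intercept_after = fit_line(xs + [4.0], ys + [10.0])

print(slope_before, intercept_before)
print(slope_after, intercept_after)
```

Here one appended sample shifts both fitted parameters, which illustrates why cleansing or appending data in a regression setting generally means revisiting the entire set.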
The application of functions to represent compactly the behavior of data on manifolds is also problematic. The same heteroscedasticity, uncertainty, non-linearity, and non-continuity of many real systems present problems for applying functions. Many real systems exhibit path dependency that results in, for instance, chaotic behaviors leading to bifurcation (two potential outputs for a given input), which is not conducive to functional description. Functions can be developed that are accurate over only small regions of the problem space. Other functions can be developed that require integration, differentiation or other complex methods in order to generate predictions, but such functions are too complex or impossible to solve without approximations or possibly invalid assumptions.
Another problem is that algorithms applied to data in this form operate inefficiently on large data sets. Search routines to find, for example, a global maximum must evaluate all of the data instances individually to distinguish local maxima from the global one. For large data sets, this becomes computationally challenging.
A second major approach to data analysis, distinct from the coordinate-based approach, is graph theory, which has become an indispensable tool in studying complex datasets; a graph system can exist that is an analog to the coordinate geometry system for performing analysis and computation. Graphs have potentially near-universal applicability to data analysis. Washio, Takashi and Hiroshi Motoda, State of the Art of Graph-based Data Mining, SIGKDD Explorations, 5:59-68 (2003). Ordinary graphs are the predominant type, but bipartite graphs have been shown to be more robust as a description of real entities. A bipartite graph or “bigraph” is a graph whose vertices are decomposed into two disjoint sets such that no two vertices within the same set are adjacent. The multivariate approach to generating the bipartite graph from an attribute table is detailed in De Leeuw, Jan and Michailidis, George, Data Visualization Through Graph Drawing, Comput. Statist., Vol. 16, pp. 435-450 (2001). Bipartite graphs (or bipartite matrices) offer a means of representing information for analysis, but they are not particularly intuitive for human viewing because of the missing distance metric. Large numbers of correspondences (links between the disjoint sets) can make evaluating relationships within data difficult, and statistical analysis is generally simpler when performed on ordinary graphs.
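As a minimal sketch only, one common way of forming a bipartite graph from an attribute table treats each sample as a vertex of the first set and each (variable, value) pair as a vertex of the second set. This encoding, the function name, and the table contents are assumptions made for illustration and are not asserted to be the exact construction of the cited reference.

```python
def table_to_bipartite(table):
    """Return the edge set {(sample, (variable, value))} of a bipartite graph."""
    edges = set()
    for sample, attributes in table.items():
        for variable, value in attributes.items():
            # Each edge is a correspondence between a sample and an attribute value.
            edges.add((sample, (variable, value)))
    return edges

# Hypothetical attribute table: rows are samples, columns are variables.
table = {
    "s1": {"color": "red", "size": "small"},
    "s2": {"color": "red", "size": "large"},
    "s3": {"color": "blue", "size": "large"},
}

edges = table_to_bipartite(table)
samples = {sample for sample, _ in edges}
attribute_nodes = {attribute for _, attribute in edges}

print(sorted(edges))
```

Every edge joins a sample vertex to an attribute vertex, so no two vertices within the same set are adjacent, and no distance between samples is implied by the structure itself.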
Bipartite matrices and bigraphs can be converted to an ordinary graph by “mode reduction,” in which the nodes (also known as “vertices”) of one mode become the nodes of the ordinary graph. Shared correspondences occur when multiple objects in the first disjoint set share attributes in the second disjoint set. Shared correspondences are used as the basis for links or “edges” within the ordinary graph. An ordinary graph is a visual representation of an adjacency matrix. Again, the distance between nodes of an ordinary graph, as with a bipartite graph, does not represent a distance metric of the kind established with coordinate geometrical techniques. Links or edges represent relationships that can be directed, weighted, or unweighted. However, there is a general problem with mode reduction in that the correspondences are either too dense and too numerous to manage, or too sparse and fragmented, which results in a graph that is not visually appealing, too difficult to render, or too big to manage. Various approaches to reducing dense graphs have been applied, including filtering links randomly or limiting the degree (the number of links sharing a common node) of nodes within the graph. This risks losing information and distorting the graph as well as any subsequent statistical assessment of it. Furthermore, techniques for mode reduction of multivariate bipartite graphs have not been established that enable edges to represent different variables with distance metrics. Thus, ordinary graphs have been considered poor alternatives for managing multiple variables and multivariate data.
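Mode reduction as described above can be sketched, under stated assumptions, as a projection onto the sample vertices in which two samples are linked whenever they share an attribute vertex. The function name, the string labels, and the choice to weight each edge by the number of shared correspondences are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def mode_reduce(bipartite_edges):
    """Project a bipartite edge list onto its sample vertices.

    Returns {(sample_a, sample_b): number_of_shared_attribute_vertices}.
    """
    members = defaultdict(set)
    for sample, attribute in bipartite_edges:
        members[attribute].add(sample)

    weights = defaultdict(int)
    for sharing_samples in members.values():
        # Every pair of samples sharing this attribute gains one correspondence.
        for a, b in combinations(sorted(sharing_samples), 2):
            weights[(a, b)] += 1
    return dict(weights)

# Hypothetical bipartite edges: (sample, attribute value).
bipartite_edges = [
    ("s1", "color=red"), ("s1", "size=small"),
    ("s2", "color=red"), ("s2", "size=large"),
    ("s3", "color=blue"), ("s3", "size=large"),
]

projected = mode_reduce(bipartite_edges)
print(projected)
```

An attribute shared by k samples contributes k(k-1)/2 edges to the projection, which is one way the reduced graph becomes too dense, while attributes unique to a single sample contribute none, which is one way it becomes fragmented.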
Ordinary graphs containing multivariate components are sometimes placed in a statistical coordinate system and converted to a spatial representation (in 2 or 3 dimensions) through a statistical compression algorithm such as Principal Component Analysis to achieve an axis-based distance metric between data, with the attendant distortion and loss of information. The major problem with ordinary graphs is the concept of distance. Two nodes not directly inter-linked or joined by a common edge are related in terms of a quasi-distance by the minimum number of links or the average number of hops, but this can be complicated by directed edges or edge weighting. Furthermore, this path dependency might involve evaluating every possible path, or some statistically valid number of them, to establish the shortest path. This can become computationally intractable for large data sets. No system of applying a physical distance inherent within a data set, analogous to that of coordinate systems, has been devised without some sort of statistical compromise as described above.
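The hop-count quasi-distance can be sketched for the simplest case, an unweighted and undirected graph, using a breadth-first search. The function name and the adjacency data are hypothetical, and the sketch does not address the directed or weighted edges noted above, which call for other shortest-path procedures.

```python
from collections import deque

def hop_distance(adjacency, source, target):
    """Minimum number of edges between source and target, or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor == target:
                return hops + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None

# Hypothetical undirected graph: a chain a-b-c-d plus an isolated node e.
adjacency = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d"],
    "d": ["c"],
    "e": [],
}

print(hop_distance(adjacency, "a", "d"))
print(hop_distance(adjacency, "a", "e"))
```

The result depends entirely on which paths exist between the two nodes. It counts links rather than measuring any physical quantity, and it is undefined for nodes in separate fragments of the graph.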
The concept of all-to-all weighted graphs representing relative distances between all nodes, which would enable such distance metrics to be applied, has been considered, but as mentioned this has remained computationally impossible for any but relatively small data sets. The simultaneous linkage of every node to every other node becomes computationally challenging for large sets of nodes because the number of required relationships increases in proportion to the square of the number of vertices. The calculations to determine each edge distance require an exponentially growing set of measurements. As mentioned above, each distance would require the measurement of every possible series of pathways to establish a minimum path length. The visualization of such a graph would be unappealing for large data sets because of the clutter of so many relationships. Navigation and statistical analysis would be excessively challenging. Dealing with more than one variable would be problematic because of the potential for differing distance metrics and weighting, which would require blending or some sort of statistical filtering.
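The quadratic growth noted above follows from the fact that an all-to-all graph on n nodes has n(n-1)/2 undirected edges. The short sketch below, with a hypothetical function name and arbitrarily chosen node counts, makes the scale concrete.

```python
def complete_graph_edge_count(n):
    """Number of undirected edges needed to link every one of n nodes to every other."""
    return n * (n - 1) // 2

# Arbitrary illustrative node counts.
counts = {n: complete_graph_edge_count(n) for n in (10, 1_000, 1_000_000)}

for n, edges in counts.items():
    print(f"{n} nodes -> {edges} edges")
```

Ten nodes require 45 edges, whereas one million nodes require roughly 5 × 10^11 edges, before any per-edge distance calculation is performed.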
Two major limitations have hindered the development of a graph theory-based analog to coordinate geometry. First, a satisfactory distance metric that is not path dependent, analogous to that in coordinate geometry, has not been established. Second, there has been no means of handling more than a few variables within the same type of ordinary graph on which distances must be evaluated. As a result of these shortcomings, graphs have not been used as an alternative to coordinate geometry to perform computation. The current invention is a graph analytical process that addresses these shortcomings.