The present invention relates generally to the analysis of data. More particularly, the present invention pertains to data mining of multi-dimensional data such as biological based data, e.g., flow cytometric data, microarray data, cell sorting data, imaging data, etc.
Analysis of multi-dimensional data presents various problems. For example, when using visualization to analyze such data, the viewing of multi-dimensional data as combinations of lower dimensional views (e.g., histograms and bivariate displays) results in the loss of information in an uncontrollable manner. Such loss of information frequently leads to loss of the pertinent information necessary to accomplish desired goals, e.g., isolation of cell subpopulations when analyzing flow cytometry data, loss of resolution in imaging applications, etc. For example, valuable information is lost when bivariate displays are reduced to a pair of uncorrelated histograms (e.g., representing data whose correlations between parameters are unknown).
Such loss of valuable information also occurs whenever multi-dimensional data is reduced to a collection of bivariate displays in an attempt to provide human visualization of higher dimensional data space. For example, when a 3D object (or data object) is projected down as a projection “shadow” onto a 2D plane, the shape of this projection shadow depends on the projection angle and may grossly distort the actual shape of the 3D object (or data object). While less immediately evident, the reduction of higher dimensional data space to 3D views, as is performed with various 3D renderings, is subject to the same problem, namely loss of information about the shape of the data objects in higher dimensional data space. Hence, while some reduction of dimensionality may be required to produce human visualizable 2-D and 3-D displays, the conventional process of displaying all possible combinations of bivariate displays only provides one human visualizable viewpoint and is subject to distortions which range from slight to major distortions, depending on how the data is projected down onto a lower dimensional surface (e.g., plane).
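The dependence of a projection on the chosen viewing direction can be illustrated with a minimal sketch. The two point groups, the function names, and the choice of simple axis-dropping projections below are illustrative assumptions only and are not taken from any particular analysis system.

```python
import math

def project(point, keep_axes):
    """Orthogonal projection that keeps only the listed coordinate axes."""
    return tuple(point[i] for i in keep_axes)

def centroid(points):
    """Mean position of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def centroid_separation(group_a, group_b, keep_axes):
    """Distance between the two group centroids after projection."""
    a = centroid([project(p, keep_axes) for p in group_a])
    b = centroid([project(p, keep_axes) for p in group_b])
    return math.dist(a, b)

# Hypothetical data: two groups that differ only along the third axis.
group_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
group_b = [(0.0, 0.0, 5.0), (1.0, 0.0, 5.0), (0.0, 1.0, 5.0), (1.0, 1.0, 5.0)]

sep_xy = centroid_separation(group_a, group_b, (0, 1))  # view along the third axis
sep_xz = centroid_separation(group_a, group_b, (0, 2))  # view along the second axis
print(sep_xy, sep_xz)
```

In this sketch the two groups coincide in the first bivariate view and are clearly separated in the second, even though the underlying 3D data is unchanged.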
Multivariate statistics offer various data analysis alternatives (e.g., principal component analysis (PCA) or discriminant function analysis (DFA)) which give additional viewpoints and projections of multi-dimensional data. For example, PCA and DFA, along with interactive 3-D displays, have been used to select projection angles that produce less distortion than simple collections of bivariate views of the data, and thereby to provide optimal analysis parameters for sorting gates in flow cytometry systems. Such interactive 3-D visualization can be used to provide for conventional cluster analysis, e.g., whereby data clusters can be identified in multi-dimensional data space and intelligent starting values for the clusters can be provided as inputs for a guided cluster analysis. However, such data analysis frequently requires human visualization of multi-dimensional data in order to determine the appropriate number and/or location of clusters, which can be difficult even with the aid of multivariate statistical methods such as DFA and PCA. Ultimately, there is a need for a data analysis method which requires neither human visualization methods nor direct human supervision of multi-dimensional data analysis.
In addition to a need for improved data analysis, there is a need for ways of comparing multi-dimensional data sets (e.g., test and control multi-dimensional data sets) and providing knowledge discovery which can point out the important differences or similarities between the two or more multi-dimensional data sets without direct human intervention. Such methods of knowledge discovery can be referred to generally as data mining. In other words, data mining generally refers to any set of methods or apparatus that look for meaning in data as opposed to merely manipulating or performing computations on the data. Meaning in data basically refers to anything that helps to guide one skilled in the art to what is important or significant about the data, or suggest an approach for further analysis of the data.
Conventional clustering is one type of data mining method. Conventional clustering usually involves computation of distances between data points in multi-dimensional space. Assignment of a given data point to a cluster is usually done by a least squares, or other, minimization method (e.g., minimization of distances from the centers of clusters) which either ignores correlations between the parameters of a data point (e.g., such as in K-means clustering) or tries to take these correlations into account (e.g., such as in Mahalanobis clustering). Such procedures, when dealing with multi-dimensional data, are very computationally intensive. For example, it is computationally challenging to cluster multi-parameter flow cytometric data including more than about 50,000 cells. As such, this makes cluster analysis difficult or even impossible to perform on rare cell subpopulations (i.e., as used herein rare subpopulations are present at a percentage in the range of about 1.0% to 0.0001%) which require data sets containing many millions of cells (e.g., data points) to obtain enough rare cell data. Necessarily, the rare cells still tend to be sparse in their distribution in higher dimensional data space, one of the features of the so-called “tyranny of dimensionality” further described below.
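The difference between a distance that ignores parameter correlations and one that accounts for them can be sketched as follows. This is a minimal two-parameter illustration only; the cluster center, the covariance values, and the function names are assumptions chosen for the example and do not represent any particular clustering package.

```python
import math

def euclidean(point, center):
    """Distance that ignores correlations (the metric used in K-means)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(point, center)))

def mahalanobis_2d(point, center, cov):
    """Distance that accounts for correlations via a 2x2 covariance matrix.

    cov is [[sxx, sxy], [sxy, syy]] and must be invertible.
    """
    dx, dy = point[0] - center[0], point[1] - center[1]
    sxx, sxy, syy = cov[0][0], cov[0][1], cov[1][1]
    det = sxx * syy - sxy * sxy
    # Elements of the inverse covariance matrix.
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    return math.sqrt(dx * dx * ixx + 2.0 * dx * dy * ixy + dy * dy * iyy)

# Hypothetical cluster whose two parameters are strongly correlated.
center = (0.0, 0.0)
cov = [[1.0, 0.9], [0.9, 1.0]]

along = (1.0, 1.0)    # lies along the correlation direction
across = (1.0, -1.0)  # lies across the correlation direction

print(euclidean(along, center), euclidean(across, center))
print(mahalanobis_2d(along, center, cov), mahalanobis_2d(across, center, cov))
```

Under these assumed values the two points are equally far from the center in the Euclidean sense, whereas the correlation-aware distance treats the point lying across the correlation direction as much farther away.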
Data mining is of particular interest in flow cytometry/cell sorting applications. For example, the general problem of cell sorting is to identify cell subpopulations of interest and to isolate them. Typically, the usual technique is to display various histograms and bivariate views of the full cytometric data obtained from running a sample of cells through a flow cytometer. One skilled in the art then uses human pattern recognition to identify cell subpopulations of interest. Through interactive 2-D displays, various regions are manually set by the one skilled in the art. These regions can then be combined through appropriate Boolean logic to set a gate which is then used as a sort decision boundary in a cell sorting process.
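The combination of manually set regions into a sort gate through Boolean logic can be sketched as follows. The parameter names (FSC, SSC, FL1, FL2), the rectangular region shape, the region limits, and the particular Boolean expression are all hypothetical and serve only to illustrate the general technique.

```python
def in_region(event, region):
    """True if the event falls inside a rectangular region on one bivariate view."""
    x_name, y_name, x_range, y_range = region
    return (x_range[0] <= event[x_name] <= x_range[1]
            and y_range[0] <= event[y_name] <= y_range[1])

def sort_gate(event, r1, r2):
    """Assumed gate: inside region R1 AND NOT inside region R2."""
    return in_region(event, r1) and not in_region(event, r2)

# Hypothetical regions drawn on two different bivariate displays.
R1 = ("FSC", "SSC", (200, 600), (100, 400))
R2 = ("FL1", "FL2", (0, 150), (0, 150))

events = [
    {"FSC": 300, "SSC": 200, "FL1": 500, "FL2": 40},  # inside R1, outside R2
    {"FSC": 300, "SSC": 200, "FL1": 100, "FL2": 40},  # inside R1 and inside R2
    {"FSC": 900, "SSC": 200, "FL1": 500, "FL2": 40},  # outside R1
]

sorted_events = [e for e in events if sort_gate(e, R1, R2)]
print(len(sorted_events))
```

Only the first event satisfies the assumed gate. The sort decision depends entirely on where the operator chose to place the region limits.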
Under simple circumstances, this procedure is adequate to sort the cells of interest. However, the method suffers from various shortcomings. First, in order to provide human visualization on a 2-D computer system, the multi-dimensional data is reduced to simple histograms and bivariate displays as described previously herein. Higher dimensional data is projected down onto these 2-D displays, losing valuable information about the cell subpopulations that was available, but not readily visualizable in the original multi-dimensional data space. Second, the choice of regions and sort gates is usually arbitrary and subject to individual experimenter expertise and prejudices.
Many flow cytometry applications require the measurement and sorting of small subpopulations of cells (or other objects/particles) from large populations. These applications require a multi-dimensional/high resolution flow cytometry experiment and often produce data sets in which the data do not always cluster into discrete subpopulations amenable to visual pattern recognition or to conventional cluster analysis techniques.
Flow cytometric data, e.g., both flow and image cytometric data, include a large number of sets of measurements taken on single cells. This multi-parameter data on populations of single cells defines a multi-dimensional data space where there tend to be some natural groupings of multi-dimensional data points which mirror the similarities of cells belonging to particular cell subpopulations. While the multi-dimensional data points do show some grouping, the probability for all of the measurements being identical for even two cells within the measurement resolution (e.g., 8-bit to 12-bit resolution for each measurement) is very small. Hence, any groupings depend on some index, typically a multi-dimensional distance metric, to classify two cells within a similar group or cluster.
Over the years, classification of cells from rare and complex cell populations has utilized a wide variety of classifier systems such as multi-dimensional cluster analysis. However, methods such as conventional cluster analysis, 3-D visualization techniques, guided cluster analysis, and fuzzy clustering routines are computationally intensive as described above.
In many flow cytometric data sets, the data do not cluster into discrete subpopulations that can easily be identified by human pattern recognition. To satisfy both the demands for very large data sets and amorphous data structure, rapid pre-classification techniques have been used.
Although cluster analysis methods have been developed over the years to perform this grouping of cells into cell subpopulations, a more complex problem arises when comparing clusters between different data files. While clustering data points within one data file is certainly of interest, comparison of data between samples (e.g., treated versus untreated controls, cancer versus normal cells, etc.) is of even greater importance.
Generally, what one skilled in the art would like to know is what is different between two cell samples. That question necessitates comparing the clusters within two separate data files. Since cluster positions can vary slightly, which is a common problem encountered in this situation, measures of similarity or dissimilarity need to take such variation into account. Due to the sparseness of data points in multi-dimensional data space, a much more complex analysis than a simple difference spectrum is required for comparison of the multi-dimensional data sets. This is particularly the case since data points in different data sets are typically only similar, and not identical.
In other words, in general, an illustrative problem when dealing with flow cytometric data is to take two data sets, the first with cell population “A” and the second with the populations “A and B,” analyze such data sets, and then produce the data set with population “B” only. Clearly, the way to obtain “B” only is to subtract the “A” only data set from the “A and B” data set.
Such subtraction has previously been accomplished by others at the single parameter histogram level. However, extension of this method from single parameter histogram data to multi-dimensional data is complicated for various reasons. Such complications arise because the data is both multi-dimensional and high resolution. Attempting to deal with the data as a multi-dimensional array leads to severe storage and computational problems.
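Subtraction at the single parameter histogram level can be sketched as a bin-to-identical-bin operation. The sketch below is an assumption about the general form of such a subtraction; the function name, the optional scale factor for unequal event counts, the clipping of negative differences to zero, and the bin counts are all illustrative.

```python
def subtract_histograms(test, control, scale=1.0):
    """Bin-to-identical-bin subtraction of a control histogram from a test histogram.

    scale is an assumed normalization factor for unequal total event counts.
    Negative differences are clipped to zero.
    """
    if len(test) != len(control):
        raise ValueError("histograms must have the same number of bins")
    return [max(0.0, t - scale * c) for t, c in zip(test, control)]

# Hypothetical 4-bin histograms: "A and B" (test) and "A" only (control).
a_and_b = [10, 20, 5, 30]
a_only = [10, 18, 5, 2]

b_only = subtract_histograms(a_and_b, a_only)
print(b_only)
```

The operation requires one array element per bin in each histogram, which is why it is inexpensive for a single parameter.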
Traditionally, such situations dealing with multi-dimensional arrays have been called “the tyranny of dimensionality”. The rationale for this tyranny is evident. For example, simple numerical integration in “n” dimensions requires k^n terms in the sum, where k is the number of division points on each axis (e.g., bit resolution in digitized data); integration must be numerically performed to compute probabilities. For example, if the dimension of the data set (i.e., the number of variables of a multi-dimensional data set being measured) is 10, and each axis is divided into 1,000 intervals, then the number of terms is 1,000^10 (i.e., 10^30). As can be seen, this is a very large number of array elements.
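The growth in the number of terms can be checked with a short calculation. The function name below is illustrative; the values of k and n are those of the example given above.

```python
def integration_terms(k, n):
    """Number of terms in a simple grid sum over n axes with k divisions each."""
    return k ** n

terms = integration_terms(1000, 10)
print(terms == 10 ** 30)

# Growth with dimension for k = 1,000 divisions per axis.
for n in (1, 2, 3, 5, 10):
    print(n, integration_terms(1000, n))
```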
Typically, when dealing with flow cytometric data, this tyranny of dimensionality has been addressed by reducing the resolution of the data. For example, multi-parameter 10-bit data (i.e., 1,024 channels) is typically expressed as a 6-bit×6-bit (i.e., 64×64) channel scattergram. For three-dimensional data displays, the data resolution is typically reduced to 5-bit (i.e., 32×32×32) channels. For two-dimensional data display/analysis of 10-bit or 12-bit data, a large loss of resolution of the original data may be acceptable. However, for higher dimensional data display/analysis, a continued reduction in resolution will start squashing the data into such large channel bins that all resolution of different subpopulations is lost, effectively merging all of the data into a few meaningless clusters of data.
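One common way to perform such a resolution reduction on digitized data is to discard the low-order bits of each measurement. The sketch below assumes this bit-shift approach; the function names and the two example events are hypothetical.

```python
def reduce_resolution(value, from_bits, to_bits):
    """Map a digitized value to a coarser channel by dropping low-order bits."""
    if to_bits > from_bits:
        raise ValueError("to_bits must not exceed from_bits")
    return value >> (from_bits - to_bits)

def bin_event(event, from_bits, to_bits):
    """Apply the same resolution reduction to every parameter of one event."""
    return tuple(reduce_resolution(v, from_bits, to_bits) for v in event)

# Two hypothetical 10-bit events that are distinct at full resolution.
event_a = (512, 300)
event_b = (527, 303)

bin_a = bin_event(event_a, 10, 6)  # 6-bit x 6-bit scattergram channel
bin_b = bin_event(event_b, 10, 6)
print(bin_a, bin_b)
```

At 6-bit resolution each channel spans 16 of the original 10-bit values, so the two events above fall in the same scattergram bin and can no longer be distinguished.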
In other words, the size of the data arrays as described above expands exponentially with dimensionality. As such, a single parameter histogram at 10-bit resolution generates a data set of only about 1,000 (i.e., 1,024) array elements, a three-parameter measurement at 10-bit resolution requires about 10^9 elements, and a five-parameter measurement at 10-bit resolution requires about 10^15 elements. Thus, array element subtraction of the multi-dimensional data sets becomes problematic in terms of data storage and data processing cycle times.
In summary, although the subtraction problem has been addressed in simple bin to identical bin subtraction of histogram data, applying similar subtraction techniques to multi-dimensional data quickly becomes daunting due to both the exponential scaling of multi-dimensional data by the number of dimensions, and also due to the tyranny of dimensionality as described above whereby the data points become sparse and unique in multi-dimensional data space, such that no two points in similar clusters in two or more data sets are numerically identical.
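The sparseness of data points relative to the size of the full array can be illustrated with a small simulation. The sketch below assumes uniformly distributed random events, which real cytometric data would not follow; it is intended only to show the order of magnitude of the mismatch between the number of events and the number of array elements. The function name and event count are illustrative.

```python
import random

def occupancy(n_events, n_dims, bits, seed=0):
    """Count distinct occupied bins for simulated events at full resolution.

    Events are drawn uniformly at random (an assumption made for illustration).
    Returns (number of occupied bins, number of elements in the dense array).
    """
    rng = random.Random(seed)
    channels = 1 << bits  # e.g., 1,024 channels for 10-bit data
    occupied = set()
    for _ in range(n_events):
        occupied.add(tuple(rng.randrange(channels) for _ in range(n_dims)))
    return len(occupied), channels ** n_dims

occupied_bins, dense_elements = occupancy(n_events=10_000, n_dims=5, bits=10)
print(occupied_bins, dense_elements, occupied_bins / dense_elements)
```

Under this assumption nearly every event occupies a bin of its own, while the occupied bins make up a vanishingly small fraction of the dense five-parameter array.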
As described above, although the subtraction problem has been addressed in some multi-dimensional applications by reducing the resolution of each data set, reduction of the data array to a computationally manageable size sacrifices resolution to the point where in a multi-dimensional data experiment, the data is merged into few and meaningless clusters and the ability to detect a subpopulation of cells is lost. Further, conventional clustering algorithms are computationally intensive and thus do not solve the problem of retaining resolution while reducing computational intensity, e.g., a problem readily apparent in analysis of multi-dimensional flow cytometric data. In addition, because the current methods are computationally intensive, they are prevented from being employed on lower performance computer platforms.
No current methodology is known in the flow cytometry field which can preserve the information content of multi-dimensional high resolution flow cytometric data, and thus capitalize on its inherent ability to measure and sort cell populations while reducing computational intensity to a manageable size compatible with cytometry hardware and analysis/sort cycle times. Further, there is no methodology in use for other fields which operate on two or more multi-dimensional data sets (e.g., imaging data or microarray data for genomic or proteomic experimentation) which can provide such data analysis while reducing computational intensity.
Additionally, current genomic and proteomic multi-dimensional data contains mixed data types having different scaling factors (e.g., linear or log digitized micro-array spot intensities correlated to categorically identified human gene or protein names or signal transduction pathways). Conventional clustering algorithms have difficulty computing distances between mixed data types with different scaling factors.