Populations of sample data having multivariate distributions, for example those with no known normalized distribution shape, can be very hard to analyze. Flow cytometry allows multiple characteristics of thousands of cells to be measured simultaneously. This ability has made flow cytometry a prevalent instrument in both research and clinical settings. However, the full potential of this technology is limited by the lack of data analysis methodology and software that allow for an automated and objective analysis of the data generated by flow cytometry instruments. While the disclosed subject matter is explained generally in the context of analyzing flow cytometry data, as noted, its application is much broader than flow cytometry and includes any large distribution of multivariate data, such as census data, data regarding which stores and products attract shoppers at a shopping mall on a given day or in a given time period, and the like.
One important part of the analysis of flow cytometry data is gating, i.e., the identification of homogeneous subpopulations of cells. The current standard technique for this type of analysis is to draw 2D gates manually with a mouse on a computer screen, based on the user's interpretation of density contour lines that are provided by software tools such as FlowJo (at treestar.com) or BioConductor (Gentleman, R., et al. Genome Biology, 5:R80, 2004). The cells falling in this gate are extracted and the process is repeated for different 2D projections of the gated cells, thus resulting in a sequence of two-dimensional gates that describe subpopulations of the multivariate flow cytometry data.
There are several obvious problems with this kind of analysis. The analysis is subjective, since it is based on the user's interpretation and experience; it is also error-prone, difficult to reproduce, and time-consuming, and it does not scale to a high-throughput setting. Accordingly, manual gating has become a major limiting aspect of flow cytometry, and there is a widely recognized need for more advanced analysis techniques.
There have been several attempts to produce automatic and objective gates. These methods employ the k-means algorithm (Murphy, R. F., Cytometry, 6(4):302-309, 1985; Bakker, T. C., et al. Cytometry, 14(6):649-659, 1993; Demers, S., et al., Cytometry, 13(3):291-298, 1992) or mixture models with Gaussian components (Boedigheimer, M. J., and Ferbas, J., Cytometry A, 73(5):421-429, 2008) or with t components and a Box-Cox transformation (Lo, K., et al. Cytometry A, 73(4):321-322, 2008). A commonly shared limitation of these methods is that they necessarily produce convex subpopulations, whereas subpopulations occasionally occur that are not convex and are, for example, kidney shaped. Such subpopulations can arise, for example, when two markers are added sequentially, so that there is a developmental progression over time that moves the subpopulation first in one direction and then in another direction.
Density-Based Merging
Density-based merging (DBM) is grounded in nonparametric statistical theory, which allows for such subpopulations. DBM follows the paradigm that clusters of the data can be delineated by the contours of high-density regions; this is also the rationale that underlies manual gating. The paradigm is implemented algorithmically (Walther, G., et al., Advances in Bioinformatics, Vol. 2009, Article ID: 686759) by constructing a grid with associated weights that are derived by binning the data. This grid allows a density estimate to be computed quickly via the Fast Fourier Transform, and it provides an economical but flexible representation of clusters. Each high-density region is modeled by a collection of grid points (Walther, G., et al., Advances in Bioinformatics, Vol. 2009, Article ID: 686759). This collection is determined in two steps: 1) links are established between certain neighboring grid points based on statistical decisions regarding the gradient of the density estimate. The aim is to connect neighboring grid points by a chain of links that follow the density surface “uphill.” The result is a number of chains that link certain grid points and that either terminate at the mode of a cluster or represent background that will not be assigned to a cluster; 2) the algorithm then combines some of these chains if statistical procedures indicate that they represent the same cluster. The algorithm thus yields clusters that are represented by chains that link certain grid points. This representation provides an efficient data structure for visualizing and extracting the cells that belong to a cluster (Walther, G., et al., Advances in Bioinformatics, Vol. 2009, Article ID: 686759). The chains that link grid points in a cluster form a tree structure, which can be traversed backwards to efficiently enumerate all grid points in the cluster and hence to retrieve all cells in the cluster via their nearest neighbor grid point.
A software implementation of DBM can allow for automatic 2D gating that is based on statistical theory and can provide the information necessary to decide on the number of populations in the sample (Walther, G., et al., Advances in Bioinformatics, Vol. 2009, Article ID: 686759).
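The grid-based portion of the DBM approach described above can be illustrated with a simplified sketch. The sketch bins 2D data on a grid, smooths the bin counts with a Gaussian kernel via the Fast Fourier Transform, links each grid point to its highest-density neighbor, and follows the resulting chains "uphill" to a mode. It is not the published implementation: the function names, the Gaussian kernel, the `bandwidth` and `min_density` parameters, and the fixed density threshold used for background are assumptions made for illustration. The statistical decisions that govern link creation and the merging of chains in step 2 are omitted, so a single population may be split across several modes.

```python
import numpy as np

def grid_density(points, bins=32, bandwidth=1.5):
    """Bin 2D points on a grid and smooth the bin counts with a Gaussian kernel via FFT."""
    counts, xedges, yedges = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    pad = int(np.ceil(4 * bandwidth))       # zero padding limits FFT wrap-around
    padded = np.pad(counts, pad)
    n = padded.shape[0]
    offsets = np.fft.fftfreq(n) * n         # signed grid offsets: 0, 1, ..., -2, -1
    k1 = np.exp(-0.5 * (offsets / bandwidth) ** 2)
    kernel = np.outer(k1, k1)
    kernel /= kernel.sum()
    smooth = np.real(np.fft.ifft2(np.fft.fft2(padded) * np.fft.fft2(kernel)))
    return smooth[pad:-pad, pad:-pad], xedges, yedges

def uphill_labels(density, min_density):
    """Link each grid point to its highest neighbor and follow the chains to a mode.

    Returns an integer grid in which each entry is the flat index of the mode its
    chain ends at, or -1 for background (density below the assumed threshold).
    """
    nx, ny = density.shape
    flat = density.ravel()
    parent = np.arange(flat.size)
    for i in range(nx):
        for j in range(ny):
            best = i * ny + j
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    a, b = i + di, j + dj
                    if 0 <= a < nx and 0 <= b < ny and flat[a * ny + b] > flat[best]:
                        best = a * ny + b
            parent[i * ny + j] = best        # strictly uphill link, or self at a mode
    root = parent.copy()
    while True:                              # pointer jumping until every chain ends
        nxt = root[root]
        if np.array_equal(nxt, root):
            break
        root = nxt
    return np.where(flat >= min_density, root, -1).reshape(nx, ny)

def assign_points(points, labels, xedges, yedges):
    """Give each point the label of the grid bin that contains it."""
    ix = np.clip(np.searchsorted(xedges, points[:, 0], side="right") - 1, 0, labels.shape[0] - 1)
    iy = np.clip(np.searchsorted(yedges, points[:, 1], side="right") - 1, 0, labels.shape[1] - 1)
    return labels[ix, iy]
```

Calling `grid_density`, then `uphill_labels`, then `assign_points` on an array of shape (n, 2) returns one label per cell, with -1 marking background. Because every link points to a strictly higher density value, the links form a forest whose roots are the modes, which mirrors the tree structure described above.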
Probability Binning Comparison
Probability binning comparison of univariate distributions utilizes a variant of the chi-squared statistic (Roederer, M., et al. Cytometry, 45:37-46, 2001). In this variant, a control distribution is divided such that an equal number of events fall into each of the divisions, or bins. This approach minimizes the maximum expected variance for the control distribution and can be considered a mini-max algorithm. The control-derived bins then are applied to test sample distributions, and a normalized chi-squared value is computed (Roederer, M., et al. Cytometry, 45:37-46, 2001). While several algorithms for the comparison of univariate distributions arising from flow cytometry analyses have been developed and studied, algorithms for comparing multivariate distributions remain elusive. For example, algorithms such as Kolmogorov-Smirnov (K-S) statistics (Finch, P. D., J. Histochem. Cytochem. 27:800, 1979; Young, I. T., J. Histochem. Cytochem. 25:935-941, 1977), Overton subtraction (Overton, W. R., Cytometry, 9:619-626, 1988), SED (Bagwell, C., Clin. Immunol. Newsletter, 16:33-37, 1996), and some parametric models (Lampariello, F., Cytometry, 15:394-301, 1994; Lampariello, F., and Aiello, A., Cytometry, 32:241-254, 1998) are limited to univariate data. These types of algorithms commonly are utilized to determine whether or not given distributions are statistically significantly different, and/or to determine the fraction of events that are positive compared to a control distribution.
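The univariate binning step described above can be sketched as follows. Bin edges are taken at quantiles of the control sample so that each bin holds an equal share of control events, the same edges are applied to a test sample, and a chi-squared-style value is computed from the per-bin fractions. The function names are hypothetical, and the specific statistic used here, the sum over bins of (c - t)^2 / (c + t) on normalized fractions c and t, is an assumption; the further normalization described in the cited reference is not reproduced.

```python
import numpy as np

def probability_bins(control, n_bins):
    """Interior bin edges that split the control sample into equal-count bins."""
    quantiles = np.linspace(0.0, 1.0, n_bins + 1)[1:-1]
    return np.quantile(control, quantiles)

def bin_counts(sample, edges):
    """Count the events of a sample that fall into each control-derived bin."""
    which = np.searchsorted(edges, sample, side="right")
    return np.bincount(which, minlength=len(edges) + 1)

def pb_chi_squared(control, test, n_bins=10):
    """Chi-squared-style distance between per-bin fractions (assumed form)."""
    edges = probability_bins(control, n_bins)
    c = bin_counts(control, edges) / len(control)
    t = bin_counts(test, edges) / len(test)
    # c is positive in every bin by construction, so the denominator is nonzero
    # as long as the control sample has enough distinct values.
    return float(np.sum((c - t) ** 2 / (c + t)))
```

A test sample drawn from the same distribution as the control should give a value near zero, while a shifted or reshaped test distribution concentrates events in a few bins and gives a larger value.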
Univariate tools generally are sufficient for 3- or 4-parameter data (due to the relative independence of the measurements); however, they generally are not sufficient when the number of parameters is greater than five or six. Data analysis becomes significantly more difficult because the complexity of the data (and of the analysis) increases geometrically with the number of parameters; these difficulties are becoming more apparent as analysis with six or more colors (i.e., 8+ parameter data) becomes more routine.
Probability binning comparison also provides a useful metric for determining the probability with which two or more multivariate distributions represent distinct sets of data (Roederer, M., et al. Cytometry, 45:47-55, 2001). The metric can be used to identify the similarity or dissimilarity of samples. In this approach, hyper-rectangles of n dimensions (where n is the number of measurements being compared) comprise the bins used for the chi-squared statistic. These hyper-dimensional bins are constructed such that the control sample has the same number of events in each bin; the bins are then applied to the test samples for chi-squared calculations (Roederer, M., et al. Cytometry, 45:47-55, 2001).
Fundamentally, the multivariate probability binning (PB) algorithm is identical to the univariate PB algorithm. The unique aspect of the multivariate algorithm is how the data are divided into the bins on which the comparison is performed. As in the univariate comparison, the algorithm chosen results in a binning that divides the control distribution equally, i.e., any single event from the control distribution, selected at random, has the same probability of falling into any of the given bins.
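One way to obtain hyper-rectangular bins that divide the control sample equally is recursive median splitting, sketched below. The splitting rule used here (cut at the median of the dimension with the largest variance, then recurse on each half) is an assumption made for illustration, as are the function names and the chi-squared-style statistic; the text above requires only that each bin hold the same number of control events. The control sample is assumed to contain at least 2**depth events with few tied values.

```python
import numpy as np

def build_bins(control, depth):
    """Recursively halve the control sample; returns a tree of (dim, cut, left, right)."""
    if depth == 0:
        return None                                  # leaf: one hyper-rectangular bin
    dim = int(np.argmax(control.var(axis=0)))        # assumed rule: widest-spread dimension
    cut = float(np.median(control[:, dim]))
    left = control[control[:, dim] <= cut]
    right = control[control[:, dim] > cut]
    return (dim, cut, build_bins(left, depth - 1), build_bins(right, depth - 1))

def assign_bins(tree, sample):
    """Assign each row of the sample to a leaf bin numbered 0 .. 2**depth - 1."""
    out = np.zeros(len(sample), dtype=int)

    def walk(node, rows, label):
        if node is None:
            out[rows] = label
            return
        dim, cut, left, right = node
        go_left = sample[rows, dim] <= cut
        walk(left, rows[go_left], 2 * label)
        walk(right, rows[~go_left], 2 * label + 1)

    walk(tree, np.arange(len(sample)), 0)
    return out

def pb_chi_squared_nd(control, test, depth=4):
    """Chi-squared-style distance over control-derived multivariate bins (assumed form)."""
    tree = build_bins(control, depth)
    n_bins = 2 ** depth
    c = np.bincount(assign_bins(tree, control), minlength=n_bins) / len(control)
    t = np.bincount(assign_bins(tree, test), minlength=n_bins) / len(test)
    return float(np.sum((c - t) ** 2 / (c + t)))
```

Because every cut is made at a median of the control events that reach it, each leaf receives (up to rounding) the same number of control events, which is the equal-probability property described above. The same tree is then applied unchanged to each test sample.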