Most flow and mass cytometry applications in biomedical studies are based on comparisons between/among control and test samples. Dissimilarities between/among samples may be due to drug treatment regime, progression of disease, response to therapies, etc. To study these dissimilarities across samples, the populations of cells in each sample may be clustered to reveal phenotypically distinct cell subsets that can then be matched and compared between samples. Despite the widespread use of flow and mass cytometry to evaluate outcomes in the laboratory and the clinic, current analysis methods for sample comparison and matching between samples still require further development to fully accommodate real-world flow/mass cytometry data. At present, methods for sample comparison and matching are either computationally expensive and affected by the curse of dimensionality, or fail in the presence of small changes due to instrument noise, calibration, etc., that are very common in flow cytometry and similar types of data, as explained below.
Traditionally, cluster analysis of flow cytometry data has been done by manual gating of the data, which has proved effective in a gross sense but is both subjective and extremely laborious, particularly with current high-dimensional (Hi-D) (e.g., >6 measured parameters) data sets. The need to facilitate these analyses, and to make them more accurate, has motivated the development of automated or semiautomatic clustering and cluster matching methods for Hi-D flow and mass cytometry data.
Both of these tasks (cluster identification and cluster matching) are highly challenging because they are subject to the “curse of dimensionality”, a well-known statistical problem for Hi-D data that compromises both statistical validity and computational performance as described in Hastie, T., Tibshirani, R. & Friedman, J. Local methods in high dimensions in The elements of statistical learning. 22-27 (Springer-Verlag, 2009).
Existing methods address the cluster matching problem in two different ways, both of which have limitations. The first way is to cluster one sample at a time and align and match the cell subsets (clusters) present in multiple samples post clustering (e.g., Pyne, S. et al. Automated high-dimensional flow cytometric data analysis. Proc. Natl Acad Sci USA. 106, 8519-8524 (2009) (hereafter “Pyne et al., Proc. Natl Acad Sci, 2009”)). This conventional approach allows fast computational implementations in low dimensions. However, it can fail if the locations of the populations (clusters) significantly vary from sample to sample, or if populations disappear or appear between samples. When clustering is performed in Hi-D settings, this approach may also be compromised by the curse of dimensionality.
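The post-clustering matching step of the first approach can be illustrated with a minimal sketch. It is not the procedure of Pyne et al.; it assumes, purely for illustration, that each sample has already been clustered, that each cluster is summarized by its centroid, and that clusters are matched by minimizing the total Euclidean distance between centroids. The function name `match_clusters` and the centroid values are hypothetical.

```python
from itertools import permutations
from math import dist


def match_clusters(centroids_a, centroids_b):
    """Match clusters of sample A to clusters of sample B.

    Returns a tuple p in which cluster i of A is matched to cluster p[i] of B,
    chosen to minimize the summed Euclidean distance between matched centroids.
    Brute force over permutations: only practical for a handful of clusters.
    """
    n = len(centroids_a)
    if n != len(centroids_b):
        # A population that appears or disappears between samples breaks
        # the one-to-one assumption made by this sketch.
        raise ValueError("sketch assumes the same number of clusters in both samples")
    return min(
        permutations(range(n)),
        key=lambda p: sum(dist(centroids_a[i], centroids_b[p[i]]) for i in range(n)),
    )


# Hypothetical 2D centroids for a control and a test sample.
control = [(1.0, 1.0), (5.0, 5.0), (9.0, 1.0)]
test = [(9.2, 1.1), (1.1, 0.9), (5.3, 4.8)]

matching = match_clusters(control, test)
print(matching)  # (1, 2, 0)
```

The sketch makes the stated limitations concrete: the matching is only meaningful while centroids move little between samples, and it has no valid answer when the two samples contain different numbers of populations.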
The second type of approach (e.g., see Lee, S. et al. Modeling of inter-sample variation in flow cytometric data with the joint clustering and matching procedure. Cytometry A. 89(1), 30-43 (2016) (hereafter “Lee et al., Cytometry part A, 2016”); Cron, A. et al. Hierarchical modeling for rare event detection and cell subset alignment across flow cytometry samples. PLoS Comput Biol. 9(7), e1003130 (2013) (hereafter “Cron et al., PLoS Comput Biol, 2013”); and Dundar, M. et al. A non-parametric Bayesian model for joint cell clustering and cluster matching: identification of anomalous sample phenotypes with random effects. BMC Bioinformatics. 15, 314 (2014) (hereafter “Dundar et al., BMC Bioinformatics, 2014”)) alleviates some of these problems by creating a Hi-D template of meta-clusters (distinct biologically relevant cell types) in which all sample data are pooled, simultaneously clustered, and then matched. With these methods, multiple samples are treated as different realizations of a single underlying model reflecting the biological reality.
Apart from being computationally expensive, the majority of methods that belong to this category identify clusters by fitting mathematical models to datasets. The feasibility of fitting in these cases, however, is dramatically affected by the curse of dimensionality, because the number of combinations of possible parameters to be considered increases dramatically as the number of dimensions increases above three or four.
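As a rough illustration of this growth, the following sketch counts the free parameters of one commonly used model family, a Gaussian mixture with full covariance matrices. The choice of model is an assumption made for illustration only; the methods cited above are not necessarily of this form. The function name `gmm_free_parameters` and the choice of 10 components are likewise hypothetical.

```python
def gmm_free_parameters(n_components, n_dims):
    """Count free parameters of a full-covariance Gaussian mixture model."""
    means = n_components * n_dims
    # Each symmetric covariance matrix has d * (d + 1) / 2 free entries.
    covariances = n_components * n_dims * (n_dims + 1) // 2
    # Mixing weights sum to one, so one of them is determined by the rest.
    weights = n_components - 1
    return means + covariances + weights


# Parameter counts for a hypothetical 10-component model.
counts = {d: gmm_free_parameters(10, d) for d in (2, 4, 20, 40)}
for d, n in counts.items():
    print(f"{d:>2} dimensions: {n} free parameters")
```

Under these assumptions the count rises from 59 parameters in 2 dimensions to 8609 in 40 dimensions. The count itself grows only quadratically with the number of dimensions; the space of parameter combinations that a fitting procedure must search grows far faster.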
Thus, although the existing methods offer solutions to some aspects of the cluster-matching problem, they still do not fully accommodate real-world flow/mass cytometry data.
Conventional systems and methods also have drawbacks when used for sequential gating of high-dimensional flow data. Some conventional analysis applications used for sequential gating of Hi-D flow data include tools providing progressive two-dimensional (2D) views of the Hi-D flow data. However, conventional analysis applications do not provide the user with guidance for deciding which pair of reagents/markers may be a good candidate for the next recursive analysis round. Instead, they leave these choices up to the user, who often must resort to trial and error in making these analysis choices, a frustrating process when a large number of reagents and/or fluorescence detectors are used to distinguish individual subsets.
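The scale of this trial-and-error burden can be estimated with a short calculation. The sketch below counts how many distinct 2D views (unordered marker pairs) are available at a single gating step; the function name `candidate_2d_views` and the marker counts are illustrative assumptions.

```python
from math import comb


def candidate_2d_views(n_markers):
    """Number of distinct marker pairs available for one 2D gating view."""
    return comb(n_markers, 2)


# Hypothetical panel sizes.
views = {n: candidate_2d_views(n) for n in (6, 20, 40)}
for n, v in views.items():
    print(f"{n:>2} markers: {v} candidate 2D views per gating step")
```

A 6-marker panel offers 15 candidate views per step, whereas a 40-marker panel offers 780, and the choice must be repeated at every round of a sequential gating strategy.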
Thus, there is a need for systems and methods directed to improved cluster matching and user guidance for sequential gating of flow data, especially Hi-D flow data.