The discussion of any work, publications, sales, or activity anywhere in this submission, including in any documents submitted with this application, shall not be taken as an admission that any such work constitutes prior art. The discussion of any activity, work, or publication herein is not an admission that such activity, work, or publication existed or was known in any particular jurisdiction.
Flow cytometry (FC) systems are standard pieces of equipment in various biological investigation settings. A typical FC experiment collects data from thousands of cells, with each cell labeled with a number of detectable markers usually indicating specific cell-surface proteins, but also potentially labeling other cellular features. Because FC systems can quickly gather features from thousands to hundreds of thousands of cells, an FC experiment can quickly gather huge datasets from cell samples. Such analysis has accelerated many types of investigation.
FC analysis systems can include a number of components for preparing cells or samples, capturing feature data from cells, sorting cells, and providing one or more outputs or taking further actions in response to the data. While methods and components of such systems vary, operation of automated or semi-automated FC data collection and analysis systems is extremely well known in various biological, medical, and forensic fields. While FC systems are most commonly used today to collect features related to cell-surface proteins or receptors, FC technology is increasingly being used or considered for other applications, such as protein analysis and chemical analysis.
A critical component of FC systems is the set of computational tools applied to the features collected by the systems. Because a single FC experiment can include readings from thousands of cells, with up to about 30 features available for each cell sample in recent FC systems, automated compilation and analysis of FC data is an important component of FC data systems. This analysis may be performed by the information-enabled laboratory equipment used to gather FC data, or the laboratory equipment can collect and store the feature data in a digital recording medium so that the data can be read and processed at a later time by FC analysis systems. A number of data sets from FC experiments have been published and are used to evaluate and validate new methods for analysis of FC feature data.
Multiparametric single-cell analysis has advanced understanding of diverse biological and pathological processes, including cellular differentiation, intracellular signaling cascades, and clinical immunophenotyping. Analysis by flow cytometry is increasingly used to analyze intracellular markers (e.g., phosphorylated proteins) for drug targeting and for the identification of rare stem cell populations. Current flow cytometers typically provide simultaneous single-cell measurements of up to 12 fluorescent parameters in routine cases, and analysis of up to 19 parameters has been reported [1]. Recently, a next-generation mass cytometry platform (CyTOF™, DVS Sciences Inc., Toronto, ON, Canada) has become commercially available that allows routine measurement of 30 or more single-cell parameters [2]. Despite increasing research in cytometric analysis and the technological advances in acquiring an increasing number of parameters per single cell, methods for analyzing multidimensional single-cell data remain inadequate.
Flow cytometry simultaneously measures multiple proteins of individual cells. In typical flow cytometry studies, surface proteins are labeled with fluorescent dyes to generate fluorescent signals. Multiple colors of fluorescent labels can be used to stain multiple markers. After staining, individual cells generally are held in a thin stream of fluid and then passed through one or more laser detectors, which give measurements of size, granularity, and intensities of the fluorescent labels on a single-cell basis. Flow cytometry is able to process up to 7000 cells per second, generating large datasets containing measurements of multiple protein markers on a large number of cells. Thus, flow cytometry captures the heterogeneity of biological systems by providing multiparametric measurements of individual cells. Traditional analysis of cytometry datasets is a subjective process that requires intimate familiarity with the biological system.
Thus, in various fields, methods for exploring or analyzing data sets where a large number of samples (e.g., typically about 10³ to about 10⁸ cell samples, though any number of samples can be analyzed) are each characterized by a moderate number of features (e.g., typically 5 to about 30) remain limited.
Traditional methods for flow cytometry data analysis are often subjective and labor-intensive processes that require expert knowledge of the underlying cellular phenotypes. One common but cumbersome step is the selection of subsets of cells in a process called gating [3]. A gate is a region, defined in a biaxial plot of two measurements, that is used to select cells with a desired phenotype for downstream analysis. Gates are either drawn manually, for example using software such as FlowJo (www(.)treestar(.)com/) or FlowCore [4], or defined automatically by clustering algorithms [5, 6, 7, 8, 9, 10]. Manual gating is highly subjective and dependent on the investigator's knowledge and interpretation of the experiment. Automatic gating algorithms cluster cells by optimizing the objective that cells in the same cluster be more similar to each other than to cells from other clusters. Because these algorithms strive to define maximally different clusters, they often miss the underlying continuity of phenotypes (progression) that is inherent in cellular differentiation [11].
Furthermore, the optimization objectives of most automatic gating algorithms are predisposed to capture the most abundant cell populations, while rare cell types, such as stem cells, are either excluded as outliers or absorbed by larger clusters. Some algorithms, such as a recent approach for automated gating termed SamSPECTRAL, provide a solution for rare cell type identification [12].
Traditional cytometry data analysis methods also commonly suffer from limitations in scalability and visualization as the number of measurements per single cell increases. These limitations become more acute as the data dimensionality increases. Currently, to fully visualize an m-dimensional flow dataset, m(m−1)/2 biaxial plots are needed, where each biaxial plot displays the correlation of only two markers at one time. It is difficult to comprehend the correlations among three or more markers from a series of biaxial plots. One recent approach that partly addresses the scalability issue is the probability state model, implemented in the Gemstone™ software package (such as from Verity Software House, Inc.). This approach rearranges cells into a non-branching linear order, according to the investigator's expert knowledge of how known markers behave during the progression underlying the measured cell population [13]. Because cells are ordered in a non-branching fashion, a new model is constructed for each mutually exclusive cell type (e.g., T cells, B cells).
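To make the scaling concrete, the following minimal Python sketch counts the biaxial plots needed to show every pair of markers in a dataset. The function name and the marker names are illustrative assumptions and are not drawn from any of the cited software packages.

```python
from itertools import combinations


def biaxial_plot_count(m: int) -> int:
    """Number of distinct marker pairs among m markers, i.e. m*(m-1)/2."""
    return m * (m - 1) // 2


# Enumerating the pairs directly gives the same count as the closed form.
markers = [f"marker_{i}" for i in range(12)]  # hypothetical marker names
pairs = list(combinations(markers, 2))
assert len(pairs) == biaxial_plot_count(12)

# Plot counts for a few panel sizes mentioned in this section.
for m in (5, 12, 19, 30):
    print(m, biaxial_plot_count(m))
```

Under these assumptions, a 12-parameter panel already requires 66 biaxial plots, and a 30-parameter panel requires 435.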
Flow cytometry data can be displayed using one-parameter histograms or two-parameter scatter plots, on the basis of which gating is performed. A gate is a user-defined region, either in a one-parameter histogram or in a two-parameter scatter plot, that can be used to exclude irrelevant cells and select a subpopulation of cells of interest. After gating, subsequent analyses are performed to identify cell subpopulations and relevant surface markers, based on the two-parameter scatter plots.
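As a rough illustration of gating on a two-parameter scatter plot, the sketch below applies a rectangular gate to a handful of cells. It is a simplified example only: the function name, the marker names (CD3, CD19), the intensity values, and the gate boundaries are all hypothetical, and gates drawn in practice may be polygons or other shapes rather than rectangles.

```python
from typing import Dict, List, Tuple

Cell = Dict[str, float]  # one cell's measurements, keyed by marker name


def rectangular_gate(
    cells: List[Cell],
    x: str,
    y: str,
    x_range: Tuple[float, float],
    y_range: Tuple[float, float],
) -> List[Cell]:
    """Keep only the cells whose (x, y) measurements lie inside the rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return [c for c in cells if x_lo <= c[x] <= x_hi and y_lo <= c[y] <= y_hi]


# Hypothetical, normalized intensities for four cells.
cells = [
    {"CD3": 0.90, "CD19": 0.10},
    {"CD3": 0.10, "CD19": 0.80},
    {"CD3": 0.80, "CD19": 0.20},
    {"CD3": 0.05, "CD19": 0.05},
]

# Select cells that are high in CD3 and low in CD19.
gated = rectangular_gate(cells, "CD3", "CD19", (0.5, 1.0), (0.0, 0.3))
print(len(gated), "of", len(cells), "cells fall inside the gate")
```

The boundaries passed to the function play the role of the user-drawn region, which is where the subjectivity of manual gating enters.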
One recent advancement that partly addresses the issue of parameter scalability and visualization is a ribbon plot of cells arranged into a linear order, for example as implemented in the “Probability State Model” of the Gemstone® software. This approach rearranges cells into a linear order according to the user's expert input. Given a flow cytometry dataset containing tens of thousands of cells, this approach asks the user to specify one marker and how it changes during the progression underlying the data. Cells are then ordered linearly according to the change of the marker specified by the user. The user is able to refine this order by sequentially specifying more markers and their changes. Once the cells are ordered, changes of all the measured markers can be visualized in one single figure, a “ribbon plot.” FIG. 1 illustrates an example of a ribbon plot of linearly ordered cells used to illustrate and analyze B cell progression according to the prior art. Although this approach scales well as the number of parameters increases, it has two disadvantages. First, it requires the user to know which markers change and how these markers change during the progression underlying the measured cell population. The approach is not able to automatically identify relevant markers or to discover an unknown progression order, cellular differentiation, or hierarchy. Second, differentiation and branching underlying the measured cell population cannot be represented by a linear order of the cells. Therefore, if the user does not have prior knowledge about the progression underlying the measured cell population, or if differentiation and branching exist, the approach fails.
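The idea of a user-specified linear ordering can be sketched in a few lines of Python. This is a deliberately simplified illustration and is not the Probability State Model or any part of the Gemstone software: it assumes the user supplies a list of markers together with an expected direction of change, and it simply sorts cells by those markers in turn. The function name, the "up"/"down" labels, the marker names, and the values are hypothetical.

```python
from typing import Dict, List, Tuple

Cell = Dict[str, float]  # one cell's measurements, keyed by marker name


def order_cells(cells: List[Cell], spec: List[Tuple[str, str]]) -> List[Cell]:
    """Arrange cells in a single linear order from user-specified markers.

    spec is a list of (marker, direction) pairs, where direction is "up" if
    the marker is expected to increase along the progression and "down" if
    it is expected to decrease. Later markers only break ties left by
    earlier ones, which mimics sequential refinement of the order.
    """
    def key(cell: Cell) -> Tuple[float, ...]:
        return tuple(cell[m] if d == "up" else -cell[m] for m, d in spec)

    return sorted(cells, key=key)


# Hypothetical cells; "id" is only a label for reading the result.
cells = [
    {"id": 0, "CD34": 0.2, "CD20": 0.9},
    {"id": 1, "CD34": 0.9, "CD20": 0.1},
    {"id": 2, "CD34": 0.2, "CD20": 0.4},
]

# The user asserts that CD34 falls and CD20 rises along the progression.
ordered = order_cells(cells, [("CD34", "down"), ("CD20", "up")])
print([c["id"] for c in ordered])
```

The sketch makes both disadvantages visible: the ordering depends entirely on the markers and directions the user supplies, and the result is always one sequence, so a population that branches into two lineages cannot be represented.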