A common task for many applications is to compare data sets in order to distinguish two or more classes forming sub-populations of those data. One example of such an application involves the use of flow cytometry for medical diagnosis.
Flow cytometry can be used to measure properties related to individual cells in a sample of blood drawn from a patient. A liquid stream in the cytometer carries and aligns individual cells so that they pass through a laser beam in single file. As a cell passes through the laser beam, light is scattered from the cell surface. Photomultiplier tubes collect the light scattered in the forward and side directions, which gives information related to the cell size and shape. This information may be used to identify the general type of cell (e.g., monocyte, lymphocyte, or granulocyte).
Additionally, fluorescent molecules (fluorophores) that can be conjugated with antibodies can be activated by the laser and emit light. Since these antibodies bind with antigens on the cells, the amount of light detected from the fluorophores is related to the number of antigens on the surface of the cell passing through the beam. The specific set of fluorescently tagged antibodies that is chosen can depend on the types of cells to be studied since different types of cells have different distributions of cell surface antigens. Several tagged antibodies are used simultaneously, so measurements made as one cell passes through the laser beam consist of scattered light intensities as well as light intensities from each of the fluorophores. Thus, the characterization of a single cell can consist of a set of measured light intensities that may be represented as a coordinate position in a multidimensional space. Considering only the light from the fluorophores, there is one coordinate axis corresponding to each of the fluorescently tagged antibodies. The number of coordinate axes (the dimension of the space) is the number of fluorophores used. Modern flow cytometers can measure several colors associated with different fluorophores and thousands of cells per second. Thus, the data from one subject can be described by a collection of measurements related to the number of antigens of certain types on individual cells for each of (typically) many thousands of individual cells.
By way of example, one would like to determine if a patient has a specific illness based on a set of objective measurements obtained from a blood sample that is analyzed with a flow cytometer. The terminology used to describe data is as follows. One case (e.g. the flow cytometric data derived from a blood sample taken from a patient) is called a “sample instance.” (The terms “instance” and “sample” are also used.) Several sample instances may be associated with each other, forming a class of instances such as the class of patients having a disease or the class of patients who are healthy. Multiple sets of measurements (e.g. the measured light intensities for each cell passing through the flow cytometer) can be made for one instance. Each of these sets of measurements can be referred to as an “event.” In the terminology of the present invention, the data for an instance can consist of a distribution of points in a multidimensional space, with each point representing one event and with each coordinate of a point representing a measurement of light intensity from a single fluorophore. For example, FIG. 1 shows flow cytometry data for four fluorescent parameters. Since humans cannot visualize a 4-dimensional space, these data are shown as pair-wise dot plots.
Data of the type described above, consisting of several thousand events (or points) in a multidimensional parameter space, are best described by a density function, i.e. the number of events per unit volume of the space. Often, this density function is normalized by the total number of events comprising the instance. If this density function is known for some population of instances, it can be used to specify the probability that an event will be found within some region of the parameter space for instances belonging to this population. In mathematical terminology this is referred to as a probability density function (PDF).
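The normalized density described above can be illustrated with a minimal sketch. The regular grid of cells, the function names `normalized_density` and `region_probability`, and the four example events are assumptions made only for illustration; they are not taken from the method described herein.

```python
from collections import Counter


def normalized_density(events, bin_width):
    """Map each grid cell to the fraction of events that fall inside it.

    `events` is a list of equal-length tuples, one tuple per event.
    The regular grid with cells of size `bin_width` is an illustrative
    assumption, not a binning scheme taken from the text.
    """
    if not events:
        raise ValueError("need at least one event")
    counts = Counter(
        tuple(int(x // bin_width) for x in event) for event in events
    )
    total = len(events)
    # Dividing by the total event count normalizes the density to sum to 1.
    return {cell: n / total for cell, n in counts.items()}


def region_probability(density, cells):
    """Estimated probability that an event lies in the given set of cells."""
    return sum(density.get(cell, 0.0) for cell in cells)


# Four made-up two-parameter events (arbitrary intensity units).
events = [(0.5, 0.5), (0.7, 0.2), (1.5, 0.5), (1.2, 1.8)]
density = normalized_density(events, bin_width=1.0)
print(density)
print(region_probability(density, [(0, 0), (1, 0)]))
```

With only a few thousand events, the estimate in any one cell becomes noisy as the number of dimensions grows, which is one reason the choice of bins matters.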
In the example of flow cytometry for medical diagnosis, each class of instances (e.g. diseased or healthy) has an associated multidimensional PDF. The problem that arises in diagnosis can be that of determining the PDF for two or more classes of instances, measuring the density of events for a newly observed instance, and by comparing these distributions, assigning the newly observed instance to a class. Thus, accurately representing multidimensional data in such a way as to enable this classification is critical.
Flow cytometry has been in use as a clinical tool for many years (Johnson 1993 and Jennings 1997). In many applications, an optimized panel of antibodies is used to identify specific cell types. If a cell of a specific type is present, the intensity measured for the corresponding fluorophore will be high (positive events); if it is not present, the intensity will be low (negative events). In this case, one can count cells of different types by applying a threshold to the signal such that the signal intensity for negative events falls below the threshold and the signal intensity for positive events falls above the threshold. For multiple antibodies, the flow cytometric space is divided into “quadrants” using these thresholds, and thus the numbers of cells in each quadrant can be counted.
An example is shown in FIG. 2 for T-lymphocytes. CD4 positive events indicate the presence of helper T cells that play a role in regulating immune response. CD8 positive events indicate the presence of cytotoxic T cells that destroy infected cells. The ratio of CD4 positives to CD8 positives is a measure of immune status and can be used to diagnose or follow the progression of HIV infection since HIV targets helper T cells.
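The quadrant counting and the CD4 to CD8 ratio described above can be sketched as follows. The threshold values, the event intensities, and the names `quadrant_counts` and `cd4_cd8_ratio` are hypothetical and chosen only to make the arithmetic easy to follow; they are not measurements from FIG. 2.

```python
def quadrant_counts(events, cd4_threshold, cd8_threshold):
    """Count events in each of the four quadrants defined by two thresholds.

    Each event is a (cd4_intensity, cd8_intensity) pair. An intensity above
    its threshold is treated as positive, otherwise as negative.
    """
    counts = {"CD4+/CD8+": 0, "CD4+/CD8-": 0, "CD4-/CD8+": 0, "CD4-/CD8-": 0}
    for cd4, cd8 in events:
        label = "CD4{}/CD8{}".format(
            "+" if cd4 > cd4_threshold else "-",
            "+" if cd8 > cd8_threshold else "-",
        )
        counts[label] += 1
    return counts


def cd4_cd8_ratio(counts):
    """Ratio of all CD4 positive events to all CD8 positive events."""
    cd4_pos = counts["CD4+/CD8+"] + counts["CD4+/CD8-"]
    cd8_pos = counts["CD4+/CD8+"] + counts["CD4-/CD8+"]
    if cd8_pos == 0:
        raise ValueError("no CD8 positive events; ratio is undefined")
    return cd4_pos / cd8_pos


# Made-up intensities in arbitrary units; 200 is an assumed threshold.
events = [(800, 50), (900, 40), (30, 700), (20, 10), (850, 60), (25, 650)]
counts = quadrant_counts(events, cd4_threshold=200, cd8_threshold=200)
print(counts)
print(cd4_cd8_ratio(counts))
```

This kind of counting depends on the positive and negative populations being cleanly separated by a single threshold per parameter.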
Flow cytometric quadrant analyses, as described above, are possible when the cell antigens and specific antibodies are well characterized. However, in cases where these are not known or cell surface markers change with time, the distributions of intensity levels from flow cytometry measurements are complex and thus a simple positive/negative analysis is not possible. An example of an especially important class of cells that are not well characterized is Circulating Endothelial Progenitor Cells (CEPCs). These cells play a key role in post-natal angiogenesis and vascular development. A method of cytometrically identifying CEPCs would be of great interest for diagnostics and therapeutics related to cardiovascular pathology and conditions involving neovascularization such as ischemia, diabetic retinopathy, and tumor growth.
Other methods of representing and analyzing multidimensional flow cytometry data have been developed. The one most closely related to the herein described methods and apparatus is Probability Binning (Roederer 2001). Roederer's method of Probability Binning represents a multidimensional probability distribution as a set of bins defining regions of the multidimensional space. The boundaries of these bins are chosen so that approximately equal numbers of events lie in each bin. Bins are found recursively by selecting a coordinate dimension, determining the median in that coordinate, and subdividing the data such that events whose values for this coordinate are less than the median are placed in one bin while those whose values for this coordinate are greater than the median are placed in another bin. Dividing the data at the median ensures that for each subdivision of a “parent” bin, the “child” bins have equal numbers of events (plus or minus one if the number of events in the parent bin is odd). These two child bins are then processed in a similar way, splitting the data into four bins. This recursive method is continued until the desired number of bins is obtained. The method used by Roederer et al. to select the coordinate dimension at each subdivision is to calculate the variance of the data in the parent bin for all the coordinate dimensions and choose the dimension having the largest variance. It is important to note that this split always occurs on one of the coordinate axes of the data as originally presented. Thus, if the space is 4-dimensional, the data will be divided according to the coordinate corresponding to one of those four dimensions. Although the decision is made on the basis of the variance in each dimension, the split is not necessarily along the optimal direction since the direction of maximum variance may not coincide with one of the coordinate axes.
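The recursive median split described above can be sketched in a few lines. This is a simplified illustration written from the description in this section, not Roederer's published implementation: the function name `probability_bins`, the restriction to a power-of-two bin count, and the arbitrary handling of events tied at the median are all assumptions.

```python
from statistics import pvariance


def probability_bins(events, n_bins):
    """Split events into n_bins groups holding nearly equal numbers of events.

    At each level every parent bin is split at the median of the coordinate
    axis with the largest variance, so every cut is axis-aligned.
    """
    if n_bins < 1 or n_bins & (n_bins - 1):
        raise ValueError("n_bins must be a power of two in this sketch")
    if len(events) < n_bins:
        raise ValueError("need at least one event per bin")
    bins = [list(events)]
    while len(bins) < n_bins:
        children = []
        for parent in bins:
            n_dims = len(parent[0])
            # Choose the original coordinate axis with the largest variance.
            axis = max(
                range(n_dims),
                key=lambda d: pvariance([event[d] for event in parent]),
            )
            # Sorting and cutting at the midpoint splits at the median;
            # events tied at the median land on either side arbitrarily.
            ordered = sorted(parent, key=lambda event: event[axis])
            mid = len(ordered) // 2
            children.extend([ordered[:mid], ordered[mid:]])
        bins = children
    return bins


# Eight made-up events whose first coordinate varies most.
events = [(float(i), float(i % 2)) for i in range(8)]
bins = probability_bins(events, 4)
for b in bins:
    print(b)
```

Because `axis` always indexes one of the original coordinates, the sketch shows the limitation noted above: a cut can never follow a direction of maximum variance that lies between the axes.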
However, current practices and approaches fall short of providing efficient, robust, reliable and accurate systems of representing multidimensional data that can be used to address the herein discussed problems. From the foregoing, it is appreciated that there exists a need for methods and an apparatus that overcome the shortcomings of those existing previously.