Flow cytometry is the measurement of characteristics of minute particles suspended in a flowing liquid stream. A focused beam of laser light illuminates each moving particle and light is scattered in all directions. Detectors placed forward of the intersection point or orthogonal to the laser beam receive the pulses of scattered light, generating signals which are input into a computer analyzer for interpretation. The total amount of forward scattered light detected depends on particle size and refractive index but is closely correlated with cross-sectional area of the particle as seen by the laser, whereas the amount of side scattered light can indicate shape or granularity.
One of the most widely used applications of flow cytometry is that of cellular analysis for medical diagnostics, where the particles of interest are cells suspended in a saline-containing solution. Further properties of the cell, such as surface molecules or intracellular constituents, can also be accurately quantitated if the cellular marker of interest can be labeled with a fluorescent dye; for example, an antibody-fluorescent dye conjugate may be used to attach to specific surface or intracellular receptors. Immunophenotyping by characterizing cells at different stages of development through the use of fluorescent-labeled monoclonal antibodies against surface markers is one of the most common applications of flow cytometry. Other dyes have been developed which bind to particular structures (e.g., DNA, mitochondria) or are sensitive to the local chemistry (e.g., Ca++ concentration, pH, etc.).
While flow cytometry is widely used in medical diagnostics, it is also useful in non-medical applications, such as water or other liquid analysis. For example, seawater may be analyzed to identify presence of or types of bacteria or other organisms, milk can be analyzed to test for microbes, and fuels may be tested for particulate contaminants or additives.
The laser beam is of a wavelength suitable to excite the selected fluorochrome or fluorochromes. The quantity of fluorescent light emitted can be correlated with the expression of the cellular marker in question. Each flow cytometer is usually able to detect many different fluorochromes simultaneously, depending on its configuration. In some instruments, multiple fluorochromes may be analyzed simultaneously by using multiple lasers emitting at different wavelengths. For example, the FACSCalibur™ flow cytometry system available from Becton Dickinson (Franklin Lakes, N.J.) is a multi-color flow cytometer that is configured for four-color operation. The fluorescence emission from each cell is collected by a series of photomultiplier tubes, and the resulting electrical events are collected and analyzed on a computer that assigns a fluorescence intensity value to each signal and stores the values in Flow Cytometry Standard (FCS) data files. Analysis of the data involves identifying intersections or unions of polygonal regions in hyperspace that are used to filter or “gate” the data and define a subset or sub-population of events for further analysis or sorting.
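By way of illustration only, the gating described above may be sketched as follows. The sketch assumes two-dimensional polygonal gates, each drawn on a pair of channels, and combines them by intersection (logical AND) or union (logical OR). The channel names (FSC, SSC, FL1, FL2), the event values, and the gate vertices are hypothetical and are not taken from any particular instrument or data file.

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]

def in_polygon(point: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: True if the point lies inside the polygon."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point count.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which the edge crosses that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def gate(events: List[Dict[str, float]], x_channel: str, y_channel: str,
         polygon: Sequence[Point]) -> List[bool]:
    """Return one membership flag per event for a gate drawn on two channels."""
    return [in_polygon((e[x_channel], e[y_channel]), polygon) for e in events]

# Hypothetical events; each dictionary holds one event's channel values.
events = [
    {"FSC": 300, "SSC": 200, "FL1": 50, "FL2": 600},
    {"FSC": 320, "SSC": 180, "FL1": 700, "FL2": 40},
    {"FSC": 900, "SSC": 850, "FL1": 650, "FL2": 30},
    {"FSC": 50, "SSC": 60, "FL1": 20, "FL2": 25},
]

# Hypothetical gates: a scatter gate on (FSC, SSC) and a gate on (FL1, FL2).
scatter_gate = [(200, 100), (400, 100), (400, 300), (200, 300)]
fl1_positive_gate = [(500, 0), (1000, 0), (1000, 100), (500, 100)]

in_scatter = gate(events, "FSC", "SSC", scatter_gate)
in_fl1_positive = gate(events, "FL1", "FL2", fl1_positive_gate)

# Intersection keeps events inside both gates; union keeps events inside either.
intersection = [a and b for a, b in zip(in_scatter, in_fl1_positive)]
union = [a or b for a, b in zip(in_scatter, in_fl1_positive)]

print(intersection)  # [False, True, False, False]
print(union)         # [True, True, True, False]
```

In this sketch, only the second event falls within both gates, so it alone would be passed on as the gated sub-population under the intersection rule.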
The International Society for Analytical Cytology (ISAC) has adopted the FCS Data File Standard for the common representation of flow cytometry (FCM) data. This standard is supported by all of the major analytical instruments to record the measurements from a sample run through a cytometer, allowing researchers and clinicians to choose among a number of commercially-available instruments and software without encountering major data compatibility issues. However, this standard stops short of describing a protocol for computational post-processing and data analysis.
Because of the large amount of data present in a flow cytometry analysis, it is often difficult to fully utilize the data through a manual process. The high dimensionality of the data also makes it infeasible to use traditional statistical methods and learning techniques such as artificial neural networks. The support vector machine is a kernel-based machine learning technique capable of processing high-dimensional data. With an appropriately designed kernel, it can be an effective tool for handling flow cytometry data.
Kernels play a critical role in modern machine learning technologies such as support vector machines (SVM). A support vector machine for classification is defined as an optimal hyperplane in a feature space, which is often a high dimensional (even infinite dimensional) inner product space. The construction of the optimal hyperplane requires the inner products, in the feature space, of mapped input vectors. A kernel function defined on the input space provides an effective way to compute the inner products without actually mapping the input to the feature space. The kernel defines a similarity measure between two vectors. An advantage of the SVM approach is its ability to harvest hidden patterns based solely on the natural similarity measure of the kernel, without using explicit feature extractions.
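By way of illustration only, the following sketch shows how a kernel function can yield a feature-space inner product without performing the mapping. It assumes a homogeneous polynomial kernel of degree 2, for which the explicit feature map consists of all pairwise products of the input components, and it also evaluates a Gaussian kernel as a similarity measure. The function names, the example vectors, and the value of sigma are hypothetical.

```python
import math
from itertools import product
from typing import List, Sequence

def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x: Sequence[float], z: Sequence[float]) -> float:
    """Degree-2 polynomial kernel computed directly in the input space."""
    return dot(x, z) ** 2

def poly2_feature_map(x: Sequence[float]) -> List[float]:
    """Explicit feature map for poly2_kernel: all products x_i * x_j."""
    return [xi * xj for xi, xj in product(x, repeat=2)]

def gaussian_kernel(x: Sequence[float], z: Sequence[float],
                    sigma: float = 1.0) -> float:
    """Gaussian kernel; equals 1 for identical vectors and decays with distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

x = [1.0, 2.0, 3.0]
z = [0.5, -1.0, 2.0]

# Kernel evaluated in the 3-dimensional input space.
k_direct = poly2_kernel(x, z)

# Same quantity obtained by mapping to the 9-dimensional feature space first.
k_explicit = dot(poly2_feature_map(x), poly2_feature_map(z))

print(k_direct, k_explicit)
print(gaussian_kernel(x, x), gaussian_kernel(x, z))
```

The two polynomial values agree, although the direct evaluation never constructs the nine-component mapped vectors. For kernels such as the Gaussian kernel, whose feature space is infinite dimensional, only the direct evaluation is available.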
In many applications, such as image recognition and flow cytometry data analysis, the input data are usually high-dimensional and large in quantity. The important features of such data usually lie in the distributions of the points in certain spaces, rather than in the isolated values of individual points. The standard kernels (e.g., polynomial kernels and Gaussian kernels) are often ineffective on this type of data because they treat all vector components equally, so that the large input volumes tend to make the kernels insensitive to the underlying structures and the distributional features of the specific problems. As a result, they are not well suited for distributional data. For example, SVM analysis of flow cytometry data has been reported using radial basis function (RBF) kernels, examples of which are Gaussian and B-spline kernels. (See Rajwa, B., et al., “Automated Classification of Bacterial Particles in Flow by Multiangle Scatter Measurement and Support Vector Machine Classifier”, Cytometry Part A, 73A:369-379 (2008).) The described method required the use of an “enhanced scatter-detection system” to obtain the reported high classification accuracy. Further, the authors concluded that the SVM results could not easily be interpreted if the dimensionality of the problem was higher than 2. Such a limitation restricts the practical applications of the technique. Toedling, et al., in “Automated in-silico detection of cell populations in flow cytometry readouts and its application to leukemia disease monitoring”, BMC Bioinformatics, 7:282, June 2006, describe SVM analysis of flow cytometry data using a radial basis function kernel to detect leukemic cells by assigning single cells to pre-defined groups. In effect, the SVM analysis takes the place of manual gating but does not take into account any distributional features of the data.
Accordingly, the need remains for a method for analysis of flow cytometry data and other types of distributional data that captures the important information contained in the underlying structures and distributions and that is capable of use with higher dimensionalities. The present invention is directed to such a method.