The present invention relates in general to statistical analysis of one or more data sample distributions, such as a histogram of photoluminescence data produced by a blood flow cytometer system, and is particularly directed to a signal processing operator that is operative to preserve statistical probability distribution characteristics for a quantized data set that is subjected to a dynamic range expansion transform, such as a logarithmic operator, which, without the statistical probability preservation mechanism, would introduce undesirable binning artifacts into the transformed data.
Flow cytometry derives from the quantitative measurement (meter) of structural features of biological cells (cyto) transported by a carrier in a controlled flow through a series of primarily optical detectors. Flow cytometers have been commercially available since the early 1970s, and their use has been continuously increasing. The most numerous flow cytometers are those employed for complete blood cell counts in clinical laboratories. Flow cytometers are found in all major biological research institutions. They are also numerous in medical centers, where they are used for diagnosis as well as research. There are currently on the order of 7,000 flow cytometers in use worldwide. Chromosome count and cell cycle analysis of cancers is the major diagnostic use. Lymphomas and leukemia are intensively studied for surface markers of diagnostic and prognostic value. Flow cytometry has been the method of choice for monitoring AIDS patients.
The general architecture of a flow cytometer system is shown diagrammatically in FIG. 1(a). Cells in suspension (retained in a saline carrier reservoir 10) are caused to flow one at a time (typically at rates of over 100 cells per second) through a transport medium 12 (such as a capillary tube). As the stream of cells flows through the capillary, the cells pass, one at a time, through an illumination region or window 14, where they are illuminated by a focused optical output beam produced by a laser 16. Distributed around the illumination window are a plurality of optical sensors 18, which are located so as to intercept and measure the optical response of each cell to the laser beam illumination, including forward scatter intensity (proportional to cell diameter), orthogonal scatter intensity (proportional to cell granularity), and fluorescence intensities at various wavelengths. Each optical sensor's measurement output is then digitized and coupled to a computer (signal processor) 20 for processing.
Because different cell types can be differentiated by the statistical properties of their measurements, flow cytometry can be used to separate and count different cell populations in a mixture (a blood sample, for example). In addition, the cells can be stained with fluorescent reagents or dyes that bind to specific biochemical receptors of certain cells, allowing the measurement of biological and biochemical properties. The object of flow cytometry is to separate and quantify cell populations. Typically, the acquired data is accumulated into one or two dimensional data distributions so that the morphological variability of the distributions can be interpreted to distinguish the cellular populations.
In the current approaches for analyzing flow cytometric histograms, a pre-processing step known as the log-transformation is employed to increase the dynamic range of the data distribution, in order to facilitate analysis and hence enhance interpretation. Unfortunately, although this step serves to broaden the dynamic range of the data distributions, it introduces an undesirable artifact, known as the "picket fence" or "binning effect", that undermines the one aspect of the solution it is intended to reinforce.
This may be illustrated by reference to FIGS. 1(b) and 1(c), wherein FIG. 1(b) shows two overlapped Gaussian distributions as the original data, and FIG. 1(c) shows in discontinuous vertical bold black lines the binning or picket fence effect in the histogram of the log-transformed data; the continuous gray line in FIG. 1(c) is the ideal continuous transformation. The transformed data is problematic. Although its dynamic range is expanded, it exhibits a significant binning artifact; moreover, the result of the discrete log transformation is not suitable for data analysis or any direct and consequential statistical analysis.
In an effort to counter the binning effect's undesired artifacts, which skew the analysis/interpretation of the results, a number of practitioners have relied upon filtering techniques that attempt to attenuate this undesired effect without introducing changes that might undermine the statistical values of the original data. However, these filtering approaches are fraught with an irreconcilable issue: that of relating the degree to which the artifact must be filtered with the point at which the filtered data is still considered to retain statistics similar to the original data.
More particularly, to resolve the binning effect issue, averaging schemes based on finite impulse response (FIR) filters are typically used. The most effective FIR filtering schemes are achieved using a traditional (1-2-1) 3-tap FIR filter. Such a filter has positive attributes, provided that the following constraints are met: 1- to maintain the area under the curve, the sum of the FIR filter coefficients must equal 1, and the appropriate boundary conditions must be met; 2- to prevent the data from being skewed, the filter coefficients must be symmetric around the center (this requires that the number of filter taps in the FIR filter be odd), and the individual distributions must be symmetrical and not overlapping, which is not typically true when analyzing log-transformed data from cell populations.
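The two constraints above can be illustrated with a minimal Python sketch. It assumes the (1-2-1) taps are normalized to (0.25, 0.5, 0.25) so that they sum to 1, and it uses edge replication as one possible boundary condition that keeps the area unchanged; the function name `fir_121` and the boundary choice are illustrative assumptions, not taken from the text.

```python
def fir_121(hist, passes=1):
    """Smooth a histogram with normalized (1-2-1)/4 taps.

    Assumption: the first and last bins are replicated at the edges,
    which is one boundary condition that preserves the total area.
    """
    h = [float(v) for v in hist]
    for _ in range(passes):
        padded = [h[0]] + h + [h[-1]]          # replicate edge bins
        h = [0.25 * padded[i] + 0.5 * padded[i + 1] + 0.25 * padded[i + 2]
             for i in range(len(h))]
    return h


# A "picket fence" style input: isolated spikes separated by empty bins.
spiky = [0, 0, 8, 0, 0, 4, 0, 0]
smoothed = fir_121(spiky, passes=20)
print(sum(spiky), round(sum(smoothed), 9))     # the area is unchanged
```

Because the taps are symmetric, each pass spreads a spike equally to both sides, which is why the filter cannot follow the skew of a log-transformed population however many passes are applied.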
FIGS. 2(a), 2(b), 2(c) and 2(d) respectively illustrate the effects of an FIR filtering scheme, after 20, 100, 200 and 500 passes of the binned histogram of FIG. 1(c) through a traditional (1-2-1) FIR filter. In each of these Figures, the bold black curve is the filtered histogram, while the gray curve shows the ideal transformed data. From qualitative analyses of the results of the FIR smoothing, one might infer that there is an optimal number of passes required for the closest approximation to the actual distribution. However, FIR filtering cannot be optimal, because the log transformation function has been applied to initially Gaussian populations, and the resultant log-normal populations will be skewed. Clearly, after 500 passes, this can be observed. Another factor that makes the use of FIR filtering techniques inappropriate is the fact that, depending on the physical properties of the cells and the particular sensors used, the cell populations may overlap, and therefore any excessive filtering will skew their statistical properties.
Other practitioners have attempted to ameliorate the binning effect by making use of log-amplifiers to electronically transform the input signal in its analog form before digitization. This approach has a number of drawbacks: 1- it requires the use of additional and expensive hardware; 2- logarithmic amplifiers are notoriously noisy and unstable; and 3- when linear data is also required, the instruments must send and store twice the amount of data.
Another proposal to overcome the binning effect is to use high-resolution analog-to-digital converters (ADCs), in an effort to prevent the log transformation from exhibiting discontinuities in the lower range histogram channels. This approach is not perfect and has the following problems: 1- high-resolution ADCs are expensive; 2- the amount of data the instruments must send and store will increase proportionally with the increase in bit resolution of the ADC, another expensive proposition; and 3- no matter what the resolution of the ADC used, the binning effect, although minimized, will still be present in the output histogram. Of course, there is a point where the natural irregularities of the real data will be greater than the binning effect.
In accordance with the present invention, the above drawbacks of conventional signal processing mechanisms used to process histograms of data, especially data associated with biological cell populations, including human blood cells that have been subjected to flow cytometry processing, are effectively obviated by a new and improved, statistical probability distribution-preserving histogram transform mechanism. This inventive mechanism effectively preserves the statistical probability distribution characteristics of the original data that would otherwise be removed or lost in the course of expanding the dynamic range of a quantized histogram data set through the use of a conventional log transformation. With this new approach to data mapping, the need for filtering becomes a non-issue, since the binning effect is countered early in the process, thus preserving the statistical properties of populations under study.
As will be detailed hereinafter, the present inventors have determined that the binning effect results from the erroneous premise that an input data range [x0, x0+1) uniformly maps or is transformed to the output data range [t(x0), t(x0)+1), where t(x) is the transformation function. In actuality, however, when mapping digitized input data to the transformed domain, the quantization or discretization process carried out by an analog-to-digital converter implements a floor function, so that for the analog data range [x0, x0+1), the output will be the integer digital value x0.
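The effect of mapping each integer ADC value to a single output bin can be reproduced with a short Python sketch. The numerical settings (10-bit ADC, 3 decades, 256 bins) and the base-10 mapping x = K*10^(S*y) are assumptions chosen for illustration, as is clamping values below K to bin 0.

```python
import math

# Assumed illustration values: 10-bit ADC, 3 decades, 256 output bins.
R, D, B = 10, 3, 256
S = D / (B - 1)        # decades per output bin
K = 2**R / 10**D       # input value that maps to the start of bin 0


def log_bin(x):
    """Conventional mapping: one integer ADC value goes to exactly one bin."""
    if x < K:          # below the displayed range: clamp to bin 0 (assumption)
        return 0
    return min(B - 1, int(math.log10(x / K) / S))


# A perfectly flat input histogram: one event in every ADC channel.
out = [0] * B
for x0 in range(2**R):
    out[log_bin(x0)] += 1

print(out[:40])        # isolated counts separated by runs of empty bins
print(out[250:255])    # many input channels piled into each high bin
```

With these settings, ADC values 2 and 3 land in bins 24 and 39 respectively, so every bin between them stays empty, while roughly two dozen input channels share each of the top bins. That is the picket fence pattern, produced from an input that has no structure at all.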
If the probability distribution of the input data sample in the range [x0, x0+1) is known, then the probability distribution in the transformed domain can be determined. When input data is transformed from the input domain, X, to the output domain, Y, using a given transformation $Y = t(X)$, the input probability distribution, $f_X(x)$, is transformed into $f_Y(y)$. Because the transformation does not affect the probability of each event, the infinitesimal probability of any point x is the same at the output value $y = t(x)$. In mathematical terms this may be expressed as $f_Y(y)\,dy = f_X(x)\,dx$. By manipulation and substituting x by $t^{-1}(y)$, the relationship between the resulting probability distribution, $f_Y(y)$, and the input probability distribution, $f_X(x)$, may be expressed as $f_Y(y) = f_X(t^{-1}(y)) \cdot \frac{d}{dy}\,t^{-1}(y)$, where X and Y are two random variables related by the transformation $Y = t(X)$, and $t(x)$ is a monotonically increasing transformation function whose inverse $t^{-1}(y)$ has a continuous derivative on Y. This ensures that the areas under the input and output probability distributions are maintained, an important requisite when analyzing cell populations.
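The manipulation referred to above can be written out step by step. The following is the standard change-of-variables argument for a monotonically increasing t, given here as a sketch; the cumulative distributions $F_X$ and $F_Y$ are notation introduced only for this derivation.

```latex
\begin{aligned}
F_Y(y) &= P(Y \le y) = P\bigl(t(X) \le y\bigr)
        = P\bigl(X \le t^{-1}(y)\bigr) = F_X\bigl(t^{-1}(y)\bigr), \\
f_Y(y) &= \frac{d}{dy}\,F_Y(y)
        = f_X\bigl(t^{-1}(y)\bigr)\,\frac{d}{dy}\,t^{-1}(y), \\
\int f_Y(y)\,dy &= \int f_X\bigl(t^{-1}(y)\bigr)\,\frac{d}{dy}\,t^{-1}(y)\,dy
        = \int f_X(x)\,dx = 1 .
\end{aligned}
```

The third line is the area-preservation property: substituting $x = t^{-1}(y)$ turns the integral of the output density back into the integral of the input density.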
For a typical log transformation, the transformed density distribution may be expressed as: $g(y) = f(K e^{y/s}) \cdot K \cdot S \cdot e^{y/s}$, where S and K are scaling constants defined as $S = D/(B-1)$ and $K = 2^R/10^D$, B is the number of bins in the output histogram, D is the number of desired decades, and R is the bit resolution of the ADC. In order to employ this transformed log density distribution expression, it is necessary to know the probability distribution of the input histogram, which is unknown, but can be approximated.
The binning effect may be considered to be the result of assuming that the input probability distribution at each ADC interval is a delta function positioned at each ADC output value. In actuality, however, this erroneous assumption is the least probable case when dealing with biological events, such as biological cell populations. To improve the accuracy of the log-transformation, a more realistic probability distribution at each ADC interval, such as a uniform approximation, a linear approximation, or any higher-order approximation, may be used.
When using a uniform approximation, the probability distribution of the input data in each ADC interval is assumed to be uniform. Analysis has shown that the resulting histogram more closely resembles the ideal transformation, as binning effect artifacts are eliminated. In addition, although it is not an ideal transformation, the area under the curve is maintained. The uniform approximation can be readily implemented in real-time using look-up tables. Although the uniform approximation provides results similar to the ideal transformation, imperfections may occur in the lower range at the locations of the holes of the binned histogram, as a direct result of the difference between the actual input distribution and the uniform approximation made. To improve upon the results obtained with the uniform approximation, a higher-order approximation, such as a linear approximation, may be employed.
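A minimal Python sketch of the uniform approximation follows. It assumes a base-10 mapping x = K*10^(S*y) with illustrative settings (10-bit ADC, 3 decades, 256 bins), and it assumes that bin 0 absorbs everything below K; the function names are hypothetical. Under a uniform density, the share of one event that falls in an output bin is simply the length of the overlap between the ADC interval [x0, x0+1) and that bin's range in the input domain.

```python
import math

# Assumed illustration values: 10-bit ADC, 3 decades, 256 output bins.
R, D, B = 10, 3, 256
S = D / (B - 1)
K = 2**R / 10**D


def bin_edge(b):
    """Input-domain value at which output bin b begins (assumed mapping)."""
    return K * 10 ** (S * b)


def uniform_contributions(x0):
    """Split one event in ADC interval [x0, x0+1) across output bins.

    Returns {bin_index: fraction}; the fractions sum to 1.
    """
    lo, hi = float(x0), float(x0 + 1)
    shares = {}
    for b in range(B):
        left = 0.0 if b == 0 else bin_edge(b)            # bin 0 absorbs x < K
        right = math.inf if b == B - 1 else bin_edge(b + 1)
        overlap = min(hi, right) - max(lo, left)
        if overlap > 0:
            shares[b] = overlap
    return shares


low = uniform_contributions(2)      # a low channel spreads over many bins
high = uniform_contributions(1000)  # a high channel stays within one bin
print(len(low), sum(low.values()), high)
```

With these settings, ADC channel 2 is spread over output bins 24 through 39, which are exactly the bins a one-to-one mapping would leave empty, while channel 1000 contributes its whole weight to a single bin.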
As its name implies, when a linear approximation is used, the probability distribution of the input data in each ADC interval is assumed to be linear. Analysis has shown that the results are practically identical to the ideal log-transformation, and the area under the curve is again maintained. Although results can be improved using higher-order approximations, it does not make practical sense to do so, since the natural variability of biological populations is likely to be far greater than the error of the approximation made.
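One way a linear approximation might be sketched in Python is shown below. The text does not say how the slope within an ADC interval is obtained, so the central-difference estimate in `relative_slope` is purely an assumption, as are the base-10 mapping x = K*10^(S*y), the numerical settings, and all function names. The density on [x0, x0+1) is taken as 1 + slope*(x - centre), which integrates to 1 over the interval.

```python
import math

# Assumed illustration values: 10-bit ADC, 3 decades, 256 output bins.
R, D, B = 10, 3, 256
S = D / (B - 1)
K = 2**R / 10**D


def edge(b):
    """Input-domain value at which output bin b begins (assumed mapping)."""
    return K * 10 ** (S * b)


def relative_slope(counts, x0):
    """Hypothetical slope estimate from the neighbouring channel counts."""
    n = counts[x0]
    if n == 0:
        return 0.0
    left = counts[x0 - 1] if x0 > 0 else n
    right = counts[x0 + 1] if x0 + 1 < len(counts) else n
    m = (right - left) / (2.0 * n)
    return max(-2.0, min(2.0, m))    # keeps the density non-negative


def linear_mass(a, b, x0, slope):
    """Probability of [a, b] within [x0, x0+1) for density 1 + slope*(x - centre)."""
    c = x0 + 0.5
    return (b - a) + 0.5 * slope * ((b - c) ** 2 - (a - c) ** 2)


def linear_contributions(counts, x0):
    """Per-event shares {bin_index: fraction} for channel x0; they sum to 1."""
    slope = relative_slope(counts, x0)
    lo, hi = float(x0), float(x0 + 1)
    shares = {}
    for b in range(B):
        left = 0.0 if b == 0 else edge(b)
        right = math.inf if b == B - 1 else edge(b + 1)
        a, z = max(lo, left), min(hi, right)
        if z > a:
            shares[b] = linear_mass(a, z, x0, slope)
    return shares
```

When the neighbouring counts are rising, the shares shift toward the upper output bins of the interval relative to the uniform case; with equal neighbours the slope is zero and the result reduces to the uniform approximation.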
Because the amplitudes of the output histogram are non-integer, it is impossible to create a sequence of data values that generates the same output histogram. This also makes it impossible to directly enumerate cell populations, on an event by event basis, based on analysis of the output histogram; however, this is easily overcome by performing an inverse transformation of analysis results into the input domain and enumerating cell populations using input events.
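The inverse-transformation step can be sketched in Python as follows. It assumes that the analysis result is a gate expressed as a range of output bins, that the mapping is x = K*10^(S*y) with illustrative settings, and that an event is counted when its integer ADC value lies inside the inverse-mapped range; the function names and the gate representation are hypothetical.

```python
# Assumed illustration values: 10-bit ADC, 3 decades, 256 output bins.
R, D, B = 10, 3, 256
S = D / (B - 1)
K = 2**R / 10**D


def gate_to_input(b_lo, b_hi):
    """Map an output-bin gate [b_lo, b_hi) back to an input-domain range."""
    return K * 10 ** (S * b_lo), K * 10 ** (S * b_hi)


def count_events(events, b_lo, b_hi):
    """Count raw ADC events that fall inside the inverse-mapped gate."""
    x_lo, x_hi = gate_to_input(b_lo, b_hi)
    return sum(1 for x in events if x_lo <= x < x_hi)


events = [5, 50, 60, 500, 900]          # raw integer ADC values
print(gate_to_input(128, 200))          # roughly (32.8, 230.8)
print(count_events(events, 128, 200))   # whole events, counted in the input domain
```

Counting in the input domain yields integer event counts even though the output histogram used to place the gate has non-integer amplitudes.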
In order to implement the signal processing operator of the invention in real time in a flow cytometer, it is necessary to process each data point as it is being acquired, or process all of the data after acquisition. Namely, the objective is to accumulate the histogram as the data points are acquired. Using the above log-transformed density distribution expression, and assuming that the probability distribution for any given input data range [x0, x0+1) is uniform, a non-integer bin or channel contribution in the output histogram can be computed by applying this expression to each input data point. The bin/channel contributions are then summed to form the output histogram.
Because the output histogram bins have integer boundaries, the area under the ideal curve in each bin is computed and assigned to that bin. The total area of the resulting non-integer bin contributions sums to a value of one. Adjacent data ranges produce overlapping contributions, which serve to fill in adjacent areas; namely, there is an effective "spreading" of one input data point into several non-integer bins that fill in the discontinuities or "holes" and eliminate the "spikes" observed in a typical binned histogram.
Different types of non-integer bin contributions for the log transformation are computed and accumulated for each data point being mapped from the original domain to the transformed domain. Each set of bin contributions is different and is a function of the position of the histogram channel. This may be readily implemented using look-up tables. A relatively fast way to accumulate the histogram is to have pre-computed in memory all necessary non-integer bin contributions indexed by input channel. These pre-computed values can then be used as a look-up table to provide fast access to the information. The non-integer contributions are described so as to allow for efficient retrieval from memory and adding them to the output histogram. The use of pre-computed look-up tables makes computation time inconsequential, allowing the invention to be implemented effectively in real-time.
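A look-up-table implementation along these lines might look like the following Python sketch. It assumes the uniform approximation, a base-10 mapping x = K*10^(S*y), illustrative settings (10-bit ADC, 3 decades, 256 bins), and a table layout of (first output bin, list of fractions) per input channel; the layout and names are assumptions made for illustration.

```python
import math

# Assumed illustration values: 10-bit ADC, 3 decades, 256 output bins.
R, D, B = 10, 3, 256
S = D / (B - 1)
K = 2**R / 10**D


def edge(b):
    """Input-domain value at which output bin b begins (assumed mapping)."""
    return K * 10 ** (S * b)


def build_lut():
    """Pre-compute, for every ADC channel, (first_bin, [fractions...])."""
    lut = []
    for x0 in range(2**R):
        lo, hi = float(x0), float(x0 + 1)
        first, fracs = None, []
        for b in range(B):
            left = 0.0 if b == 0 else edge(b)            # bin 0 absorbs x < K
            right = math.inf if b == B - 1 else edge(b + 1)
            w = min(hi, right) - max(lo, left)           # uniform-density share
            if w > 0:
                if first is None:
                    first = b
                fracs.append(w)
        lut.append((first, fracs))
    return lut


def accumulate(events, lut):
    """Add each event's pre-computed contributions to the output histogram."""
    hist = [0.0] * B
    for x0 in events:
        first, fracs = lut[x0]
        for i, w in enumerate(fracs):
            hist[first + i] += w
    return hist


lut = build_lut()                            # done once, before acquisition
hist = accumulate(range(2**R), lut)          # flat input: one event per channel
print(round(sum(hist), 6), min(hist[:255]) > 0)
```

The table is built once, so the per-event work during acquisition reduces to one table read and a handful of additions. For the flat test input, the accumulated histogram keeps the total event count and has no empty bins below the top bin.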