Array-based genetic analyses start with a large library of cDNAs or oligonucleotides (probes), immobilized on a substrate. The probes are hybridized with a single labeled sequence, or a labeled complex mixture derived from a tissue or cell line messenger RNA (target). As used herein, the term “probe” will therefore be understood to refer to material tethered to the array, and the term “target” will refer to material that is applied to the probes on the array, so that hybridization may occur.
There are two kinds of measurement error, random and systematic. Random error can be detected by repeated measurements of the same process or attribute and is handled by statistical procedures. Low random error corresponds to high precision. Systematic error (offset or bias) cannot be detected by repeated measurements. Low systematic error corresponds to high accuracy.
Background correction involves subtracting from each probe intensity the intensity of an area outside that probe. Areas used to calculate background can be close to the probe (e.g., a circle surrounding the probe) or distant. For example, “blank” elements can be created (i.e., elements without probe material), and the values of these elements can be used for background estimation.
Normalization procedures involve dividing the probe intensity by the intensity of some reference. Most commonly, this reference is taken from a set of probes or from the mean of all probes.
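The two corrections described above can be illustrated with a minimal Python sketch. The function names, the use of the mean of the "blank" elements as the background estimate, and the sample intensities are assumptions made for illustration only; they are not taken from any particular imaging system.

```python
from statistics import mean

def background_correct(probe_intensities, background_intensities):
    """Subtract the mean background intensity from every probe intensity."""
    background = mean(background_intensities)
    return [p - background for p in probe_intensities]

def normalize(probe_intensities, reference_intensities=None):
    """Divide each probe intensity by the mean of a reference set.

    If no reference set is supplied, the mean of all probes is used.
    """
    if reference_intensities is None:
        reference_intensities = probe_intensities
    reference = mean(reference_intensities)
    if reference == 0:
        raise ValueError("reference intensity is zero")
    return [p / reference for p in probe_intensities]

# Hypothetical raw probe values and "blank" element values.
raw = [120.0, 220.0, 420.0]
blanks = [18.0, 20.0, 22.0]

corrected = background_correct(raw, blanks)
normalized = normalize(corrected)
```

After normalization against the mean of all probes, the normalized values average to one, which makes arrays with different overall brightness comparable.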
Once systematic error has been removed by background removal and normalization procedures (or others, as required), any remaining measurement error is, in theory, random. Random error reflects the expected statistical variation in a measured value. A measured value may consist, for example, of a single value, a summary of values (mean, median), a difference between single or summary values, or a difference between differences. In order for two values to be considered reliably different from each other, their difference must exceed a threshold defined jointly by the measurement error associated with the difference and by a specified probability of concluding erroneously that the two values differ (Type I error rate).
Of primary interest are differences between two or more quantified values, typically across different conditions (e.g., diseased versus non-diseased cell lines, drug versus no drug). The desired estimate of expected random error ideally should be obtained from variation displayed by replicate values of the same quantity. This is the way that such estimates are normally used in other areas of science. Hybridization studies, however, tend to use a very small number of replicates (e.g., two or three). Estimates of random error based on such small samples are themselves very variable, making comparisons between conditions using standard statistical tests imprecise and impractical for all but very large differences.
This difficulty has been recognized by Bassett, Eisen, & Boguski in “Gene expression informatics: It's all in your mine”, Nature Genetics, 21, 51–55 (1999), who have argued that the most challenging aspects of presenting gene expression data involve the quantification and qualification of expression values, and that qualification would include standard statistical significance tests and confidence intervals. They argued further that “ideally, it will be economically feasible to repeat an experiment a sufficient number of times so that the variance associated with each transcript level can be given” (p. 54). The phrase “sufficient number of times” in the preceding quote highlights the problem. The current state of the art in array-based studies precludes obtaining standard statistical indices (e.g., confidence intervals, outlier delineation) and performing standard statistical tests (e.g., t-tests, analyses of variance) that are used routinely in other scientific domains, because the number of replicates typically present in studies would ordinarily be considered insufficient for these purposes. A key novelty in the present invention is the circumvention of this difficulty.
Statistical indices and tests are required so that estimates can be made about the reliability of observed differences between probe/target interactions across different conditions. The key question in these kinds of comparisons is whether observed differences in measured values are likely to reflect random error only, or random error combined with a treatment effect (i.e., a “true difference”). In the absence of formal statistical procedures for deciding between these alternatives, informal procedures have evolved in prior art. These procedures can be summarized as follows:
1. Arbitrary thresholds. Observed differences across conditions are compared against an arbitrary threshold. For example, differences greater than 2- or 3-fold are judged to reflect “true” differences.
2. Thresholds established relative to a subset of array elements. A subset of “reference” genes is used as a comparison point for ratios of interest. For example, relative to the reference gene, a gene may show a 2:1 expression ratio when measured at time 1, a 2.8:1 ratio when measured at time 2, and so on.
3. Thresholds established based on observed variation in background. The standard deviation of background values is used as a proxy for the measurement error standard deviation associated with probe values of interest. If a probe intensity exceeds the background standard deviation by a specified multiple (e.g., 2.5), the probe is considered “significant.”
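The informal rules numbered 1 and 3 above can be sketched in a few lines of Python. This is an illustrative reading only: the function names are hypothetical, and rule 3 is interpreted here as comparing a background-corrected probe intensity against a multiple of the background standard deviation, which is an assumption about how the rule is applied.

```python
from statistics import stdev

def fold_change_call(value_a, value_b, fold=2.0):
    """Rule 1: judge a difference "true" if the ratio between conditions
    exceeds an arbitrary fold threshold in either direction."""
    ratio = value_a / value_b
    return ratio >= fold or ratio <= 1.0 / fold

def background_sd_call(probe_intensity, background_values, multiple=2.5):
    """Rule 3: judge a probe "significant" if its (background-corrected)
    intensity exceeds a chosen multiple of the background standard deviation."""
    return probe_intensity > multiple * stdev(background_values)
```

Neither function refers to the variability of the probe values themselves, which is the weakness of these rules: the decision depends only on a fixed cutoff or on a proxy error taken from background.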
None of the above approaches is optimal, because each relies on a relatively small number of observations for deriving inferential rules. Also, assessments of confidence are subjective and cannot be assessed relative to “chance” statistical models. Approaches 1 and 2 are especially vulnerable to this critique. They do not meet standards of statistical inference generally accepted in other fields of science in that formal probability models play no role in the decision-making process. Approach 3 is less subject to this latter critique in that a proxy of measurement error is obtained from background. It is nonetheless not optimal because the measurement error is not obtained directly from the measured values of interest (i.e., the probes) and it is not necessarily the case that the error operating on the background values is of the same magnitude and/or model as the one operating on probe values.
Other informal approaches are possible. For example, the approaches described in 2 above could be modified to estimate the standard deviations of log-transformed measurements of reference genes probed more than once. Because of the equality [log(a)−log(b)=log(a/b)], these proxy estimates of measurement error could then be used to derive confidence intervals for differential ratios of log-transformed probes of interest. This approach would nonetheless be less than optimal because the error would be based on proxy values and on a relatively small number of replicates.
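The proxy-based interval just described might be computed as in the following sketch. It assumes that a single log-transformed probe value in each condition carries the same error standard deviation as the reference gene replicates, and it uses a normal quantile for the interval; both are simplifying assumptions, and the function and parameter names are hypothetical.

```python
import math
from statistics import NormalDist, stdev

def log_ratio_interval(a, b, reference_replicates, confidence=0.95):
    """Confidence interval for the ratio a/b, using the standard deviation
    of log-transformed reference replicates as a proxy for measurement error."""
    proxy_sd = stdev([math.log(x) for x in reference_replicates])
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    # log(a) - log(b) = log(a/b); a difference of two values with equal
    # error variance has standard error proxy_sd * sqrt(2).
    log_ratio = math.log(a) - math.log(b)
    half_width = z * proxy_sd * math.sqrt(2.0)
    return math.exp(log_ratio - half_width), math.exp(log_ratio + half_width)
```

With only a few reference replicates the proxy standard deviation is itself unstable, so the resulting interval inherits the limitation noted above.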
Chen et al. (Chen, Dougherty, & Bittner) in “Ratio-based decisions and the quantitative analysis of cDNA microarray images”, Journal of Biomedical Optics, 2, 364–374 (1997) have presented an analytical mathematical approach that estimates the distribution of non-replicated differential ratios under the null hypothesis. Like the present invention, this procedure derives a method for obtaining confidence intervals and probability estimates for differences in probe intensities across different conditions. However, it differs from the present invention in how it obtains these estimates. Unlike the present invention, the Chen et al. approach does not obtain measurement error estimates from replicate probe values. Instead, the measurement error associated with ratios of probe intensities between conditions is obtained via mathematical derivation of the null hypothesis distribution of ratios. That is, Chen et al. derive what the distribution of ratios would be if none of the probes showed differences in measured values across conditions that were greater than would be expected by “chance.” Based on this derivation, they establish thresholds for statistically reliable ratios of probe intensities across two conditions. The method, as derived, is applicable to assessing differences across two conditions only. Moreover, it assumes that the measurement error associated with probe intensities is normally distributed. The method, as derived, cannot accommodate other measurement error models (e.g., lognormal). It also assumes that all measured values are unbiased and reliable estimates of the “true” probe intensity. That is, it is assumed that none of the probe intensities are “outlier” values that should be excluded from analysis. Indeed, outlier detection is not possible with the approach described by Chen et al.
The approaches described above attempt to address issues that relate to how large differences across conditions must be before they are considered sufficiently reliable to warrant a conclusion of “true” difference. Distinguishing between probe values that represent signal and those that represent nonsignal represents a different issue which relates to the qualification of probe values within arrays rather than across conditions.
Two approaches have been presented. Piétu et al. (Piétu, Alibert, Guichard, and Lamy), in “Novel gene transcripts preferentially expressed in human muscles revealed by quantitative hybridization of a high density cDNA array”, Genome Research, 6, 492–503 (1996), observed in their study that a histogram of probe intensities presented a bimodal distribution. They observed further that the distribution of smaller values appeared to follow a Gaussian distribution. In a manner not described in their publication, they “fitted” the distribution of smaller values to a Gaussian curve and used a threshold of 1.96 standard deviations above the mean of the Gaussian curve to distinguish nonsignals (smaller than the threshold) from signals (larger than the threshold).
Chen et al. (cited above) describe the following method for assessing whether a probe represents a signal or nonsignal value. Within a digitized image of an array, pixels within each probe area are rank-ordered. The intensity of the eight lowest pixel values is compared to background via a non-parametric statistical test (Mann-Whitney U-test). If the results of the statistical test support the conclusion that these eight pixel values are above background, the procedure stops and the probe is considered a signal. If the eight pixel values are not above background, some or all of the pixels are considered to be at or below background. The same test is repeated by either eliminating all eight pixels and repeating the test with the next eight lowest pixel values or by eliminating a subset of the eight pixels and replacing them with the same number of the next lowest values. The test proceeds in this fashion until all pixels are estimated to be at or below background or until a threshold of number of pixels is reached. In either case, the probe is classified as nonsignal.
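The thresholding step attributed to Piétu et al. can be illustrated as follows. Because their fitting procedure is not described, this sketch simply assumes that the values belonging to the lower mode have already been isolated by some means and estimates the Gaussian parameters from them directly; the function names are hypothetical.

```python
from statistics import mean, stdev

def signal_threshold(low_mode_values, z=1.96):
    """Threshold placed z standard deviations above the mean of the
    (assumed Gaussian) distribution of low-intensity values."""
    return mean(low_mode_values) + z * stdev(low_mode_values)

def classify(intensities, threshold):
    """Label each probe intensity as signal or nonsignal."""
    return ["signal" if x > threshold else "nonsignal" for x in intensities]
```

A hard threshold of this kind assigns every probe to exactly one class and gives no probability of membership for values that lie in the region where the two modes overlap.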
The macro format (FIGS. 1, 4) was introduced some years ago and is in fairly widespread use. Typically, probes are laid down on membranes as spots of about 1 mm in diameter. These large spots are easily produced with robots, and are well suited to isotopic labeling of targets, because the spread of ionizing radiation from an energetic label molecule (e.g. 32P) precludes the use of small, closely-spaced probes. Detection is most commonly performed using storage phosphor imagers.
Microarrays consisting of oligonucleotides synthesized on microfabricated devices have been in use for some time. With the recent commercial availability of microarraying and detection apparatus, microarrays of single-stranded cDNAs deposited on a substrate are seeing broader use.
With both micro and macro genome arrays, numerical data are produced by detecting the amount of isotope or fluorescent label at each assay site. The result is one or more arrays of numbers, each member of which quantifies the extent of hybridization at one assay in the specimen array. The hybridization level is an indication of the expression level of sequences complementary to a specific probe. Therefore, analysis can be used to both identify the presence of complementary sequences, and to quantify gene expression leading to those complementary sequences.
The analysis proceeds by determining which specific assays show interesting alterations in hybridization level. Typically, alterations in hybridization are specified as ratios between conditions. For example, data may be of the form that assay X (representing expression of a particular gene) is three times as heavily labeled in a tumor cell line as in a normal cell line. The relevant issue is “how is the statistical significance of a specific comparison to be specified?”
Specification of statistical significance is important because of the presence of error in our measurements. We could define true hybridization as the amount that would be observed if procedural and measurement error were not present. Ideally, the same probe-target pairing would always give us the same measured hybridization value. Valid hybridization values are those which index true hybridization.
In fact, hybridization tends to be heavily influenced by conditions of the reaction and by measurement error. The mean coefficient of variation in a replicated fluorescent microarray often hovers near 25%. That is, repeated instances of hybridization between the same probe and target can yield values which vary considerably about a mean (the best estimate of true hybridization). Therefore, any single data point may or may not be an accurate reflection of true hybridization.
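For reference, the coefficient of variation mentioned above is the replicate standard deviation divided by the replicate mean. A minimal sketch, with a hypothetical function name and illustrative replicate values:

```python
from statistics import mean, stdev

def coefficient_of_variation(replicates):
    """Sample standard deviation of the replicates divided by their mean."""
    return stdev(replicates) / mean(replicates)

# Three hypothetical replicate intensities for one probe-target pairing.
cv = coefficient_of_variation([75.0, 100.0, 125.0])  # 0.25, i.e. 25%
```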
The present invention differs from prior art in that it estimates measurement error directly from array replicates (within or across arrays). The present invention is able to provide statistically valid inferences with the small numbers of replicates (e.g., three) characteristic of array hybridization studies. In the present invention, the statistical difficulties posed by small sample sizes are circumvented by the novel process of obtaining an estimate of measurement error for each probe based on the average variance of all replicates for all probes. In accordance with one preferred aspect, the invention assumes that all replicates, being part of the same population of experiments and being similarly treated during array processing, share a common and/or constant variance.
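The idea of borrowing error information across probes can be sketched as follows. This is an illustrative reading, not the claimed procedure: it assumes equal replicate counts, a common variance shared by all probes, and a normal (z) reference distribution for the test, and all function names are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance

def pooled_variance(replicates_by_probe):
    """Average the per-probe sample variances across all probes."""
    return mean([variance(reps) for reps in replicates_by_probe])

def difference_z_test(reps_a, reps_b, pooled_var):
    """Test the difference between two condition means for one probe,
    using the pooled variance as the measurement error estimate."""
    se = math.sqrt(pooled_var * (1.0 / len(reps_a) + 1.0 / len(reps_b)))
    z = (mean(reps_a) - mean(reps_b)) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p

# Hypothetical data: three probes, three replicates each, two conditions.
condition_a = [[10.0, 11.0, 12.0], [20.0, 21.0, 22.0], [30.0, 31.0, 32.0]]
condition_b = [[10.0, 11.0, 12.0], [25.0, 26.0, 27.0], [30.0, 31.0, 32.0]]

error_var = pooled_variance(condition_a + condition_b)
results = [difference_z_test(a, b, error_var)
           for a, b in zip(condition_a, condition_b)]
```

Because the error estimate is built from every replicate set on the array rather than from the three replicates of a single probe, it is far more stable than a per-probe variance, which is what makes a test with so few replicates workable.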
In accordance with another preferred aspect, measurement error can be assessed separately for different probe classes. These classes may be determined based on the deconvolution procedures described below or by other statistical or experimental methods.
The present invention differs from all prior art in that it:
1. is applicable to any number of experimental conditions rather than being restricted to only two conditions;
2. estimates measurement error empirically from probe replicates;
3. can detect outliers;
4. can accommodate various measurement error models; and
5. can assess the adequacy of an assumed measurement error model.
There is a second aspect to the present invention, which deals with the discrimination of probe response classes within arrays. Element measurements within arrays may reflect multiple classes of values. For example, some values may represent signals and others may represent nonsignals (e.g., background). As another example, some values may represent a family of genes associated with disease states, while other values originate from genes not known to be altered in disease. The present invention is novel in that it uses a mathematically-derived approach for deconvolving any mixture of distinct underlying distributions, which is used in turn to classify probe values as signal or nonsignal.
Specifically, the present invention is novel in its method of treating overlapping distributions within the arrayed data. In particular, the invention models dual or multiple distributions within an array. Preferably, it does this by mathematical mixture modeling, which can be applied to deconvolve distributions and regions of overlap between distributions in a rigorous fashion. This contrasts with prior art, which fails to model more than one distribution within array data and which, therefore, is unable to model regions of overlap between distributions. As a consequence, prior art may miss data (e.g., probes with low signal levels) which have acceptable probabilities of belonging to a valid signal distribution. The present invention assigns probabilities that any probe belongs to one of the contributory distributions within an array data population.
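By way of illustration only, mixture modeling of this kind can be sketched with a two-component Gaussian mixture fitted by expectation-maximization. The choice of Gaussian components, the EM fitting method, the initialization, and all names below are assumptions made for the example; the text above does not prescribe them.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and SD sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_gaussians(values, iterations=200):
    """Fit a two-component Gaussian mixture by expectation-maximization."""
    xs = sorted(values)
    half = len(xs) // 2
    # Crude initialization from the lower and upper halves of the sorted data.
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (len(xs) - half)]
    spread = (xs[-1] - xs[0]) / 4.0 or 1.0
    sigma = [spread, spread]
    weight = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in values:
            dens = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and standard deviations.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue
            weight[k] = nk / len(values)
            mu[k] = sum(r[k] * x for r, x in zip(resp, values)) / nk
            var = sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, values)) / nk
            sigma[k] = max(math.sqrt(var), 1e-6)
    return weight, mu, sigma

def signal_probability(x, weight, mu, sigma):
    """Posterior probability that x belongs to the higher-mean component."""
    dens = [weight[k] * normal_pdf(x, mu[k], sigma[k]) for k in range(2)]
    high = 0 if mu[0] > mu[1] else 1
    total = sum(dens) or 1e-300
    return dens[high] / total

# Hypothetical intensities: a low (nonsignal) cluster and a high (signal) cluster.
intensities = [9.0, 10.0, 11.0, 10.0, 9.0, 11.0, 10.0, 10.0,
               48.0, 50.0, 52.0, 49.0, 51.0, 50.0]
weight, mu, sigma = fit_two_gaussians(intensities)
```

Calling `signal_probability` on a probe value returns a graded membership probability instead of a hard signal/nonsignal label, so a low-intensity probe lying in the overlap region can still be retained if its probability of belonging to the signal distribution is acceptable.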