The invention relates to the analysis of gene probe microarrays and, in particular, to the analysis of image data produced by such gene probe microarrays.
Monitoring gene expression using high-density microarrays is a technique in the study of cell functions and the associated biochemical pathways, candidate gene identification, cellular response to drug compounds, and classification of disease states. For example, see:
Alon, U. et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues by oligonucleotide arrays. Proc. Natl. Acad. Sci. USA, 96, 6745-6750 (1999).
Zhu, H. et al. Cellular gene expression altered by human cytomegalovirus: global monitoring with oligonucleotide arrays. Proc. Natl. Acad. Sci. USA 95, 14470-14475 (1998).
Wodicka, L. et al. Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnology 15, 1359-1366 (1997).
Eisen, M. B. et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci.USA 95, 14863-14868 (1998).
Tamayo, P., et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA 96, 2907-2912 (1999).
Golub, T. R. et al. Molecular classification of cancer Class discovery and class prediction by gene expression monitoring. Science 286, 531-537 (1999).
It appears that recent research has largely focused on enhancing the microarray technology itself and the corresponding experimental protocols. For example, see
Lockhart, D. J. et al. Expression monitoring by hybridization to high-density oligonucleotide arrays. Nature Biotechnology 14, 1675-1680 (1996).
Schena, M. et al. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270, 467-470 (1995).
Shalon, D. et al. A DNA microarray system for analyzing complex DNA samples using two-color fluorescent probe hybridization. Genome Research 6, 639-645 (1996).
Mahadevappa, M. and Wodicka, L. A high-density probe array sample preparation method using 10- to 100-fold fewer cells. Nature Biotechnology 17, 1134-1136(1999).
Other research has focused on developing higher-level analysis methods such as clustering and classification. For example, see
Chen, Y. et al. Ratio-based decisions and the quantitative analysis of cDNA microarray images. Journal of Biomedical Optics 2, 364-374 (1997).
Chen et al. detailed algorithms for image segmentation and confidence intervals for expression ratios for cDNA microarray data.
The fundamentals of oligonucleotide expression array technology are described, for example, in the Lockhart paper cited above and are well-known in the art. The oligonucleotide expression array technology is broadly discussed here to provide a frame of reference for discussion of the invention. In particular, genes are represented on a probe array by some number of sequences of a particular length that uniquely identify the genes and that ostensibly have optimal hybridization characteristics. Each oligonucleotide (probe) is synthesized in a small cell that contains a large number (typically between 106 and 107) of copies of a given probe.
A mismatch (MM) oligonucleotide is designed to correspond to a perfect match (PM) oligonucleotide pulled from a gene sequence. In an MM oligonucleotide, typically the center base position of the oligonucleotide has been mutated. The MM probes give some estimate of the random hybridization and cross hybridization signals.
To use an oligonucleotide array, RNA samples are prepared and fluorescently labeled according to a particular protocol (e.g., the protocol set forth by Lockhart et al. in the article cited above), and then the labeled RNA sample is hybridized to the corresponding probes on the array. The array then goes through an automated staining/washing process (e.g., using an Affymetrix fluidics station), and the array is scanned using a confocal laser. The scanner generates an image of the array by exciting each cell with its laser, detects the resulting photon emissions from the fluorescently labeled RNA that has hybridized to the probes in the cell, and then converts the detected photon emissions into a raw intensity value for each cell. xe2x80x9cFeaturesxe2x80x9d (comprised of groups of cells) are xe2x80x9cextractedxe2x80x9d based on the images, and characteristic feature intensities are computed from the raw cell intensities. It can be determined from the features"" xe2x80x9ccharacteristic intensityxe2x80x9d whether a particular gene is present in the array, and the quantity at which the gene is present.
Conventional feature extraction is now discussed in greater detail. For example, as discussed by Wodicka et al.(1997), the raw oligonucleotide array image has recognizable patterns at each corner that allows the determination of the positions of the corners of the array. The number of features in each row and column is known. Once the corners are determined, the positions of each feature in the array are computed.
As can be seen from FIG. 1A, the boundary pixels of a feature are typically distorted by blurring (i.e. their levels are xe2x80x9cpulledxe2x80x9d towards the level of a neighboring feature) and do not faithfully represent the true intensity of the feature. Therefore, the boundary pixels are conventionally removed before the characteristic feature intensity is computed. That is, the intensities of the boundary pixels of a feature are not considered in determining a characteristic intensity value for the feature. In most cases, after removing the boundary pixels from a feature, the feature is represented by a 6xc3x976 block of pixels that remain.
Then, the characteristic intensity for the feature is determined, for example, by computing an average intensity of the remaining pixels. It can be seen from FIG. 1A that determining the median of the remaining 6xc3x976 pixels often results in determining the median value from a more variable region than, say, the most homogenous block of pixels (e.g., a 4xc3x974 pixel block) within the 6xc3x976 pixel block. This can result in a downward bias from the xe2x80x9ctruexe2x80x9d characteristic feature intensity.
Furthermore, FIG. 1B illustrates how a misalignment of the basic grid can result in a failure to extract the central part of the true feature.
What is desired is a feature extraction method that more robustly and reliably extracts the xe2x80x9cusefulxe2x80x9d portion of a true feature for determining characteristic feature intensity.
Furthermore, it is well known that the comparison of gene expression results across experiments is enhanced when the results of the experiments are normalized to a single scale. Normalizing multiple probe arrays to allow direct array-to-array comparisons has presented a great challenge. Conventional normalization methods include 1) linear normalization and nonlinear regression, and 2) methods using housekeeping genes or staggered spike-in controls.
With linear normalization, it is assumed that the intensities between two or more arrays are related as a straight line with a zero y-intercept. Its use leads to multiplication by a scaling factor (slope of the line) to make the mean of the xe2x80x9cexperimentxe2x80x9d chip the same as that of the baseline chip. A description of this technique applied to Affymetrix probe arrays is given by Alon et al. (1999). For example, see page 6746, lines 2-4 which states that
xe2x80x9cTo compensate for possible variations between arrays, the intensity of each EST on an array was divided by the mean intensities of all ESTs on that array and multiplied by a nominal average intensity of 50.xe2x80x9d
Ignoring the slight differences of the number of retained probe pairs per gene (due to outlier probe removal), the essential effect of these operations is equivalent to the multiplication of each probe pair difference by a constant scaling factor.
Chen et al. (1998) describe an application of the linear normalization technique to cDNA spotted arrays, where one intensity channel is normalized against another on the same array. For example, on page 371, formulae (12) and (13) represent a linear scaling operation across the whole array. (The Chen paper actually used a more complicated procedure where the scaling is applied iteratively in connection with the fitting of the density for the ratios.) Although the linear normalization technique is simple and robust, this method has the drawback that it does not account well for nonlinear relations. For example, FIG: 2 illustrates a situation where the slope in the low intensity region (of the scatter plot of PM/MM differences between two arrays) is substantially different from the slope in the high intensity region. In fact, a 10%-50% difference in slope values between regions is quite common. A non-linear regression technique (e.g., generalized cross-validation or GCVSS as described in Wahba, G. Spline Methods for Observational Data. CBMS-NSF regional conference series in applied mathematics. Philadelphia: SIAM (1990)) may be employed, but even non-linear regression can be inadequate if the expression profiles of the various arrays vary greatly from each other.
The drawbacks of the conventional normalization methods can be seen with reference to FIG. 2. Specifically the PM/MM difference from two murine Affymetrix Mu6500SubA probe array experiments are plotted in FIG. 2. The line 202 is the line generated by the LR normalization method, the curve 204 is generated by the GCVSS method, and the lines 206 and 208 are generated by applying the LR method to low and high differences, respectively (the low/high cutoff was determined empirically from the GCVSS line 204 matches the line 206 at the low end and the 208 and 202 lines at the high end, although the data between the two experiments is not really linearly related.
It has also been suggested (e.g., see Ermolaeva et al; 1998) that normalization between arrays can be based on a set of xe2x80x9chousekeepingxe2x80x9d genes. Unfortunately, many of the genes conventionally used as housekeeping genes (e.g., 8 actin, glyceraldehyde-3-phosphate dehydrogenase, transferrin receptor, signal transducer and activator of transcription 1, among others) have ranges of differential expression similar to other genes whose differential expression patterns are deemed biologically relevant to the system under study. In accordance with one known method, control cRNAs for bacterial and phage genes (e.g., BioB, BioC, BioD, and cre) are consistently added to hybridization mixtures at known concentrations. However, these controls are often prepared in bulk and completely independently of the sample being profiled, and so, the normalization relation between the controls on different arrays typically does not reflect the true normalization relation for the biologically relevant genes of interest.
What is desired is a normalization method that more accurately reflects the true normalization relation.
In accordance with one aspect of the invention, a method is provided to determine a characteristic intensity of a feature in image data generated by scanning a microarray probe. A set of pixels of the image data that nominally represent the feature is identified. The pixels each have an value (such as an intensity value) associated therewith. For each of a plurality of subsets of the set of pixels, a variation statistic value is determined that corresponds to a variation in the values associated with the pixels of that subset. One of the subsets of pixels is chosen based on the determined variation statistic values.
In accordance with a further aspect of the invention, a method is provided to relate a first expression array of probes to a second expression array of probes. A subset of the probe for the arrays is determined based on a comparison of the ordering of the subset of the probes of the second array, according to a particular characteristic of the probes, to the ordering of corresponding probes in the first array according to the particular characteristic of the probes. A relationship of the second expression array to the first expression array is determined based on the subset of probes of the second expression array to the corresponding probes of the first array.