Poor reproducibility of microarray expression measurements under varying experimental conditions has been a significant impediment to the widespread adoption of microarrays in clinical practice. Variation among expression values has been classified as either biologically interesting or obscuring. See, for example, Bolstad et al., 2003, Bioinformatics 19, 185-193, which is hereby incorporated by reference herein in its entirety. Previous research (Bolstad et al., 2003, Bioinformatics 19, 185-193, and Lyons-Weiler, 2003, Applied Bioinformatics 2, 193-195) has identified the following sources of obscuring variation among microarray datasets obtained from replicate samples in multiple laboratories: (i) differences in sample preparation (for example, total RNA preparation, amplification and labeling), (ii) differences in the production and age of the arrays, and (iii) differences in the processing of the arrays (for example, time, temperature, drying and washing protocols, scanner differences).
The computational process by which microarray expression measurements are converted to mutually comparable values is referred to as standardization. Initial efforts at microarray standardization involved dividing the log-expression values on a microarray by the mean expression of all genes across the microarray. This approach works well if the relation between cellular constituent abundance for a given gene (which is the quantity microarrays are designed to measure) and the hybridization signal measured by the scanner is approximately linear across replicate samples. However, it has been established (Bolstad et al., 2003, Bioinformatics 19, 185-193; and Moraleda et al., 2004, Proceedings of the American Society of Clinical Oncology annual meeting Vol. 23, each of which is hereby incorporated by reference herein) that in practice this relation is non-linear for common microarray designs and typical clinical specimens, saturating at higher levels of mRNA abundance. As a result, the focus of research has shifted toward non-linear transformations that compensate for this effect.
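By way of illustration only, the mean-based standardization described above can be sketched as follows. This is a minimal sketch and not a definitive implementation: the function name `mean_scale`, the genes-by-arrays matrix layout, and the example values are assumptions made for illustration, and the mean is taken over the log-expression values of each array.

```python
import numpy as np

def mean_scale(log_expr):
    """Divide each array's log-expression values by that array's mean.

    log_expr: 2-D array, rows = genes, columns = microarrays (assumed layout).
    """
    log_expr = np.asarray(log_expr, dtype=float)
    # One mean per microarray (column), kept 2-D so it broadcasts over genes.
    array_means = log_expr.mean(axis=0, keepdims=True)
    return log_expr / array_means

# Hypothetical replicate arrays: the second is the first scaled by a constant.
arrays = np.array([[2.0, 4.0],
                   [4.0, 8.0],
                   [6.0, 12.0]])
scaled = mean_scale(arrays)
```

In this hypothetical example the two arrays differ only by a constant factor, so the scaling makes them agree. When the relation between abundance and signal saturates, a single per-array factor of this kind cannot bring replicate arrays into agreement.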
Earlier approaches have used the notion of housekeeping genes to effect the transformation. See, for example, Kohane et al., 2003, Microarrays for Integrative Genomics, The MIT Press. This method makes the fundamental assumption that genes with similar levels of expression are affected in similar ways by the obscuring variations. This idea is the basis for leading methods of microarray standardization, including quantile normalization (Bolstad et al., 2003, Bioinformatics 19, 185-193, which is hereby incorporated by reference herein) and invariant set normalization (Li et al., 2003, The Analysis of Gene Expression Data: Methods and Software, Springer, pp. 120-141, which is hereby incorporated by reference herein). Quantile normalization considers a set of arrays and normalizes each against all others such that the quantiles of all arrays agree after the normalization. Invariant set normalization normalizes a pair of arrays at a time such that the non-differentially expressed genes in the two arrays have similar ranks after the normalization.
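By way of illustration only, quantile normalization as characterized above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the cited reference: the function name `quantile_normalize`, the genes-by-arrays layout, and the example values are hypothetical; the reference distribution is taken to be the mean of each quantile across arrays; and tied values are not treated specially.

```python
import numpy as np

def quantile_normalize(expr):
    """Force every array (column) to share the same distribution of values.

    expr: 2-D array, rows = genes, columns = microarrays (assumed layout).
    """
    expr = np.asarray(expr, dtype=float)
    # Sort order of the genes within each array.
    order = np.argsort(expr, axis=0)
    sorted_vals = np.take_along_axis(expr, order, axis=0)
    # Reference distribution: mean of each quantile across all arrays.
    reference = sorted_vals.mean(axis=1)
    # Write the reference value for each rank back to the gene holding that rank.
    normalized = np.empty_like(expr)
    np.put_along_axis(normalized, order, reference[:, None], axis=0)
    return normalized

# Hypothetical data: 4 genes measured on 3 arrays, no tied values.
expr = np.array([[5.0, 4.0, 3.0],
                 [2.0, 1.0, 4.0],
                 [3.0, 6.0, 6.0],
                 [4.0, 2.0, 8.0]])
normalized = quantile_normalize(expr)
```

After this transformation every column contains the same set of values, while the rank of each gene within its own array is unchanged.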
Earlier standardization approaches that make use of housekeeping genes are further based on the fundamental assumption that there exist housekeeping genes, defined as genes participating in fundamental cell processes, which have a well-understood level of expression across a wide variety of cell types and conditions. The existence of housekeeping genes and their utility in microarray studies have been recognized previously. See, for example, Warrington et al., 2000, Physiol. Genomics 2, 143-147; and de Kok et al., 2005, Laboratory Investigation 85, 154-159, each of which is hereby incorporated by reference herein in its entirety.
Known approaches to standardization have proven to be useful in situations where the microarray data is from a single laboratory source. However, such standardization approaches have proven to be deficient when the microarray data is from multiple different sources. In particular, known standardization approaches have proven to be deficient when attempts are made to standardize data from a laboratory source that was not used in the standardization learning set. Given the above background, what is needed in the art are improved systems and methods for standardizing test microarray datasets.