Researchers use experimental data obtained from chemical arrays such as microarrays and other similar research test equipment to cure diseases, develop medical treatments, understand biological phenomena, and perform other tasks relating to the analysis of such data. However, the conversion of this raw data into useful results is restricted by physical limitations of, e.g., the nature of the tests and the testing equipment. All biological measurement systems leave their fingerprint on the data they measure, distorting the content of the data, and thereby influencing the results of the desired analysis. Further, the systems for manufacturing and processing the arrays may also induce systematic error.
Sources of background signal can inflate the signal intensities associated with certain of the features on an array. The background signal of an array may contribute systematic feature-position-related background intensity to the measured intensity data read from the array and may cause inaccurate determination of intensity levels and the gene expression levels or other measurements corresponding thereto, during analysis. For example, systematic biases can distort microarray analysis results and thus conceal important biological effects sought by the researchers. Biased data can cause a variety of analysis problems, including signal compression, aberrant graphs, and significant distortions in estimates of differential expression. Types of systematic biases include gradient effects, differences in signal response between channels (e.g., for a two channel system), variations in hybridization or sample preparation, pen shifts and subarray variation, and differences in RNA inputs.
Gradient effects or “trends” are those in which there is a pattern of expression signal intensity which corresponds with specific physical locations on the substrate of the array and which may typically be characterized by a smooth change in the expression values from one location on the array to another. This can be caused by variations in array design, manufacturing, and/or hybridization procedures. FIG. 1 shows an example of distortion caused by gradient effects, i.e., a trend, where it can be observed that the signal intensity shows a gradually increasing pattern moving from a first edge 100 (see corresponding signals at 200) to a second edge 102 (corresponding signals 202) of the array. An additive trend is formed when the trend values are added to the true signal level of the feature. A multiplicative trend is formed when the trend is a multiple of the true signal level, so that the resulting distortion is roughly proportional to the signal level of the feature. Another example of a gradient effect is a hybridization dome or “hyb dome”, which is a gradient or trend thought to occur from hybridization processing, where the signal around the perimeter of the array is significantly less than in the middle of the array, because of the impact of the bubbler that circulates the target during hybridization.
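The distinction between additive and multiplicative trends can be illustrated with a minimal sketch. The array dimensions, signal levels, and ramp amplitudes below are hypothetical values chosen only for illustration, and the function name `apply_trend` is not taken from any existing package.

```python
import numpy as np

def apply_trend(true_signal, gradient, kind):
    """Return the observed signal after a spatial trend is applied.

    additive:       observed = true_signal + gradient
    multiplicative: observed = true_signal * gradient
    """
    if kind == "additive":
        return true_signal + gradient
    if kind == "multiplicative":
        return true_signal * gradient
    raise ValueError("kind must be 'additive' or 'multiplicative'")

# Hypothetical 4 x 6 array: even rows hold dim features, odd rows bright ones.
rows, cols = 4, 6
row_is_even = (np.arange(rows)[:, None] % 2 == 0)
true_signal = np.where(row_is_even, 100.0, 1000.0) * np.ones((rows, cols))

# Smooth ramp from the first edge (0.0) to the second edge (1.0) of the array.
ramp = np.linspace(0.0, 1.0, cols)[None, :]

additive = apply_trend(true_signal, 50.0 * ramp, "additive")
multiplicative = apply_trend(true_signal, 1.0 + 0.5 * ramp, "multiplicative")

# Distortion relative to the true signal, per feature.
additive_error = additive - true_signal
multiplicative_error = multiplicative - true_signal
```

In this sketch the additive distortion at the second edge is the same for dim and bright features, whereas the multiplicative distortion scales with the true signal level of each feature.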
Detrending of array data is important not only for validating the data values within an array and for comparison of values within the array (intraarray comparisons), but also for valid comparison of data values between different arrays (interarray comparisons).
Efforts at spatially detrending array data have been made based on statistical processing of log ratio values (signal ratios between first and second channels of a scanner reading the same array, or between two single channel readings from two arrays) or on statistical processing of the signal values themselves. The latter is more difficult since signals may vary over many orders of magnitude and skew the results for some statistical approaches. Log ratios between signals vary less and should be centered around a log ratio value of zero, making it much easier to apply statistical techniques to the data in a reliable fashion.
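As a small illustration of why log ratios are easier to handle than raw signals, the following sketch computes base-2 log ratios for a hypothetical two-channel data set. The channel names `red` and `green`, the signal values, and the choice of base 2 are assumptions made for the example, and the signals are assumed to be strictly positive.

```python
import numpy as np

def log_ratio(red, green, base=2.0):
    """Log ratio of two channels; inputs are assumed strictly positive."""
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    return (np.log(red) - np.log(green)) / np.log(base)

# Hypothetical signals spanning four orders of magnitude in the green channel.
green = np.array([10.0, 100.0, 1000.0, 10000.0, 100000.0])
# Red channel: unchanged, doubled, halved, unchanged, doubled.
red = green * np.array([1.0, 2.0, 0.5, 1.0, 2.0])

lr = log_ratio(red, green)

# Dynamic range of the raw signals versus spread of the log ratios.
signal_range = green.max() / green.min()
log_ratio_spread = lr.max() - lr.min()
```

Here the raw signals differ by a factor of 10,000, while the log ratios stay within one unit of zero regardless of how bright the underlying feature is.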
One such effort was made using a publicly available software package referred to as SNOMAD (Standardization and Normalization of Microarray Data), see Colantuoni et al., “SNOMAD (Standardization and Normalization of MicroArray Data): web-accessible gene expression data analysis”, Bioinformatics Applications Note, Vol. 18, no. 11, 2002, pp. 1540-1541. SNOMAD provides scripts in the R statistical language (www.r-project.org) that are used to generate Z-scores for normalization of variance in the gene expression values of a microarray. In order to correct for variance in gene expression ratios (y-axis) that is unequal across the range of gene expression levels (x-axis), each local mean adjusted log expression ratio (y-value) is standardized to the estimate of the standard deviation of log ratio observations that share similar mean expression levels, as identified by being proximal on the x-axis, as defined by a “span” parameter. This results in the generation of Z-scores in locally estimated standard deviation units, see Parmigiani et al., The Analysis of Gene Expression Data, Springer-Verlag New York, Inc. 2003, pp. 210-217. A robust local regression (“loess”) is used to calculate the local mean gene expression ratio as it varies across the range of gene expression intensity. The calculation of local mean ratios may not be effective for certain types of trends where signal values vary depending upon the location of a feature on the array (e.g., as in the case of a hyb dome, or other spatially related trends). Further, the scripts provided in SNOMAD are not easily integrated into other analysis software packages, such as feature extraction packages, and are therefore not helpful for automating feature extraction processes.
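The idea of a Z-score in locally estimated standard deviation units can be sketched as follows. This is not the SNOMAD implementation: a simple moving window over features sorted by mean log intensity stands in for the loess fit, and the function name `local_z_scores`, the simulated data, and the `span` value are assumptions made for illustration.

```python
import numpy as np

def local_z_scores(mean_log_intensity, log_ratio, span=0.3):
    """Standardize each log ratio by the mean and SD of its intensity neighbours.

    A moving window over features sorted by intensity replaces loess here;
    `span` is the fraction of all features included in each window.
    """
    x = np.asarray(mean_log_intensity, dtype=float)
    y = np.asarray(log_ratio, dtype=float)
    n = x.size
    k = max(3, int(round(span * n)))          # features per window
    order = np.argsort(x)                     # neighbours are adjacent after sorting
    y_sorted = y[order]
    z_sorted = np.empty(n)
    for i in range(n):
        lo = max(0, min(i - k // 2, n - k))   # keep the window inside the data
        window = y_sorted[lo:lo + k]
        z_sorted[i] = (y_sorted[i] - window.mean()) / window.std(ddof=1)
    z = np.empty(n)
    z[order] = z_sorted                       # restore the original feature order
    return z

# Simulated data: log-ratio noise is larger for dim features than bright ones.
rng = np.random.default_rng(0)
x = rng.uniform(6.0, 16.0, 2000)
sd = 1.5 - 0.08 * x
y = rng.normal(0.0, sd)

z = local_z_scores(x, y, span=0.1)
low, high = x < 11.0, x >= 11.0
```

After standardization, the spread of `z` is similar in the low-intensity and high-intensity halves even though the spread of `y` is not. Because the neighbourhood is defined only along the intensity axis, a sketch of this kind, like the approach it illustrates, takes no account of where a feature sits on the array.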
Other efforts at removing systematic bias from a chemical array data set to effect spatial detrending involve collecting the feature signals as a subset of all feature signals on a chemical array, for each channel of the chemical array data set, wherein the intensities of the feature signals in the subset are each close to zero. These feature signals are then fit to an empirical model and the model is used to predict the local offset value for each feature signal on the array, wherein the local offset value corresponds to a feature with a zero level of biological signal. Thus, by removing the local offset value from each feature signal (which may vary depending upon the location of the particular feature on the array), the resultant offset-subtracted feature signals are intended to be true measurements of the chemical or biological entity that the features are designed to measure. This approach typically uses a predetermined percentage of the signals at the lowest end of the intensity range to fit to the empirical model. As one example, the lowest 1% of the feature signal intensities are typically used. A problem with this approach is that the values of the lowest predetermined percentage of the signals vary depending upon the makeup of the features on the array and upon the sample that was hybridized to the array, from which the signals are read. For example, referring to FIG. 2, three histograms 202, 204 and 206 are plotted to represent the signal data from three hypothetical chemical arrays all having the same number of features. The intensities of the feature signals are plotted on the horizontal axis and intensity increases moving rightward. The number of features corresponding to each signal intensity is plotted along the vertical axis and increases in the upward direction.
In the first plot 202, the number of features having relatively low intensities is a smaller percentage of the overall number of features than is the case with the second plot 204. Further, the plot 206 shows that almost all features have very low or no signal, as the high intensity portion of the plot is very near or at zero number of features in this region. Using the above approach, if a fixed percentage of the lowest intensity signals is selected from each plot, for example, the lowest 1% of signal intensities, the selection for plot 202 includes intensities up to intensity value 212, whereas the selection for plot 204 includes intensities only up to intensity value 214, and the selection for plot 206 includes intensities only up to intensity value 216. Thus, it can be seen that the greater the relative overall percentage of relatively low intensity feature signals on the array, the lower is the estimate of the zero value for the signal data (correction for offset). Thus, this approach may have a tendency to overestimate or underestimate the offset (background noise) depending upon the makeup of the array being analyzed.
Put another way, if all of the signals in a lower peak, such as the lower peak shown in plot 202, 204 or 206, are signals with zero biological signal (i.e., signal from sample bound to a probe), then they are distributed in a Gaussian distribution about a peak value. The signals in the distribution are not all at the same value, and thus form the Gaussian distribution, because of the random, non-spatial noise introduced by the measurement system that measures the signals, thereby adding uncertainty to these signals. Ideally, a background subtraction method would find the “zero” signal level to be the center of the Gaussian peak. Because the Gaussian peak has a bell shape and a width, if the background subtraction method selects the dimmest or lowest 1% of the signals on the array, then the more probes that are in the Gaussian distribution, the further down the left tail of that lower peak distribution the selection based on the lowest 1% ends up. Thus, the further left the selection falls, the further it moves from the true “zero” level, as the signal level decreases moving leftward along the left tail of the distribution. Therefore, FIG. 2 illustrates that when the dimmest 1% of signals is used to estimate background noise, assuming a finite background noise, the zero signal level will be estimated at different levels depending upon the percentage of the signals that are contained in the group of data centered around the signal seen for features with no true biological signal.
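The dependence of the lowest-1% selection on array makeup can be reproduced with a short simulation. The sketch below is illustrative only: the offset, noise width, lognormal distribution of bright features, feature count, and the function names `simulate_array` and `dimmest_fraction_cutoff` are all assumptions, not values or methods taken from any actual array or software package.

```python
import numpy as np

def dimmest_fraction_cutoff(signals, fraction=0.01):
    """Upper intensity bound of the dimmest `fraction` of the features."""
    return np.quantile(signals, fraction)

def simulate_array(n_features, zero_fraction, rng, offset=50.0, noise_sd=5.0):
    """Simulate one array: a Gaussian zero-signal peak plus brighter features.

    zero_fraction is the share of features with no biological signal; these
    are centred on `offset`, which plays the role of the true "zero" level.
    """
    n_zero = int(round(zero_fraction * n_features))
    n_bright = n_features - n_zero
    zero_peak = rng.normal(offset, noise_sd, n_zero)
    bright = (offset
              + rng.lognormal(mean=6.0, sigma=1.0, size=n_bright)
              + rng.normal(0.0, noise_sd, n_bright))
    return np.concatenate([zero_peak, bright])

rng = np.random.default_rng(1)
true_offset = 50.0
zero_fractions = [0.05, 0.3, 0.9]   # three hypothetical array makeups

# Cutoff of the dimmest 1% for each makeup, all sharing the same true offset.
cutoffs = [
    dimmest_fraction_cutoff(
        simulate_array(50_000, f, rng, offset=true_offset), 0.01)
    for f in zero_fractions
]
```

Although every simulated array shares the same true offset, the cutoff moves further down the left tail of the zero-signal peak as the share of zero-signal features grows, so a fixed-percentage selection samples a different part of the peak on each array.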
In view of the existence of offset biases such as background signals, experimentalists, designers, and manufacturers of chemical arrays and chemical array data processing systems have recognized a need for reliable and efficient methods and systems for quantifying and removing systematic feature-position-related offset biases within a chemical array data set.