Researchers use experimental data obtained from chemical arrays, such as microarrays, and from similar research test equipment to cure diseases, develop medical treatments, understand biological phenomena, and perform other tasks relating to the analysis of such data. However, the extraction of useful results from this raw data is restricted by physical limitations of, e.g., the nature of the tests and the testing equipment. All biological measurement systems leave their fingerprint on the data they measure, distorting the content of the data and thereby influencing the results of the desired analysis. Further, the systems for manufacturing and processing the arrays may also introduce systematic error. For example, systematic biases can distort microarray analysis results and thus conceal important biological effects sought by the researchers. Biased data can cause a variety of analysis problems, including signal compression, aberrant graphs, and significant distortions in estimates of differential expression. Types of systematic biases include gradient effects, differences in signal response between channels (e.g., for a two-channel system), variations in hybridization or sample preparation, pen shifts and subarray variation, and differences in RNA inputs.
Gradient effects or “trends” are those in which a pattern of expression signal intensity corresponds with specific physical locations on the substrate of the array, typically characterized by a smooth change in the expression values from one location on the array to another. Such effects can be caused by variations in array design, manufacturing, and/or hybridization procedures. FIG. 1 shows an example of distortion caused by gradient effects, i.e., a trend, where it can be observed that the signal intensity shows a gradually increasing pattern moving from a first edge 100 (see corresponding signals 200) to a second edge 102 (see corresponding signals 202) of the array. A multiplicative trend is formed when the distortion scales with the true signal level, so that noise is roughly proportional to the signal level of the feature. A hybridization dome or “hyb dome” is one example of a gradient or trend thought to arise from hybridization processing, where the signal around the perimeter of the array is significantly less than in the middle of the array because of the impact of the bubbler that circulates the target during hybridization. However, other shapes may result from non-uniform distribution of the target solution as it is mixed or moved during hybridization processing.
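By way of illustration only, a multiplicative trend of the hyb dome type may be sketched as follows. The function names, the quadratic gain profile, and the `edge_gain` parameter are assumptions introduced for this sketch and are not taken from any particular instrument or software package; the array is assumed to have at least two rows and two columns.

```python
def dome_gain(row, col, n_rows, n_cols, edge_gain=0.5):
    """Hypothetical smooth gain: 1.0 at the array center, edge_gain at the corners."""
    # Normalized offsets from the center: 0 at the center, +/-1 at the edges.
    r = (row - (n_rows - 1) / 2) / ((n_rows - 1) / 2)
    c = (col - (n_cols - 1) / 2) / ((n_cols - 1) / 2)
    # Squared distance from the center, scaled so that a corner gives 1.0.
    d2 = (r * r + c * c) / 2.0
    return 1.0 - (1.0 - edge_gain) * d2


def apply_multiplicative_trend(true_signal, edge_gain=0.5):
    """Scale each feature's true signal by a location-dependent gain."""
    n_rows, n_cols = len(true_signal), len(true_signal[0])
    return [
        [true_signal[i][j] * dome_gain(i, j, n_rows, n_cols, edge_gain)
         for j in range(n_cols)]
        for i in range(n_rows)
    ]


# A uniform 5x5 array: every feature has the same true signal.
bright = apply_multiplicative_trend([[1000.0] * 5 for _ in range(5)])
dim = apply_multiplicative_trend([[10.0] * 5 for _ in range(5)])

# The fractional loss at a corner is the same for bright and dim features,
# which is what makes the trend multiplicative rather than additive.
corner_ratio_bright = bright[0][0] / bright[2][2]
corner_ratio_dim = dim[0][0] / dim[2][2]
```

In this sketch the perimeter reads lower than the center by the same fraction regardless of the true signal level, so the absolute distortion is proportional to the signal of the feature.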
De-trending of array data is important not only for validating the data values within an array and for comparison of values within the array (intra-array comparisons), but also for valid comparison of data values between different arrays (inter-array comparisons).
Efforts at spatially detrending array data have been based either on statistical processing of log ratio values (signal ratios between first and second channels of a scanner reading the same array, or between two single-channel readings from two arrays) or on statistical processing of the signal values themselves. The latter is more difficult, since signals may vary over many orders of magnitude and skew the results of some statistical approaches. Log ratios between signals vary far less and should be centered around a log ratio value of zero, making it much easier to apply statistical techniques to the data in a reliable fashion.
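The contrast between raw signals and log ratios may be illustrated with a short sketch. The channel names and the signal values below are invented for illustration and do not come from any actual array; base 2 is assumed for the logarithm.

```python
import math
import statistics


def log_ratios(ch1, ch2, base=2.0):
    """Per-feature log ratio between two channels (or two single-channel arrays)."""
    return [math.log(a / b, base) for a, b in zip(ch1, ch2)]


# Hypothetical signals for four features, spanning several orders of magnitude.
red = [120.0, 1500.0, 18000.0, 250000.0]
green = [100.0, 1600.0, 17000.0, 260000.0]

lr = log_ratios(red, green)

# Raw signals differ by a factor of more than 1000 across features ...
signal_range = max(red) / min(red)
# ... while the log ratios occupy a narrow band around zero.
log_ratio_spread = max(lr) - min(lr)
log_ratio_center = statistics.mean(lr)
```

Features with equal signals in both channels give a log ratio of exactly zero, which is why undistorted, non-differentially expressed features are expected to cluster around zero.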
One such effort was made using a publicly available software package referred to as SNOMAD (Standardization and Normalization of Microarray Data), see Colantuoni et al., “SNOMAD (Standardization and Normalization of Micro Array Data): web-accessible gene expression data analysis”, Bioinformatics Applications Note, Vol. 18, no. 11, 2002, pp. 1540-1541. SNOMAD provides scripts in the R statistical language (r-project.org) that are used to generate Z-scores for normalization of variance in the gene expression values of a microarray. In order to correct for variance in gene expression ratios (y-axis) that is unequal across the range of gene expression levels (x-axis), each local mean adjusted log expression ratio (y-value) is standardized to an estimate of the standard deviation of the log ratio observations that share similar mean expression levels, i.e., those that are proximal on the x-axis as defined by a “span” parameter. This results in the generation of Z-scores in locally estimated standard deviation units, see Parmigiani et al., The Analysis of Gene Expression Data, Springer-Verlag New York, Inc. 2003, pp. 210-217. A robust local regression (“loess”) is used to calculate the local mean gene expression ratio as it varies across the range of gene expression intensity. The calculation of local mean ratios may not be effective for certain types of trends where signal values vary depending upon the location of a feature on the array (e.g., as in the case of a hyb dome, or other spatially related trends). Further, the scripts provided in SNOMAD are not easily integrated into other analysis software packages, such as feature extraction packages, and are therefore not helpful for automating feature extraction processes.
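The idea of a local Z-score may be sketched as follows. This is not the SNOMAD code: SNOMAD is described as using R scripts with a robust loess fit, whereas the sketch below substitutes a plain running window over the features sorted by mean intensity, purely to show the standardization step. The function name `local_z_scores` and the interpretation of `span` as a fraction of the features are assumptions of this sketch.

```python
import statistics


def local_z_scores(mean_intensity, log_ratio, span=0.5):
    """Standardize each log ratio against neighbors of similar mean intensity.

    Simplified stand-in for a loess-based approach: a running window of
    roughly span * n features, taken in order of mean intensity.
    """
    n = len(mean_intensity)
    if n < 3:
        raise ValueError("need at least 3 features")
    k = min(n, max(3, int(round(span * n))))  # features per window
    order = sorted(range(n), key=lambda i: mean_intensity[i])
    z = [0.0] * n
    for pos, i in enumerate(order):
        # Window of k neighbors around this feature, clipped at both ends.
        start = min(max(pos - k // 2, 0), n - k)
        window = [log_ratio[order[j]] for j in range(start, start + k)]
        local_mean = statistics.mean(window)
        local_sd = statistics.stdev(window)
        z[i] = (log_ratio[i] - local_mean) / local_sd if local_sd > 0 else 0.0
    return z


# Hypothetical data: log ratios are 10x noisier at low intensity than at high.
intensity = [1, 2, 3, 4, 5, 6, 7, 8]
ratios = [2.0, -2.0, 2.0, -2.0, 0.2, -0.2, 0.2, -0.2]
z = local_z_scores(intensity, ratios, span=0.5)
```

After standardization, the lowest-intensity and highest-intensity features in this example receive Z-scores of equal magnitude even though their raw log ratios differ tenfold. Because the neighbors are chosen by intensity and not by position on the substrate, a scheme of this kind does not see a location-dependent trend such as a hyb dome.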
Thus, there is a continuing need for spatial detrending algorithms, techniques and systems that rely upon data obtained from features across diverse locations of an array/substrate to provide more reliable spatial detrending of the signal data when that data is affected by the location of the features from which it has been extracted. There is also a need for spatial detrending algorithms, techniques and systems for detrending a gradient arising from any effect, provided that the distortions in signal responsible for the gradient are proportional to the signals at corresponding locations over the gradient.