Researchers use experimental data obtained from microarrays and other similar research test equipment to cure diseases, develop medical treatments, understand biological phenomena, and perform other tasks relating to the analysis of such data. However, the extraction of useful results from this raw data is restricted by physical limitations of, e.g., the nature of the tests and the testing equipment. All biological measurement systems leave their fingerprint on the data they measure, distorting the content of the data and thereby influencing the results of the desired analysis. For example, systematic biases can distort microarray analysis results and thus conceal important biological effects sought by the researchers. Biased data can cause a variety of analysis problems, including signal compression, aberrant graphs, and significant distortions in estimates of differential expression. Types of systematic biases include gradient effects, differences in signal response between channels (e.g., for a two-channel system), variations in hybridization or sample preparation, pen or nozzle shifts during array manufacture (e.g., using an inkjet printer), subarray variation, and differences in RNA inputs.
Gradient effects are those in which there is a pattern of expression signal intensity which corresponds with specific physical locations within the chip (e.g., array substrate) and which are characterized by a smooth change in the expression values from one end of the chip to another. This can be caused by variations in chip design, manufacturing, and/or hybridization procedures. FIG. 1 shows an example of distortion caused by gradient effects, where it can be observed that the signal intensity shows a gradually increasing pattern moving from a first edge 10 (see corresponding signals 20) to a second edge 12 (corresponding signals 22) of the chip.
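As a minimal illustration of what a gradient effect looks like numerically, the following sketch fits a straight line to signal intensity as a function of position across the chip and subtracts the fitted trend. This is only an assumed, simplified detrending approach for illustration and is not presented as the method of this disclosure; the function name `fit_gradient` and the sample data are hypothetical.

```python
def fit_gradient(positions, signals):
    """Ordinary least-squares fit of signal = intercept + slope * position."""
    n = len(positions)
    mean_x = sum(positions) / n
    mean_y = sum(signals) / n
    sxx = sum((x - mean_x) ** 2 for x in positions)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(positions, signals))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: a constant true signal of 100 units plus a smooth
# drift of 2 units per column from one edge of the chip to the other.
columns = list(range(10))
observed = [100.0 + 2.0 * c for c in columns]

slope, intercept = fit_gradient(columns, observed)

# Remove the fitted trend while preserving the mean signal level.
mean_column = sum(columns) / len(columns)
corrected = [y - slope * (c - mean_column) for c, y in zip(columns, observed)]
```

A nonzero fitted slope indicates a position-dependent bias; after the trend is removed, the corrected signals in this idealized example are constant across the chip.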
In dual-channel systems, the two dyes do not always perform equally efficiently at equivalent RNA concentrations or uniformly across the whole microarray. In particular, the red channel often demonstrates higher signal intensity than the green channel at higher RNA abundances.
Variations in hybridization and sample preparations can cause warpage to occur in the expression values in microarrays. This can prevent comparative analysis across batches of arrays and further distort analysis results.
Subarray variations are forms of systematic bias in which different probe subsets within the chip demonstrate significantly different performance characteristics. In particular, there may be multiple subsets that have different average signal intensities. This is sometimes referred to as “blocking” within the resultant array pattern, due to the block-like visual appearance that results when a subset includes probes that are adjacent to each other. These subarray variations may be related to individual pens/nozzles of an inkjet used to print the array, to other manufacturing component discreteness or boundaries (for example, when two array patterns are intersected to create higher probe density, there may be a shift between the two patterns), or to other process details, and are typically represented as ANOVA nominal variables.
Device distortions/variations aside, other problems facing researchers are the tasks of quality control and assessment of microarray measurements. Because they are often performed manually and heuristically, these tasks are time-consuming, expensive, and prone to error.
Because of the large amount of data involved, inspection and review of microarray results is complex and tedious, requiring knowledge of multiple microarray technology platforms and manufacturing techniques. Review of microarray data is also time-consuming and costly because, using manual inspection, 40%-50% of all hybridized microarrays typically require at least one interpretation of the acceptability of the results. Thorough inspection at these levels becomes cost prohibitive as the number of hybridizations performed per week increases into the hundreds or thousands.
In addition, manual review of microarray data is imprecise and inconsistent. Agreement between research scientists is frequently less than 60%. To avoid systematic shifts in an inspector's judgment over time, inspectors must constantly be “re-calibrated” (i.e., re-trained) to their own previous judgments as well as to the judgments of others. Moreover, marginal cases are difficult to flag. As the volume of hybridizations increases, identification of marginal cases or close calls becomes difficult. These cases may require more detailed study or expert opinions to properly classify and quantify results. Lastly, heuristic thresholds are often set on quality control parameters. Thresholds on quality control parameters are frequently set independently for each parameter without statistical adjustment for the multiplicity of tests being performed. This leads to increased failure rates and increased costs.
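The effect of setting thresholds independently, without adjusting for the number of tests, can be sketched with a short calculation. The example below assumes, purely for illustration, that the quality control tests are statistically independent and each has the same false-failure probability; the function names and the values of `alpha` and `num_tests` are hypothetical. The Bonferroni correction shown is one standard multiplicity adjustment and is named here only as an example.

```python
def familywise_error_rate(alpha, num_tests):
    """Probability that an acceptable array fails at least one of
    num_tests independent checks, each with false-failure rate alpha."""
    return 1.0 - (1.0 - alpha) ** num_tests


def bonferroni_threshold(alpha, num_tests):
    """Per-test threshold that keeps the overall false-failure rate
    at or below alpha (Bonferroni correction)."""
    return alpha / num_tests


# Hypothetical settings: 10 quality control parameters, each tested at 0.05.
alpha = 0.05
num_tests = 10

unadjusted = familywise_error_rate(alpha, num_tests)
per_test = bonferroni_threshold(alpha, num_tests)
adjusted = familywise_error_rate(per_test, num_tests)
```

Under these assumptions, roughly 40% of acceptable arrays would fail at least one unadjusted check, whereas the adjusted per-test threshold keeps the overall false-failure rate below 5%.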
Analysis of variance (ANOVA) is used to uncover the main and interaction effects of categorical independent variables (sometimes referred to as “factors”) on a dependent variable, which is typically measured on an interval scale. A “main effect” is the direct effect of an independent variable on the dependent variable. An “interaction effect” is the joint effect of two or more independent variables on the dependent variable. A more extensive discussion of existing ANOVA techniques can be found at http://www2.chass.ncsu.edu/garson/pa765/anova.htm. A copy of this document, which was downloaded on Nov. 5, 2004 and is being submitted as a disclosure document in this application, is hereby incorporated herein, in its entirety, by reference thereto.
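For concreteness, the simplest case, a one-way ANOVA with a single factor, can be sketched as follows. The F statistic compares the variance between group means to the variance within groups. The function name `one_way_anova_f` and the sample data are hypothetical and serve only to illustrate the textbook calculation.

```python
def one_way_anova_f(groups):
    """Return the F statistic for a one-way ANOVA.

    groups: list of lists, one list of observations per factor level.
    """
    all_values = [v for g in groups for v in g]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)


# Hypothetical measurements at three levels of a single factor.
levels = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_statistic = one_way_anova_f(levels)
```

A large F statistic suggests that the factor has a main effect on the dependent variable; assessing significance would additionally require comparison against the F distribution with the corresponding degrees of freedom.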
Multi-factor, or N-way, ANOVA addresses N independent factor variables. As the number of independent variables increases, the number of potential interactions proliferates. For example, the consideration of two independent variables A and B involves only a single first-order interaction (AB). Consideration of three independent variables A, B and C requires analysis of three first-order interactions (AB, AC, BC) and one second-order interaction (ABC), or four interactions in all. Consideration of four independent variables requires a consideration of six first-order (AB, AC, AD, BC, BD, CD), four second-order (ABC, ABD, ACD, BCD), and one third-order (ABCD) interaction, or eleven interactions in all. In general, N independent variables give rise to 2^N − N − 1 potential interactions. As the number of interactions increases, it becomes increasingly difficult to calculate the model estimates. Even for models having only a few ANOVA variables, the same problems may be experienced when there are many nominal or ordinal levels to one or more of the variables.
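The growth in the number of potential interactions can be checked by direct enumeration. The following sketch lists every combination of two or more factors and compares the total with the closed form 2^N − N − 1 (all subsets of N factors, less the N single factors and the empty set). The function names are hypothetical.

```python
from itertools import combinations


def enumerate_interactions(factors):
    """Map each interaction size k (2..N) to the list of k-factor interactions."""
    return {
        k: ["".join(combo) for combo in combinations(factors, k)]
        for k in range(2, len(factors) + 1)
    }


def count_interactions(num_factors):
    """Closed-form count of interactions among num_factors factors."""
    return 2 ** num_factors - num_factors - 1


four_factor = enumerate_interactions(["A", "B", "C", "D"])
# four_factor[2] holds the pairwise terms, four_factor[3] the three-factor
# terms, and four_factor[4] the single four-factor term.
total_four = sum(len(terms) for terms in four_factor.values())

# Totals for 2 through 10 factors, showing the exponential growth.
totals = {n: count_interactions(n) for n in range(2, 11)}
```

With ten factors the count already exceeds one thousand, which illustrates why fitting a full model quickly becomes impractical.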
Analysis of variance in biological applications, for example, typically presents very large numbers of variables, making it extremely difficult, time-consuming and costly to calculate results. Depending upon the characterizations of the variables, it may be practically impossible to apply current analysis of variance techniques and expect to achieve results because of the enormity of the calculations required. For example, in applications such as genomics, it is not atypical to be faced with data-analysis problems involving multiple categorical variables that can have tens of thousands of levels, such as gene names or probe designations, as just two examples of many. Interactions among such variables may produce tens of millions of data columns. Such columns for this example, however, tend to be sparse in that the variables mentioned can be represented as nominal variables. Large matrices of columns sparsely populated by data can typically be reduced to smaller matrices, especially when there is some pattern in the sparsely populated data columns that can be identified. The smaller matrices are much more computationally manageable, making analysis of variance techniques practically usable when only nominal variables are being considered.
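The sparsity of nominal-variable columns can be illustrated with a small sketch. When a nominal variable is encoded as one 0/1 indicator column per level, each row has exactly one nonzero entry among those columns, so only the row indices of the nonzero entries need to be stored. The function names and the tiny label set below are hypothetical; real data sets of the kind described would have tens of thousands of levels.

```python
def indicator_columns(labels):
    """Sparse encoding of a nominal variable: map each level to the list
    of row indices at which its 0/1 indicator column equals 1."""
    columns = {}
    for row, label in enumerate(labels):
        columns.setdefault(label, []).append(row)
    return columns


def density(labels):
    """Fraction of nonzero entries in the full rows-by-levels 0/1 matrix."""
    columns = indicator_columns(labels)
    nonzeros = sum(len(rows) for rows in columns.values())
    return nonzeros / (len(labels) * len(columns))


# Hypothetical probe designations for eight measurements.
probes = ["g1", "g2", "g3", "g4", "g1", "g2", "g3", "g4"]
sparse = indicator_columns(probes)
fill = density(probes)
```

Because each row contributes one nonzero entry, the fill fraction equals one divided by the number of levels; with tens of thousands of levels, nearly all entries of the dense matrix would be zero.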
However, when considering factors that reflect process conditions associated with processing the probes, these factors are represented not by categorical (class) variables, such as ordinal or nominal variables, but by scaled variables with a continuum of levels, the data columns of which are typically non-sparse. The inclusion of non-sparse data columns in a matrix to be analyzed by least squares optimization makes it impossible to reduce the matrix size by any effective amount, leaving enormous numbers of calculations to be performed. Statistical analysis by well-known software products, such as JMP (http://www.jmp.com/) and SAS, for example, can take days to process such data due to limitations in memory and CPU speed on a typical computer running the software.
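The effect of a non-sparse column on a least squares problem can be seen in the matrix of the normal equations, X^T X. In the following simplified sketch, which assumes a single nominal factor and one scaled covariate, the indicator columns alone give a diagonal X^T X (so each level can be solved independently), whereas adding the covariate couples every level to the covariate. The variable names, the `temperature` covariate, and its values are hypothetical, and dense Python lists are used only to keep the example short.

```python
def gram_matrix(columns):
    """Return X^T X for a design matrix given as a list of dense columns."""
    return [
        [sum(a * b for a, b in zip(ci, cj)) for cj in columns]
        for ci in columns
    ]


def off_diagonal_nonzeros(matrix):
    """Count nonzero entries that lie off the main diagonal."""
    return sum(
        1
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
        if i != j and value != 0
    )


# Hypothetical nominal factor (probe designation) for six measurements.
labels = ["p1", "p2", "p3", "p1", "p2", "p3"]
levels = sorted(set(labels))
indicators = [
    [1.0 if label == level else 0.0 for label in labels] for level in levels
]

# Hypothetical scaled covariate (e.g., a process temperature), nonzero in every row.
temperature = [20.5, 21.0, 19.5, 22.0, 20.0, 21.5]

nominal_only = gram_matrix(indicators)
with_covariate = gram_matrix(indicators + [temperature])

coupling_nominal = off_diagonal_nonzeros(nominal_only)
coupling_covariate = off_diagonal_nonzeros(with_covariate)
```

In this example the nominal-only matrix has no off-diagonal entries, while the covariate adds a fully populated row and column, which is the structure that prevents the simple reductions available for purely nominal data.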
There is a continuing need for tools to characterize chemical arrays for quality control purposes, as well as for comparing results among arrays.