1. Technical Field
The present invention is generally directed to the field of processing genomic data. More specifically, the invention relates to a system and method for performing statistical outlier detection for gene expression microarray data.
2. Description of the Related Art
In genomics research, gene expression arrays are a breakthrough technology enabling the simultaneous measurement of the transcription of tens of thousands of genes. Because the numerical data associated with expression arrays usually arises from image processing, data quality is an important issue.
Two recent scientific articles, Schadt et al. (2000) and Li and Wong (2001), discuss this data quality issue for one of the most popular expression array platforms, the Affymetrix GeneChip™. For example, they point out that outlier problems may arise from particle contamination (see FIG. 1 in Schadt et al. (2000)) or scratch contamination (see FIG. 5 in Li and Wong (2001)). They indicate that improper statistical handling of aberrant or outlying data points can mislead analysis results.
Li and Wong propose an outlier detection method based on a multiplicative statistical model. While this approach is useful, it is limited to Affymetrix data and lacks the flexibility to accommodate more complex experimental designs. The multiplicative model used by Li and Wong is as follows:
$$Y_{ij} = \theta_i \Phi_j + \varepsilon_{ij}, \qquad \sum_j \Phi_j^2 = J, \qquad \varepsilon_{ij} \sim N(0, \sigma^2). \qquad (1)$$
$Y_{ij}$ is the intensity measurement of the $j$th probe in the $i$th array, $\theta_i$ is the $i$th fixed array effect, $\Phi_j$ is the $j$th fixed probe effect, and $J$ is the number of probes. The $\varepsilon_{ij}$ are assumed to be independent, identically distributed normal random variables with mean 0 and variance $\sigma^2$. Assuming that either the $\Phi_j$ or the $\theta_i$ are known, the following conditional means and standard errors can be derived and used in the Li and Wong method, where $\hat{Y}_{ij}$ denotes the fitted value of $Y_{ij}$ under model (1):

$$\tilde{\theta}_i = \frac{\sum_j Y_{ij}\,\Phi_j}{\sum_j \Phi_j^2}, \qquad \tilde{\Phi}_j = \frac{\sum_i Y_{ij}\,\theta_i}{\sum_i \theta_i^2},$$

$$\mathrm{StdErr}(\tilde{\theta}_i) = \sqrt{\frac{\sum_j \left(Y_{ij} - \hat{Y}_{ij}\right)^2}{J(J-1)}}, \qquad \mathrm{StdErr}(\tilde{\Phi}_j) = \sqrt{\frac{\sum_i \left(Y_{ij} - \hat{Y}_{ij}\right)^2}{K(K-1)}}, \qquad K = \sum_i \tilde{\theta}_i^2.$$
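The conditional estimates can be illustrated with a short NumPy sketch. This is only an illustration under stated assumptions, not the Li and Wong implementation: the function names `conditional_theta` and `conditional_phi`, the layout of `Y` (rows are arrays, columns are probes), and the small noise-free data set are all hypothetical. The standard errors are taken as square roots of the scaled residual sums of squares, and the denominator for the probe effects uses K(K-1) as written in the text.

```python
import numpy as np

def conditional_theta(Y, phi):
    """Array effects and their standard errors, treating phi as known.

    Y has shape (I, J): rows are arrays, columns are probes.
    """
    J = Y.shape[1]
    theta = (Y @ phi) / np.sum(phi ** 2)
    resid = Y - np.outer(theta, phi)          # Y_ij minus fitted value
    se = np.sqrt(np.sum(resid ** 2, axis=1) / (J * (J - 1)))
    return theta, se

def conditional_phi(Y, theta):
    """Probe effects and their standard errors, treating theta as known."""
    K = np.sum(theta ** 2)
    phi = (theta @ Y) / K
    resid = Y - np.outer(theta, phi)
    # Denominator K(K-1) follows the text as written.
    se = np.sqrt(np.sum(resid ** 2, axis=0) / (K * (K - 1)))
    return phi, se

# Hypothetical noise-free data: 3 arrays, 4 probes, with sum(phi^2) = J = 4.
phi_true = np.array([0.5, 1.0, 1.5, 0.5])
phi_true *= np.sqrt(4 / np.sum(phi_true ** 2))
theta_true = np.array([100.0, 200.0, 300.0])
Y = np.outer(theta_true, phi_true)

theta_hat, se_theta = conditional_theta(Y, phi_true)
phi_hat, se_phi = conditional_phi(Y, theta_true)
```

Because the example data contain no noise, each conditional estimate reproduces the effect used to build `Y`, and the residual-based standard errors are numerically zero.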
The following is a description of the Li and Wong outlier detection approach:
1. Check array outliers: Fit model (1) and calculate the conditional standard errors for all $\theta_i$. Designate an array as an array outlier if either of the following criteria is met:
i. The associated $\theta_i$ has a standard error larger than three times the median standard error of all $\theta_i$.
ii. The associated $\theta_i$ has a dominating magnitude, with squared value larger than 0.8 times the sum of squares of all $\theta_i$.
Select out those array outliers and go to step 2.
2. Check probe outliers: Fit model (1) and calculate the conditional standard errors for all $\Phi_j$. Designate a probe as a probe outlier if either of the following criteria is met:
i. The associated $\Phi_j$ has a standard error larger than three times the median standard error of all $\Phi_j$.
ii. The associated $\Phi_j$ has a dominating magnitude, with squared value larger than 0.8 times the sum of squares of all $\Phi_j$.
Select out those probe outliers and go to step 3.
3. Iterate steps 1 and 2 until no further array or probe outliers are selected.
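The three steps above can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the published implementation: model (1) is fitted here by a simple alternating least squares loop, and the function names, the iteration limits, the minimum-size guard, and the simulated data are all hypothetical. Within a single fit, every array (or every probe) shares the same standard-error denominator, so the comparison against three times the median is made directly on the residual norms.

```python
import numpy as np

def fit_rank_one(Y, n_iter=100):
    """Fit Y_ij ~ theta_i * phi_j by alternating least squares, sum(phi^2) = J."""
    J = Y.shape[1]
    phi = np.ones(J)
    for _ in range(n_iter):
        theta = (Y @ phi) / np.sum(phi ** 2)
        phi = (theta @ Y) / np.sum(theta ** 2)
        phi *= np.sqrt(J / np.sum(phi ** 2))   # enforce the constraint in model (1)
    theta = (Y @ phi) / np.sum(phi ** 2)
    return theta, phi

def flag(effects, resid_norm, se_factor=3.0, dominance=0.8):
    """Apply criteria i (large standard error) and ii (dominating magnitude)."""
    big_se = resid_norm > se_factor * np.median(resid_norm)
    dominant = effects ** 2 > dominance * np.sum(effects ** 2)
    return big_se | dominant

def detect_outliers(Y, max_rounds=20):
    """Return boolean masks of array outliers and probe outliers."""
    keep_a = np.ones(Y.shape[0], dtype=bool)
    keep_p = np.ones(Y.shape[1], dtype=bool)
    for _ in range(max_rounds):
        found = False
        # axis=1 sums residuals over probes (step 1), axis=0 over arrays (step 2).
        for axis, keep in ((1, keep_a), (0, keep_p)):
            if keep_a.sum() < 3 or keep_p.sum() < 3:
                return ~keep_a, ~keep_p        # too little data left to refit
            sub = Y[np.ix_(keep_a, keep_p)]
            theta, phi = fit_rank_one(sub)
            resid = sub - np.outer(theta, phi)
            norm = np.sqrt(np.sum(resid ** 2, axis=axis))
            bad = flag(theta if axis == 1 else phi, norm)
            if bad.any():
                keep[np.flatnonzero(keep)[bad]] = False
                found = True
        if not found:                          # step 3: stop when nothing new is selected
            break
    return ~keep_a, ~keep_p

# Hypothetical simulated data: 10 arrays, 16 probes, array 3 heavily contaminated.
rng = np.random.default_rng(0)
theta = np.linspace(800.0, 1200.0, 10)
phi = np.linspace(0.5, 1.5, 16)
phi *= np.sqrt(16 / np.sum(phi ** 2))
Y = np.outer(theta, phi) + rng.normal(0.0, 20.0, size=(10, 16))
Y[3] += rng.normal(0.0, 500.0, size=16)

array_out, probe_out = detect_outliers(Y)
```

The masks `array_out` and `probe_out` mark the arrays and probes that were set aside. The refit after each removal reflects the alternation between steps 1 and 2.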
In accordance with the disclosure below, a computer-implemented method and system are provided for detecting outliers in microarray data. A mixed linear statistical model is used to generate predictions based upon the received microarray data. Residuals are generated by subtracting model-based predictions from the original microarray sample data. Statistical tests are performed for residuals by adding covariates to the mixed model and testing their significance. Data from the microarrays are designated as outliers based upon the tested significance.
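The idea of testing a residual by adding a covariate can be illustrated with a deliberately simplified sketch. It is not the disclosed method: an ordinary fixed-effects least squares model stands in for the mixed linear model, the added covariate is assumed to be an indicator for a single observation, and the function name `mean_shift_t`, the two-group design, the data values, and the cutoff of 3 are all hypothetical.

```python
import numpy as np

def mean_shift_t(X, y, k):
    """t-statistic of an indicator covariate for observation k added to the model."""
    n = len(y)
    d = np.zeros(n)
    d[k] = 1.0                                  # covariate that isolates observation k
    Xa = np.column_stack([X, d])
    beta, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ beta                       # data minus model-based predictions
    dof = n - Xa.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(Xa.T @ Xa)
    return beta[-1] / np.sqrt(cov[-1, -1])

# Hypothetical log-intensities for one gene: six arrays per treatment group.
group = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1], dtype=float)
y = np.array([8.1, 7.9, 8.0, 8.2, 7.8, 8.0, 9.1, 8.9, 9.0, 9.2, 8.8, 9.0])
y[4] += 3.0                                     # inject one aberrant measurement
X = np.column_stack([np.ones_like(group), group])

t_scores = np.array([mean_shift_t(X, y, k) for k in range(len(y))])
outliers = np.abs(t_scores) > 3.0               # assumed cutoff, for illustration only
```

In a mixed model the same test would also account for random effects such as array-to-array variation, which this fixed-effects stand-in ignores.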