Development of diagnostic and outcome prediction models and discovery from DNA microarray data is of great interest in bioinformatics and medicine. Diagnostic models from gene expression data go beyond traditional histopathology and can provide accurate, resource-efficient, and replicable diagnosis. (See, Golub T R, Slonim D K, Tamayo P, Huard C, Gaasenbeek M, Mesirov J P, Coller H, Loh M L, Downing J R, Caligiuri M A, Bloomfield C D, Lander E S, “Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.” Science, 1999 Oct. 15; 286(5439):531-7.) Furthermore, biomarker discovery in high-dimensional microarray data facilitates discoveries about the biology. (See, Balmain A, Gray J, Ponder B, “The genetics and genomics of cancer.” Nat. Genet. 2003 March; 33 Suppl:238-44. Review.)
Building classification models from microarray gene expression data has three challenging components: collection of samples, assaying, and statistical analysis. A typical statistical analysis process takes from a few weeks to several months and involves interaction among many specialists: clinical researchers, statisticians, bioinformaticians, and programmers. As a result, statistical analysis is a serious bottleneck in the development of molecular microarray-based diagnostic, prognostic, or individualized treatment models (also commonly referred to as “personalized medicine”).
Even if the long duration and high expense of the statistical analysis process described above are considered acceptable, its results frequently suffer from two major pitfalls. First, as documented in many published studies, analyses are affected by the problem of overfitting; that is, creating predictive models that may not generalize well to new data from the same disease types and data distribution despite excellent performance on the training set. Since many algorithms are highly parametric and datasets consist of a relatively small number of high-dimensional samples, it is easy to overfit both the classifiers and the gene selection procedures, especially when using intensive model search and powerful learners. In a recent meta-analytic assessment of 84 published microarray cancer outcome predictive studies (see, Ntzani E E, Ioannidis J P. “Predictive ability of DNA microarrays for cancer outcomes and correlates: an empirical assessment.” Lancet. 2003 Nov 1, 362(9394): 1439-44.), it was found that only 26% of studies in this domain attempted independent validation or cross-validation of their findings. Thus it is doubtful whether these models will generalize well to unseen patients. The second methodological problem is underfitting, which results in classifiers that do not perform optimally because of a limited search in the space of classification models. In particular, this is manifested by the application of a specific learning algorithm without consideration of alternatives, or by the use of parametric learners with unoptimized default parameter values (i.e., without systematically searching for the best parameters).
Sixteen software systems currently available for supervised analysis of microarray data are identified in Appendix A. However, all of the identified systems have several of the following limitations. First, none of the systems automatically optimizes the parameters and the choice of both classification and gene selection algorithms (also known as model selection) while simultaneously avoiding overfitting. The user of these systems is left with two choices: either to avoid rigorous model selection and possibly discover a suboptimal model, or to experiment with many different parameters and algorithms and select the model with the highest cross-validation performance. The latter is subject to overfitting, primarily due to multiple testing, since parameters and algorithms are selected after all the testing sets in cross-validation have been seen by the algorithms. (See, Statnikov A, Aliferis C F, Tsamardinos I, Hardin D, Levy S, “A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis.” Bioinformatics, 2005 Mar. 1; 21(5):631-43.) Second, a typical software system either offers an overabundance of algorithms or offers algorithms with unknown performance. Thus it is not clear to the user how to choose an optimal algorithm for a given data analysis task. Third, the software systems address the needs of experienced analysts. However, there is a need for such systems to be usable, with good results, by users who know little about data analysis (e.g., biologists and clinicians).
There is also a generic machine learning environment, YALE, that allows specification and execution of different chains of steps for data analysis, especially feature selection and model selection, and multistrategy learning. (See, Ritthoff O, et al., “Yale: Yet Another Machine Learning Environment”, LLWA 01—Tagungsband der GI-Workshop-Woche Lernen—Lehren—Wissen—Adaptivität, No. Nr. 763, pages 84-92, Dortmund, Germany, 2001.) In particular, this environment allows selection of models by cross-validation and estimation of performance by nested cross-validation. However, the principal difference between YALE and the invention is that YALE is not a specific method but rather a high-level programming language that potentially allows implementation of the invention in the same generic sense that a general-purpose programming language can be used to implement any computable functionality. The existing version, YALE 3.0, is not packaged with a ready-to-use implementation of the invention.
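By way of illustration only, the distinction drawn above between selecting a model by cross-validation and estimating its performance by nested cross-validation may be sketched in Python as follows. This sketch is not taken from the invention, from YALE, or from any of the cited systems. The synthetic data, the k-nearest-neighbor classifier, the candidate parameter values, and all function names (make_data, knn_predict, folds, cv_accuracy, nested_cv) are assumptions introduced solely for this example. The point it shows is that the parameter is chosen using only the training portion of each outer fold, so the outer testing set plays no part in model selection.

```python
import random


def make_data(n_samples=40, n_genes=5, seed=0):
    """Synthetic two-class data; only gene 0 carries signal (illustrative assumption)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


def knn_predict(train_X, train_y, x, k):
    """Majority vote among the k nearest training samples (squared Euclidean distance)."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, x)), label)
        for row, label in zip(train_X, train_y)
    )
    votes = sum(label for _, label in dists[:k])
    return 1 if 2 * votes > k else 0


def folds(indices, n_folds):
    """Split a list of sample indices into n_folds disjoint testing sets."""
    return [indices[i::n_folds] for i in range(n_folds)]


def accuracy_on(X, y, train, test, k):
    """Count correct k-NN predictions on `test` when trained on `train`."""
    train_X = [X[i] for i in train]
    train_y = [y[i] for i in train]
    return sum(knn_predict(train_X, train_y, X[i], k) == y[i] for i in test)


def cv_accuracy(X, y, indices, k, n_folds):
    """Plain cross-validated accuracy, using only the samples listed in `indices`."""
    correct = 0
    for test in folds(indices, n_folds):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        correct += accuracy_on(X, y, train, test, k)
    return correct / len(indices)


def nested_cv(X, y, candidate_ks, outer_folds=5, inner_folds=4):
    """Outer loop estimates performance; inner loop selects k on training data only."""
    indices = list(range(len(X)))
    correct = 0
    chosen = []
    for test in folds(indices, outer_folds):
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        # Model selection never sees the outer testing set.
        best_k = max(
            candidate_ks,
            key=lambda k: cv_accuracy(X, y, train, k, inner_folds),
        )
        chosen.append(best_k)
        correct += accuracy_on(X, y, train, test, best_k)
    return correct / len(indices), chosen


X, y = make_data()
estimate, chosen_ks = nested_cv(X, y, candidate_ks=(1, 3, 5))
print(estimate, chosen_ks)
```

By contrast, calling cv_accuracy on all samples for every candidate value and reporting the highest result would correspond to the second of the two choices described above, in which the selected parameter has already been exposed to every testing set.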
All of the above problems are solved by the various embodiments of the present invention described below.