Complete genomic sequence information is now available for a wide range of organisms. Consequently, the specific function of these organisms' genes can be studied using a variety of information-dense, high-throughput genomic analysis methods, for example, polynucleotide arrays. These arrays provide vast amounts of gene expression data corresponding to the differential abundance of specific mRNA transcripts in related biological samples. For example, transcript abundance may be compared in tissue samples from in vivo compound-treated animals as described in US application 2005/0060102 A1, published Mar. 17, 2005.
Gene expression data obtained using polynucleotide arrays are often associated with multiple dimensions. In some instances, the number of dimensions can correspond to the number of genes for which measurements are made, a number which is often in the thousands. Techniques for analyzing and interpreting these vast amounts of multi-dimensional data are therefore desirable. In particular, it is desirable to develop techniques to classify and identify relationships in multi-dimensional biological data. Various techniques for analyzing multi-dimensional biological data have been described. For example, WO 03/072065 describes methods for deriving signatures from large chemogenomic datasets using principal component analysis. Natsoulis et al. describe several methodologies for deriving linear classifiers from large chemogenomic datasets wherein the classifiers provide interpretable drug signatures with high classification performance (Natsoulis et al., Genome Res. May; 15 (5):724-36 (2005); see also: WO 2005/017807; and El-Ghaoui et al., Report # UCB/CSD-03-1279. Computer Science Division (EECS), University of California, Berkeley, Calif. (2003)). Bhattacharyya et al. describe a statistical approach for generating a linear classifier from expression profile data and identifying a small number of relevant features simultaneously (Bhattacharyya et al., Signal Processing 83: 729-743 (2003); see also, Bhattacharyya et al., J Comput Biol. 11 (6):1073-89 (2004)). U.S. Pat. No. 6,882,990 describes methods and systems for identifying patterns in biological datasets using multiple support vector machines.
Key to the usefulness of any biological classifier is its ability to prevent or minimize false positive and false negative results. However, because the biological datasets used to derive and train classifiers are typically highly unbalanced (i.e., they include many true negative samples and just a few true positive samples), standard classification techniques often yield classifiers with low accuracy when confronted with actual test data. Notwithstanding the previously described methods, there remains a significant need for robust yet simple classifiers that accurately predict a biological activity or a biological state (e.g., a disease diagnosis) based on non-ideal biological data.
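As a hedged illustration of one way an unbalanced dataset can mislead classifier evaluation, the following minimal sketch scores a trivial classifier that labels every sample negative. The dataset (95 negative and 5 positive samples) and the helper names `confusion_counts` and `summarize` are hypothetical and are introduced only for this example; they are not taken from any of the references cited above.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 0 and pred == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def summarize(y_true, y_pred):
    """Return overall accuracy alongside per-class rates."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "false_negatives": fn,
    }


# Hypothetical unbalanced dataset: 95 true negatives, 5 true positives.
y_true = [0] * 95 + [1] * 5

# A degenerate classifier that never predicts the positive class.
always_negative = [0] * len(y_true)

metrics = summarize(y_true, always_negative)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Under these assumptions, the overall accuracy appears high even though every true positive sample is reported as a false negative, which is why per-class measures such as sensitivity and specificity are commonly examined when the classes are unbalanced.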