Microarray and gene-chip technologies provide an approach for characterizing transcriptional properties of thousands of genes and studying their interactions simultaneously under many different experimental conditions. However, in many applications the key problem has been statistical noise in the transcriptional data, which varies from experiment to experiment and is attributable to non-specific hybridization, cross-hybridization, competition, diffusion of the target on the surface, base-specific structural variations of the probe, etc. A better understanding of this noise can come from kinetic analysis of the base-pairing, denaturing, and diffusion processes. In the absence of detailed knowledge with which to deconvolve the measurement data, however, it is hard to distinguish properly between specific clusters of genes on the basis of expression intensity data. The purpose of identification methods (combined with normalization) is to compare expression intensities from multiple experiments and to distinguish between a stable subset of genes whose behavior can be expected to be already well modeled (so-called housekeeping genes, rank-invariant genes, or genes with constant expression) and a subset of genes deviating from the stable model (so-called non-housekeeping genes, regulated genes, or differentially expressed genes). See Yang et al., 2002, Proc. Natl. Acad. Sci. USA 100(3):1122-1127.
The identification process creates a statistical model of the “main bulk” of the genes (i.e., the stable subset), either through a global statistical analysis of the transcriptional expression intensities across all the data, or through a local statistical analysis in which similar statistics are computed as a function of the expression range. The genes that deviate from the statistics computed in this initial identification are then subjected to further analysis to determine their biological characteristics in response to the experimental condition. See, e.g., Bolstad et al., 2003, Bioinformatics 19(2):185-93.
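As an illustration only, the global and local variants of such an identification step might be sketched as follows. This is a minimal sketch under stated assumptions, not the method of the cited references: the choice of the median and the median absolute deviation (MAD) as the "main bulk" statistics, the cutoff of three scaled MADs, the fixed-size intensity bins, and the function names `flag_global` and `flag_local` are all hypothetical choices made for this example.

```python
import math
import statistics


def flag_global(log_ratios, threshold=3.0):
    """Global analysis (assumed form): model the main bulk of genes by the
    median and MAD of all log-ratios, then return the indices of genes
    lying more than `threshold` scaled MADs from the median."""
    center = statistics.median(log_ratios)
    mad = statistics.median([abs(r - center) for r in log_ratios])
    scale = 1.4826 * mad  # makes the MAD comparable to a standard deviation
    if scale == 0.0:
        return []  # no spread in the bulk, so nothing can be flagged
    return [i for i, r in enumerate(log_ratios)
            if abs(r - center) > threshold * scale]


def flag_local(cond_a, cond_b, bin_size=50, threshold=3.0):
    """Local analysis (assumed form): sort genes by mean log intensity,
    split them into bins of `bin_size` genes, and apply the same robust
    statistics within each bin, i.e. as a function of expression range."""
    log_ratio = [math.log2(b / a) for a, b in zip(cond_a, cond_b)]
    mean_log = [0.5 * math.log2(a * b) for a, b in zip(cond_a, cond_b)]
    order = sorted(range(len(log_ratio)), key=mean_log.__getitem__)
    flagged = []
    for start in range(0, len(order), bin_size):
        bin_idx = order[start:start + bin_size]
        hits = flag_global([log_ratio[i] for i in bin_idx], threshold)
        flagged.extend(bin_idx[h] for h in hits)
    return sorted(flagged)
```

Both functions expect strictly positive intensities for two conditions, with each gene at the same index in both inputs. The indices they return correspond to the genes that would be passed on for further analysis; in practice the bin size and cutoff would need to be tuned to the noise characteristics of the platform.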
There is a need for methods and systems that can identify differentially expressed genes from expression data in a data set, particularly from a data set containing data regarding genes expressed under different conditions. Such methods may also be useful for identifying outlying points in any type of statistical data set, where the identified outlying point may represent a meaningful distinction rather than statistical noise.