Recent advances in biological experimental techniques have dramatically increased the complexity and quantity of data being generated, and the technology for generating data has outpaced the technology for helping scientists comprehend the new information contained in the data. For example, gene expression profiling, metabolic profiling, protein expression profiling, and automated cell imaging techniques have created an explosion of data that is difficult to interpret, because the number of variables being measured and the sheer quantity of data generated render manual analysis impractical, if not impossible.
Given a large set of measurements on compounds, genes, proteins, etc., it is challenging to locate the subset of most interest in a particular experiment. Several statistical methods exist for analyzing such data. These include examining the measurements one at a time and choosing all of the measurements that are statistically significant; using discriminant analysis (or some other classification procedure) alone to choose the set of measurements that distinguishes between experimental groups; and using prior knowledge of the treatment mechanism to look for expected and inferred perturbations. None of the foregoing methods is fully effective, in that each generally provides either too much information to the user or not enough.
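By way of illustration only, the first of the foregoing approaches (examining the measurements one at a time and keeping every statistically significant one) might be sketched as follows. The source does not name a particular significance test; an exact two-sided permutation test on the difference in group means is assumed here purely for concreteness. The function names, the measurement names, the toy values, and the 0.05 threshold are all hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means.

    Assumed test for illustration; the source does not specify one.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Enumerate every way of relabeling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        # Small tolerance guards against floating-point round-off.
        if diff >= observed - 1e-9:
            extreme += 1
        count += 1
    return extreme / count


def screen_one_at_a_time(control, treated, alpha=0.05):
    """Return the names of measurements whose p-value falls below alpha."""
    return [
        name
        for name in control
        if exact_permutation_p(control[name], treated[name]) < alpha
    ]


# Hypothetical toy data: measurement name -> replicate values per group.
control = {
    "gene_A": [1.0, 1.1, 0.9, 1.2],
    "gene_B": [5.0, 5.2, 4.9, 5.1],
    "gene_C": [2.0, 2.1, 1.9, 2.2],
}
treated = {
    "gene_A": [2.0, 2.2, 1.9, 2.1],
    "gene_B": [5.1, 4.9, 5.0, 5.2],
    "gene_C": [3.1, 3.0, 2.9, 3.2],
}

selected = screen_one_at_a_time(control, treated)
print(selected)
```

In this sketch every measurement that passes the threshold is returned with equal standing and no correction for multiple testing is applied, so with thousands of measurements the selected list can become very long. This is one way such an approach may provide too much information to the user.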
The present invention addresses the problems associated with the analysis of complex datasets by providing methods that enable identification of the subset of data, within a larger dataset, that is of most interest for further analysis, and identification of the subset of data that best distinguishes between experimental groups. Useful applications of the present invention include the discovery of biomarkers, disease targets and mode of action, and therapeutic chemical entities.