In the biomedical field, it is important to identify biomarkers, that is, substances that are indicative of a specific biological state. As new genomic and proteomic technologies emerge, biomarkers are becoming increasingly important in biological discovery, drug development and health care. Biomarkers are useful not only for the diagnosis and prognosis of many diseases, but also for understanding the basis for the development of therapeutics. Successful and effective identification of biomarkers can accelerate the development of new drugs. By combining therapeutics with diagnostics and prognosis, biomarker identification will also enhance the quality of current medical treatments and thus play an important role in the use of pharmacogenetics, pharmacogenomics and pharmacoproteomics.
Genomic and proteomic analysis, including high throughput screening, supplies a wealth of information regarding the numbers and forms of proteins expressed in a cell and provides the potential to identify, for each cell, a profile of expressed proteins characteristic of a particular cell state. In certain cases, this cell state may be characteristic of an abnormal physiological response associated with a disease. Consequently, identifying a cell state from a patient with a disease and comparing it to that of a corresponding cell from a healthy individual can provide opportunities to diagnose and treat diseases.
These high throughput screening techniques provide large data sets of gene expression information. Researchers have attempted to develop methods for organizing these data sets into patterns that are reproducibly diagnostic for diverse populations of individuals. One approach has been to pool data from multiple sources to form a combined data set and then to divide the data set into a discovery/training set and a test/validation set. However, both transcription profiling data and protein expression profiling data are often characterized by a large number of variables relative to the available number of samples.
Observed differences between expression profiles of specimens from groups of patients or controls are typically overshadowed by several factors, including biological variability or unknown sub-phenotypes within the disease or control populations, site-specific biases due to differences in study protocols and specimen handling, biases due to differences in instrument conditions (e.g., chip batches), and variations due to measurement error.
Several computer-based methods have been developed to find a set of features (markers) that best explains the difference between the disease and control samples. Some early methods included statistical tests such as LIMMA, the FDA-approved MammaPrint technique for identifying biomarkers relating to breast cancer, logistic regression techniques and machine learning methods such as support vector machines (SVM). From a machine learning perspective, the selection of biomarkers is typically a feature selection problem for a classification task. However, these early solutions faced several disadvantages. The signatures generated by these techniques were not reproducible because the inclusion or exclusion of subjects could lead to different signatures. These early solutions were also not robust because they operated on datasets having small sample sizes and high dimensions. Additionally, the signatures generated by these techniques included many false positives and were difficult to interpret biologically because neither the technique nor the gene signatures themselves shed any light on the underlying biological mechanisms. Consequently, because these signatures are not reproducible and are difficult to interpret, they may not be especially useful for clinical diagnosis.
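The reproducibility concern described above can be illustrated with a minimal sketch. It assumes a synthetic cohort of pure-noise data with far more features than subjects, and a simple ranking by absolute difference in class means as a stand-in for the early methods named above. The function names (`top_k_signature`, `jaccard`, `leave_one_out_stability`) and all parameter values are hypothetical and chosen only for illustration.

```python
import random
from statistics import mean


def top_k_signature(X, y, k):
    """Return the indices of the k features with the largest absolute
    difference between the case mean (label 1) and control mean (label 0)."""
    scores = []
    for j in range(len(X[0])):
        cases = [row[j] for row, label in zip(X, y) if label == 1]
        controls = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(mean(cases) - mean(controls)), j))
    return {j for _, j in sorted(scores, reverse=True)[:k]}


def jaccard(a, b):
    """Jaccard overlap between two non-empty sets of feature indices."""
    return len(a & b) / len(a | b)


def leave_one_out_stability(X, y, k):
    """Mean overlap between the full-cohort signature and each signature
    obtained after excluding a single subject."""
    full = top_k_signature(X, y, k)
    overlaps = []
    for i in range(len(X)):
        X_sub = X[:i] + X[i + 1:]
        y_sub = y[:i] + y[i + 1:]
        overlaps.append(jaccard(full, top_k_signature(X_sub, y_sub, k)))
    return mean(overlaps)


# Assumed synthetic cohort: 12 subjects, 200 noise features (small n, large p).
rng = random.Random(0)
n_subjects, n_features = 12, 200
X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)] for _ in range(n_subjects)]
y = [1] * 6 + [0] * 6

stability = leave_one_out_stability(X, y, k=10)
print(f"mean leave-one-out signature overlap: {stability:.2f}")
```

An overlap of 1 would mean that excluding any single subject leaves the signature unchanged. With few subjects and many features, the overlap can fall below 1 even though no subject is unusual, which is the instability described above.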
More recent techniques involve the integration of knowledge about canonical pathways and protein-protein interactions into gene selection algorithms. Several feature selection techniques have also been developed, including filter methods, wrapper methods and embedded methods. Filter methods work independently of classifier design and perform feature selection by looking at the intrinsic properties of the data. Wrapper and embedded methods perform feature selection by making use of a specific classification model. Wrapper methods use a search strategy in the space of possible feature subsets, guided by the predictive performance of a classification model. Embedded methods make use of the classification model's internal parameters to perform feature selection. However, these techniques also face several disadvantages.
Accordingly, there is a need for an improved technique for identifying biomarkers for clinical diagnosis, prognosis or both.
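The distinction between filter and wrapper methods described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a description of any particular published technique: the filter ranks features by a Welch-style t statistic, the wrapper performs a greedy forward search scored by the leave-one-out accuracy of a nearest-centroid classifier, and the toy data and all function names are hypothetical. Embedded methods are omitted for brevity.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Welch-style t statistic between two groups for a single feature."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    diff = mean(a) - mean(b)
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se


def filter_select(X, y, k):
    """Filter method: rank features using only the data, with no classifier."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 1]
        b = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((abs(t_score(a, b)), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def loo_accuracy(X, y, features):
    """Leave-one-out accuracy of a nearest-centroid classifier that uses
    only the given feature indices."""
    correct = 0
    for i in range(len(X)):
        dist = {}
        for cls in (0, 1):
            rows = [X[n] for n in range(len(X)) if n != i and y[n] == cls]
            centroid = [mean(r[j] for r in rows) for j in features]
            dist[cls] = sum((X[i][j] - c) ** 2 for j, c in zip(features, centroid))
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)


def wrapper_select(X, y, k):
    """Wrapper method: greedy forward search over feature subsets, guided
    by the predictive performance of the classifier."""
    selected, remaining = [], list(range(len(X[0])))
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda j: loo_accuracy(X, y, selected + [j]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Assumed toy data: feature 0 separates the two classes, the rest do not.
X = [
    [5.1, 1.0, 0.3, 2.0], [4.9, 0.8, 0.5, 2.1],
    [5.3, 1.2, 0.4, 1.9], [5.0, 0.9, 0.6, 2.2],
    [1.0, 1.1, 0.5, 2.0], [1.2, 0.9, 0.3, 2.1],
    [0.9, 1.0, 0.6, 1.8], [1.1, 1.2, 0.4, 2.2],
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

print("filter :", filter_select(X, y, 2))
print("wrapper:", wrapper_select(X, y, 2))
```

The filter scores each feature once, whereas the wrapper retrains and re-evaluates the classifier for every candidate subset, so its cost grows quickly with the number of features.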