Recent advances in biotechnology, such as the sequencing of the human genome, have increased the need for information on how the various encoded gene products, or proteins, mediate the biological processes that either contribute to health or cause disease. Standard molecular biological techniques study these processes at the genomic level but do not provide information at the protein level. The growing field of proteomics research involves the search for targets or biomarkers for drug discovery and development, as well as for information that can be used to diagnose disease.
Comprehensive system-wide biomarker discovery has been made easier by the advent of large-scale analytical methods such as DNA microarray technology, high-throughput mass spectrometry (MS) and other techniques used to study complex biological systems. Statistical and machine-learning methods have also been developed, allowing the study of very large datasets produced by high-throughput protein analysis methods.
High-throughput MS is a powerful technique in biomarker discovery. However, its use is complicated by a number of factors. Biological samples are very complex, often containing hundreds to thousands of compounds, which makes them difficult to analyze. For example, the differential comparison of liquid chromatography-mass spectrometry (LC-MS) data from different biological samples generates complex datasets and presents significant data processing challenges. The analysis is time-consuming, and there is often significant noise and variability that is not properly accounted for. Current methods to eliminate noise and detect mass spectral peaks use an ad hoc approach and do not use any a priori or learned information about peak shape, retention time, or relationships among peaks. Statistical methods used to subtract background and reduce noise often remove relevant information along with the noise and irrelevant information. The resulting datasets are not suitable for downstream analysis during biomarker discovery.
Therefore, there is a need for methods of analyzing complex MS datasets that incorporate richer qualitative information and thereby improve biomarker analysis. One way to address these challenges is to use a software module that contains a means for a priori partitioning of features, such that irrelevant features are filtered out before differential analysis of the data is performed, while relevant features are preserved for later analysis. If molecular features corresponding to specific chemical properties can be extracted in a fast and efficient manner, the data obtained can serve as the basis for a powerful bioinformatics system.
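As a rough illustration of a priori partitioning, the following minimal sketch separates features using a rule fixed before any differential analysis is run. It is not taken from the text above: the `Feature` record, the `partition_features` helper, and the m/z and retention-time windows in `in_window` are all hypothetical, and the numeric limits are arbitrary placeholders rather than recommended values.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Feature:
    """Hypothetical LC-MS feature record (illustrative only)."""
    mz: float              # mass-to-charge ratio
    retention_time: float  # retention time, in minutes
    intensity: float       # peak intensity, arbitrary units


def partition_features(
    features: Iterable[Feature],
    is_relevant: Callable[[Feature], bool],
) -> Tuple[List[Feature], List[Feature]]:
    """Split features into (kept, set_aside) using a rule chosen in advance."""
    kept: List[Feature] = []
    set_aside: List[Feature] = []
    for feature in features:
        if is_relevant(feature):
            kept.append(feature)
        else:
            set_aside.append(feature)
    return kept, set_aside


def in_window(feature: Feature) -> bool:
    """Assumed a priori rule: keep features inside fixed m/z and time windows."""
    return 300.0 <= feature.mz <= 1500.0 and 5.0 <= feature.retention_time <= 60.0


features = [
    Feature(mz=450.2, retention_time=12.3, intensity=1.0e5),
    Feature(mz=120.1, retention_time=2.0, intensity=3.0e4),
    Feature(mz=980.5, retention_time=45.0, intensity=7.5e4),
]

# Only the kept features would be passed on to differential analysis.
kept, set_aside = partition_features(features, in_window)
```

In this sketch the features that fail the rule are returned in a separate list rather than discarded, so they remain available for inspection if the rule later proves too strict.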