In many systems or phenomena, naturally occurring or otherwise, distinctive patterns of data are often buried within the highly complex data sets that are created to characterize such systems or phenomena. Such patterns have been observed, for example, in the study of a wide variety of systems and phenomena such as diseases, environmental conditions, and financial conditions, to name a few.
The distinctive patterns of data that may characterize certain conditions are often not obvious or apparent using existing classification methods and systems. The current classification systems and methods typically find or uncover a single known differentiating feature between sets of data or analyze only a subset of the data. A hidden pattern found in one dataset is generally not applicable to another dataset. That is, these systems generally require “retraining” on each new set of data and cannot completely characterize the dataset.
For example, current classification systems cannot effectively screen for early stage ovarian cancer. In its early stages, ovarian cancer is an insidious disease, exhibiting essentially no symptoms. Ovarian tumors may grow to a size of about 10-12 cm before impinging on adjacent organs, resulting in symptoms, such as increased urinary frequency and rectal pressure. More than 80% of ovarian cancer patients currently are diagnosed at a late clinical stage as a result of the absence of early stage symptoms and the associated 5-year survival rate is only 35%. If ovarian cancers are diagnosed at an early stage, however, the 5-year survival rate is more than 90% because, in most cases, the cancer can then be eradicated completely by surgery.
An estimated 25,000 women are diagnosed with ovarian cancer annually in the United States and approximately 14,500 women die from the disease each year. An effective screening program for early stage ovarian cancer has been elusive, however, due to factors which include the lack of a highly specific screening test. With the rapid development of proteomics, new screening strategies utilizing modern proteomic technology and bioinformatics are emerging, but none have shown sufficient specificity to be an effective diagnostic tool.
The PROTEIN CHIP array surface-enhanced laser desorption-ionization (SELDI) mass spectrometry (MS) system, available from Ciphergen Biosystems of Fremont, CA, USA, is increasingly recognized as the leading technology for fast and reliable protein profiling based on tissue or body fluid samples. The underlying principle of SELDI is surface-enhanced affinity capture through the use of specific probe surfaces or protein chips. Once captured on the SELDI protein chip array, proteins are detected through the ionization-desorption, time-of-flight mass spectrometry process. The PROTEIN CHIP SELDI-MS has been useful in identifying known markers of prostate cancer and in discovering potential markers which are over- or under-expressed in prostate cancer cells and body fluids.
By comparing serum proteomic spectra of early stage ovarian cancer patients with a comparable group of unaffected women using a bioinformatics algorithm, a recent study has identified a set of proteomic markers and has been able to classify subjects with a sensitivity of 100% and a specificity of 95%. See Petricoin et al. “Use of Proteomic Patterns in Serum to Identify Ovarian Cancer,” The Lancet, vol. 359, pp. 572-577 (Feb. 16, 2002), incorporated herein by reference in its entirety and hereinafter referred to as the “Lancet Paper”; and U.S. Published patent application Ser. No. 2003/0004402 A1, entitled “Process for discriminating between biological states based on hidden patterns from biological data,” also incorporated herein by reference in its entirety.
Shortly after the publication of the Lancet Paper, using the same set of subjects and an improved protein surface, Petricoin et al. achieved better results with a sensitivity of 100% and a specificity of 97%. See “Correspondence,” The Lancet, vol. 360, pp. 169-171 (Jul. 13, 2002), incorporated herein by reference in its entirety and hereinafter referred to as the “Correspondence.” The corresponding proteomic mass spectrum data set (referred to as “ovarian data set 4-3-02”) is publicly available at the NIH/FDA Clinical Proteomic Program Databank website.
While applauding their accomplishment, many remained skeptical about the screening value of the method described by Petricoin et al. in the Correspondence. In fact, Petricoin et al. stated that the prevalence of ovarian cancer in postmenopausal women is 1 in 2,500, which means that a screening assay with 97% specificity would result in 75 false positives for every true positive identification.
Several statistics-based analytical tools have been developed to analyze mass spectra of protein marker expression for various disorders. The genetic algorithm first described by John Holland in the mid-1970s manipulated complex data sets as individual elements through a computer-driven analog of a natural selection process. In 1982, Kohonen proposed a cluster analysis method using a self-organizing map. Correlogic Systems, Inc. of Bethesda, Maryland has combined the ideas of Holland's genetic algorithm and Kohonen's self-organizing map to implement a pattern discovery algorithm in software named PROTEOME QUEST (genetic algorithm and self-organizing map software for implementing pattern discovery), Beta version 1.0. Petricoin et al. utilized the PROTEOME QUEST (genetic algorithm and self-organizing map software for implementing pattern discovery) software to analyze the proteomic spectra generated by SELDI-TOF in order to identify ovarian cancer. Petricoin et al. adopted a random window approach to sequentially select markers and to examine their contribution towards the classification rate.
A drawback to Petricoin et al.'s approach is that only portions of the proteomic spectra are used for the analysis, in which case the contribution of each marker may vary with the window size and significant protein markers may be excluded from the analysis. Conversely, many of the biomarkers predicted by such known methods will not be statistically significant, so that in many cases, efforts to determine the underlying molecular identity and subsequent cell and molecular biology will be fruitless.
For the large-scale screening for the presence of early cancer, specificity and sensitivity must approach 100% to assure no disease is missed and to prevent pursuit of unnecessary additional diagnostic procedures. Similarly, biomarker identification and molecular characterization require a high degree of reproducibility and fidelity for each individual proteomic marker.
Another challenge when analyzing proteomic data (or genomic data, in general) is to draw robust conclusions from high-dimensional data based on relatively few subjects. The question is how well such conclusions hold beyond the subjects studied. In addition to providing 100% accuracy and specificity when analyzing a particular testing set, a robust analysis method should be able to determine discriminating markers that can be trusted to accurately diagnose any random subject from the relevant population at large. In other words, 100% accuracy and specificity should apply to the full population.
In light of the known approaches and their limitations, it is clear that improved analysis methods and systems are required. Moreover, it is desirable to have a comprehensive method and system that considers each data point of the dataset to discover hidden patterns or markers, thereby enabling the method and system to detect subtle differences in multiple datasets without retraining on each dataset. Such a system and method trained, for example, to detect a toxin from an environmental dataset can be used on another environmental dataset to detect that toxin without retraining on the other environmental dataset.