1. Field of the Invention
The present invention generally relates to a system and methods for processing biological entities whose expression levels, individually or collectively as patterns, differentiate biological samples of different phenotypes associated with the presence, absence, or severity of conditions or perturbations. More particularly, the present invention relates to a system and methods for processing, including detecting and analyzing, the collective patterns in biological expression levels for the purposes of selecting variables and deriving multivariate models to predict phenotypic differences in future data.
2. Description of the Related Art
The human genome project and the newly expanded effort into large scale gene expression profiling and protein profiling have generated an unprecedentedly large amount of data. Commercial companies, government agencies, and academic institutions have over the years invested heavily in the area of bioinformatics to develop infrastructures and technologies for the storage and compilation of data and to provide easy access to the constructed databases. However, in order to actually reap benefits from such amassed data for both scientific and commercial purposes, new and better analytical and computational tools are critically needed to allow researchers and scientists to extract relevant information from the vast amount of data and convert it into forms that are readily usable in academic and commercial research and development activities.
There are two basic approaches to the utilization of information from data: (1) the first principles approach, in which current understanding and knowledge of physical laws are used to assemble pieces of data and information together to form new knowledge and understanding that explain observed data; and (2) the pattern recognition approach, in which regularities and associations among variables in observed data are identified to establish mathematical models that predict future data. These two approaches are closely related and can mutually benefit from each other. Explanations of observed data serve as the scientific foundations for pattern recognition based data modeling and prediction, while trends and relationships in data detected through pattern recognition offer directions and generate hypotheses for further research.
Compared to our newly acquired ability to generate exponentially growing volumes of data, our knowledge and understanding in the biomedical sciences are very limited, even with the extremely impressive progress made over the past decade. For the foreseeable future, pattern recognition based approaches may continue to play an important role in the analysis of biomedical data. For commercial product development, such an approach may allow us to bypass some of the current unknowns and instead establish a direct linkage between information extracted from data and an endpoint of interest, such as diagnostic or therapeutic targets.
The development of high density arrays of oligonucleotides or complementary DNAs (microarrays) has made it possible to simultaneously analyze the expression levels of tens of thousands of genes in a single array (experiment). An overview of the current state of the art of the technology may be found in (Eisen and Brown, 1998) and (Lockhart and Winzeler, 2000). The abundance of mRNA, measured through optical imaging and numerical post-processing procedures, can ultimately be represented quantitatively as a vector of numerical values.
Recent advances in protein arrays and biochips have greatly facilitated protein functional analysis and expression profiling. Such high throughput devices, depending on the protein capturing approaches, measure the abundances of either a pre-selected set of proteins or a large number of unspecified proteins sharing certain common properties so that they are captured together. One of the most common methods of measuring protein abundance is mass spectrometry (MS). The intensities of individual peaks in the mass spectra represent the relative abundances of the corresponding proteins (including fragments of proteins) and peptides. The time-of-flight (TOF) measurements associated with the peaks indicate their molecular weights. Data from such devices, after normalization, calibration, and appropriate preprocessing procedures, may also be represented as a vector of numeric values in which each entry is the relative abundance of a particular protein (or protein fragment/peptide) that is either known by name or labeled by its molecular weight.
Expression data are typically characterized by a very large number of measurements (tens of thousands of genes in genomic expression data and hundreds or thousands of proteins in proteomic expression data) in comparison to a relatively small number of data points (number of experiments).
To extract useful information from such expression data, various analytical and data visualization techniques have been proposed. These techniques center around three aspects of expression data analysis: 1) detection of expression patterns; 2) dimension reduction; and 3) data visualization. Algorithms based on cluster analysis have been used in many reported studies on gene expression profile analyses using microarray data (Eisen et al., 1998). With a predefined similarity or distance measure, these algorithms reorganize the expression data in such a way that genes sharing similar expression profiles across multiple experiments are arranged together for easy identification of patterns. In a similar fashion, individual experiments with similar expression profiles over the entire set of genes may also be clustered together. For similar purposes, other techniques, such as self-organizing maps, have also been proposed to partition gene expression data into groups of genes and arrays of similar characteristics. The main advantage of these approaches is that they provide a holistic view of expression patterns across the entire set of observations. A noticeable drawback of such approaches is, however, that the majority of genes in the dataset, including those with strong expression variation patterns, might not be associated at all with a particular endpoint of interest (e.g., different phenotypes or experimental conditions). Consequently, expression patterns that are truly associated with the endpoint would have to have a strong presence, in terms of the number of genes with similar profiles, in order to be detected among the large number of non-contributing expression patterns.
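By way of illustration only, the grouping of genes by a predefined similarity measure may be sketched as follows. The sketch assumes Pearson correlation as the similarity measure and a simple single-linkage grouping rule with a hypothetical threshold `min_corr`; the gene names and values are invented for the example, and the cited studies may use different measures and linkage rules.

```python
import math

def pearson(x, y):
    """Pearson correlation between two expression profiles of equal length.
    Assumes neither profile is constant (otherwise the denominator is zero)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def cluster_genes(profiles, min_corr=0.9):
    """Single-linkage grouping: two genes end up in the same cluster when they
    are connected by a chain of pairs whose correlation is at least min_corr."""
    names = list(profiles)
    parent = {g: g for g in names}

    def find(g):
        # Follow parent links to the cluster representative (with path halving).
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if pearson(profiles[a], profiles[b]) >= min_corr:
                parent[find(a)] = find(b)

    clusters = {}
    for g in names:
        clusters.setdefault(find(g), []).append(g)
    return sorted(sorted(c) for c in clusters.values())

# Hypothetical expression levels of four genes across four experiments.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 4.0, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.0, 6.1, 4.0, 2.2],
}
clusters = cluster_genes(profiles)
print(clusters)  # geneA groups with geneB, and geneC with geneD
```

Note that the grouping uses no phenotype labels at all, which is the reason a cluster need not be associated with any particular endpoint of interest.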
Singular value decomposition has recently been suggested to project expression data onto the so-called eigengene×eigenarray space to reduce dimensionality for better data interpretation (Alter et al., 2000). The approach is similar to principal component analysis (PCA), in which the major eigengenes correspond to the directions represented by genes with the largest variance in their expression levels. This method has a similar drawback: the genes with the largest variance in their expression levels might not necessarily be associated with the endpoint of interest in the analysis.
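The drawback noted above may be illustrated with a small sketch. It assumes a toy data set of two hypothetical genes measured in six samples, and it uses power iteration on the sample covariance matrix as a stand-in for a full PCA or SVD routine. Gene 1 varies widely but is unrelated to the class labels, while gene 2 varies little but separates the classes.

```python
import math

def covariance_matrix(samples):
    """Sample covariance matrix of a list of equal-length measurement vectors."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenvector(matrix, iterations=200):
    """Power iteration: approximates the unit eigenvector with the largest eigenvalue."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(row[j] * v[j] for j in range(len(v))) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical (gene 1, gene 2) expression values; labels mark two phenotypes.
samples = [(10.0, 1.0), (-10.0, 1.1), (9.0, 0.9),
           (-9.0, -1.0), (11.0, -1.1), (-11.0, -0.9)]
labels = [0, 0, 0, 1, 1, 1]

pc1 = leading_eigenvector(covariance_matrix(samples))
print(pc1)  # the first principal direction loads almost entirely on gene 1

# Gene 2 alone separates the two classes, yet contributes little to pc1.
gene2_separates = all((s[1] > 0) == (y == 0) for s, y in zip(samples, labels))
```

In this toy case a projection onto the first principal direction would retain mostly the uninformative gene 1, even though gene 2 carries the class information.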
Effective tools for expression data visualization allow human experts to interactively inspect the data and results during or after various analysis procedures, such as the cluster structure from cluster analysis and the projection of individual experiments (arrays) onto a two- or three-dimensional space of selected eigenvectors after PCA.
In many expression data studies, one often has a specific endpoint of interest in mind, which could be the desired separation of specimens from subjects of different phenotypes (e.g., normal vs. tumor tissues) or of the same type of specimens under different experimental conditions (e.g., yeast cells in normal cell division cycles vs. yeast cells under heat shock stress). For such cases, the purpose is to identify the variables (e.g., genes or proteins) whose expression variation patterns are associated with the different values or conditions of the endpoint of interest.
The identification of differentially expressed genes or proteins typically requires a set of expression data as training data in which the identity (label) of each experiment sample is known beforehand. An analytical method that works under such an assumption is commonly referred to as a supervised method. One way to identify differentially expressed genes or proteins is first to use a supervised method to derive a classification model (classifier) that assigns the experiments to a predefined number of known classes with minimum error. The contributions of individual variables to the classification model are then analyzed as a measure of the significance of the genes or proteins whose expression levels, collectively as co-regulated patterns, differentiate the different classes of experiments.
There are two fundamentally different approaches to the derivation of classification models. With the traditional statistical approach, the training data are used to estimate the conditional distributions for each of the classes of experiments. Based on the Bayes decision rule, a final classifier is then determined. A simple example of this approach is the Linear Discriminant Analysis (LDA) method (Fisher 1923). In LDA, the training data from two predefined classes are used to estimate the two class means and a pooled covariance matrix. The means and covariance matrix are then used to determine the classification model. It can be shown using the Bayes decision rule that if the data are conditionally normally distributed and share the same covariance structure, LDA is the optimal classification model for separating the two classes of data.
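The LDA procedure described above may be sketched, under stated assumptions, as follows. The sketch is restricted to two variables so that the pooled covariance matrix can be inverted in closed form, and it assumes equal prior probabilities for the two classes, so that the decision threshold falls at the midpoint of the class means. The function names and data are hypothetical.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[0] for r in rows) / n, sum(r[1] for r in rows) / n]

def scatter(rows, m):
    """2x2 sum of outer products of deviations from the class mean m."""
    sxx = sum((r[0] - m[0]) ** 2 for r in rows)
    syy = sum((r[1] - m[1]) ** 2 for r in rows)
    sxy = sum((r[0] - m[0]) * (r[1] - m[1]) for r in rows)
    return [[sxx, sxy], [sxy, syy]]

def fit_lda(class0, class1):
    """Returns (w, c): predict class 1 when w . x > c (equal priors assumed)."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter(class0, m0), scatter(class1, m1)
    dof = len(class0) + len(class1) - 2
    # Pooled covariance: both classes contribute to one shared estimate.
    s = [[(s0[i][j] + s1[i][j]) / dof for j in range(2)] for i in range(2)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    inv = [[s[1][1] / det, -s[0][1] / det], [-s[1][0] / det, s[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    mid = [(m0[0] + m1[0]) / 2, (m0[1] + m1[1]) / 2]
    c = w[0] * mid[0] + w[1] * mid[1]
    return w, c

def predict(w, c, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > c else 0

# Hypothetical training data: two variables, four samples per class.
class0 = [(1.0, 2.0), (2.0, 1.0), (2.0, 3.0), (3.0, 2.0)]
class1 = [(5.0, 6.0), (6.0, 5.0), (6.0, 7.0), (7.0, 6.0)]
w, c = fit_lda(class0, class1)
print(predict(w, c, (2.5, 2.5)), predict(w, c, (5.5, 6.0)))
```

Every training sample enters the class means and the pooled covariance with the same weight, regardless of how far it lies from the boundary between the classes.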
The other approach to the derivation of classification models is called empirical risk minimization. In this approach, the model is determined directly by minimizing a predefined empirical risk function that is linked to the classification error of the model over the training data. The support vector machine (SVM) (Vapnik, 1998) is one such method. In addition to minimizing the empirical risk, SVM also controls the complexity of the model to partially overcome the problem of over-fitting the training data.
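As a rough illustration of empirical risk minimization with complexity control, the following sketch fits a linear classifier by subgradient descent on the average hinge loss plus an L2 penalty on the weights. This is a simplified stand-in for SVM training, not the quadratic programming formulation of (Vapnik, 1998); the parameter values `lam`, `lr`, and `epochs` and the data are assumptions chosen for the example.

```python
def train_linear_svm(xs, ys, lam=0.01, lr=0.01, epochs=200):
    """Minimizes lam/2 * ||w||^2 + mean(max(0, 1 - y * (w . x + b)))
    by batch subgradient descent. Labels ys must be +1 or -1."""
    d, n = len(xs[0]), len(xs)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the complexity penalty
        grad_b = 0.0
        for x, y in zip(xs, ys):
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:
                # Only samples inside or beyond the margin contribute to the risk.
                for j in range(d):
                    grad_w[j] -= y * x[j] / n
                grad_b -= y / n
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def predict(w, b, x):
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1

# Hypothetical two-variable training data with labels +1 and -1.
xs = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0)]
ys = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(xs, ys)
predictions = [predict(w, b, x) for x in xs]
print(w, b, predictions)
```

Once a sample's margin reaches 1 it drops out of the risk term, so the fitted model is shaped by the samples nearest the class boundary while the penalty term limits the size of the weights.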
The two approaches work differently, and each may be more appropriate than the other for certain problems. In general, both approaches produce good results for problems with a sufficiently large training data set. However, for problems with a very small training data set, both approaches are constrained in the way they utilize the information contained in the limited number of samples.
In the traditional statistical approach, training data, whether they are located close to the boundaries between pairs of classes or far away from the boundaries, contribute equally to the estimation of the conditional distributions from which the final classification model is determined. Since the purpose of classification is to recover accurately the actual boundaries that separate the classes of data, training samples close to the separating boundaries should play a more important role than those samples that are far away. Using clinical diagnostic problems as an example, specimens from patients who are borderline cases, such as those with early stage diseases and benign conditions, should be more useful in defining precisely the disease and non-disease classes than those from patients with late stage diseases or from young healthy controls.
Using the empirical risk minimization approach, on the other hand, the final classification model is largely determined based on the training data that are close to the class boundaries. The solution from SVM, for example, is determined exclusively by a subset of the training samples located along the class boundaries (support vectors). The overall data distribution information, as partially represented by the total available training samples, is ignored.
For problems with a sufficiently large number of training samples, both approaches will, asymptotically, work well. A large set of training samples will allow for the precise estimation of the conditional distributions, including in the regions where different classes separate from one another, so that classifiers based on the Bayes decision rule will perform optimally; the empirical risk minimization approach will also be able to define a precise classification model based on training samples that representatively cover the spaces along all the boundaries.
For problems with a limited number of training samples such as biological expression data analysis, however, neither of the two approaches is particularly efficient in utilizing the information from the training data.
Therefore, there is a critical need for methods and systems that take advantage of both the traditional statistical approach and the empirical risk minimization approach, and that provide a quantitative mechanism to incorporate prior knowledge into the data analysis process.