The present invention is related to techniques for processing gene expression data and predicting gene relationships.
Since ancient times, humans have sought early diagnosis and effective treatment of diseases. Over the last several hundred years, medical techniques, including blood sample analysis and physical imaging analysis, have greatly improved the ability to diagnose and treat disease. But no cure has yet been found for many deadly and debilitating diseases, including cancer. To further improve human health, gene diagnosis and gene therapy have been proposed and tested experimentally. Sequencing of the human genome has been largely completed, and the focus of medical research has shifted to unveiling the functions of various genes and their relations.
Relationships among genes may be inferred from their expression data and may be organized in various forms, including clusters, dendrograms, and relevance networks. Under the clustering method, genes are grouped into clusters based on their proximity in a multi-dimensional expression space, as measured by, among others, Euclidean distance, linear correlation, and non-linear correlation. Consequently, all genes are organized into a hierarchical structure. In contrast, under the dendrogram method, genes are comprehensively compared using a metric and then added to a binary tree in order of decreasing correlation, such that pairs with the highest correlation are closest in the tree. Both the clustering method and the dendrogram method suffer from several drawbacks. First, the clustering method based on Euclidean distance cannot handle genes with missing data, because incomplete expression vectors for these genes cannot be accurately located in the multi-dimensional space. Second, the dendrogram method and the clustering method based on Euclidean distance cannot identify negatively correlated genes and therefore ignore some important biological relations. Third, many clustering methods do not allow a gene to belong to multiple clusters, and thus cannot accurately describe genes that are under the control of two or more different regulatory factors. The dendrogram method has the same problem.
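By way of illustration only, the second drawback can be sketched in a few lines of Python. The expression values below are hypothetical and were chosen for clarity; the function names euclidean and pearson are assumptions of this sketch and do not come from any cited method. Gene B is a mirror image of gene A, so the two are perfectly negatively correlated, yet their Euclidean distance is far larger than that between gene A and the similar gene C.

```python
import math

def euclidean(x, y):
    """Euclidean distance between two expression vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson(x, y):
    """Pearson linear correlation coefficient between two expression vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

# Hypothetical expression levels across five samples.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [5.0, 4.0, 3.0, 2.0, 1.0]   # mirror image of gene A
gene_c = [1.1, 2.1, 2.9, 4.2, 4.8]   # close to gene A

print("distance A-B:", euclidean(gene_a, gene_b))
print("distance A-C:", euclidean(gene_a, gene_c))
print("correlation A-B:", pearson(gene_a, gene_b))
print("correlation A-C:", pearson(gene_a, gene_c))
```

A method that groups genes by small Euclidean distance would place genes A and C together and treat genes A and B as unrelated, even though the correlation between genes A and B is as strong in magnitude as a correlation can be.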
Under the relevance network method, a probability function, such as one based on mutual information or combinatorics, is used to estimate the probability that genes are independent. If the probability is low, the genes are predicted to have a significant relation. Such genes are then connected to one another, forming distinct gene networks with varying numbers of elements. Unlike the dendrogram method and many clustering methods, the relevance network method incorporates only genes with significant relations. In addition, a probability function is non-metric, so that the probability that two genes are related does not need to satisfy the “triangle inequality.” Under the “triangle inequality,” the distance, such as the Euclidean distance, between genes A and B cannot exceed the sum of the distance between genes A and C and the distance between genes B and C. This requirement limits the ability of a metric method to describe gene relationships. Genes A and B may be weakly related or unrelated, even though both genes are regulated in part by the same gene C. An adequate description would therefore assign a small distance between genes A and C and between genes B and C, but a large distance between genes A and B, one exceeding the sum of the other two distances. Such relationships among genes A, B, and C cannot be adequately described by a metric method, but a non-metric method such as the relevance network using a probability function can provide an adequate description.
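The construction described above can be sketched as follows, purely as an illustration. The pairwise probabilities, the threshold of 0.01, and the function name relevance_networks are all hypothetical; how the probabilities are computed is left open here. Pairs whose probability of independence falls below the threshold are linked, and the linked genes are then collected into separate networks.

```python
def relevance_networks(p_independent, threshold):
    """Group genes into networks linked by pairs with probability < threshold.

    p_independent maps a pair of gene names to the estimated probability
    that the two genes are independent (hypothetical input format).
    """
    # Keep only the significant pairs as undirected links.
    adjacency = {}
    for (a, b), p in p_independent.items():
        if p < threshold:
            adjacency.setdefault(a, set()).add(b)
            adjacency.setdefault(b, set()).add(a)

    # Collect connected components with a simple depth-first traversal.
    networks, seen = [], set()
    for gene in sorted(adjacency):
        if gene in seen:
            continue
        stack, component = [gene], set()
        while stack:
            current = stack.pop()
            if current in component:
                continue
            component.add(current)
            stack.extend(adjacency[current] - component)
        seen |= component
        networks.append(sorted(component))
    return networks

# Hypothetical probabilities: A and B are each related to C but not to each other.
p_independent = {
    ("A", "C"): 0.001,
    ("B", "C"): 0.002,
    ("A", "B"): 0.60,
    ("D", "E"): 0.004,
    ("A", "D"): 0.75,
    ("F", "A"): 0.90,
}

networks = relevance_networks(p_independent, threshold=0.01)
print(networks)
```

In this sketch genes A and B share a network through gene C although they have no direct link, and gene F, which has no significant relation, appears in no network. Nothing forces the A-B score to be bounded by the A-C and B-C scores, which is the freedom a metric would not allow.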
Some relevance network methods utilize a 2×2 matrix to calculate the probability that genes are independent, as exemplified in Walker, M. G. et al., Prediction of Gene Function by Genome-Scale Expression Analysis: Prostate Cancer-Associated Genes, Genome Research 9(12): 1198-1203, 1999; Walker, M. G. et al., Pharmaceutical Target Discovery Using Guilt-by-Association: Schizophrenia and Parkinson's Disease Genes, Proceedings of the International Conference on Intelligent Systems for Molecular Biology 146: 282-286, 1999; and Walker, M. G., Drug Target Discovery by Gene Expression Analysis: Cell Cycle Genes, Current Cancer Drug Targets 1(1), 2001. These methods use binary expression data representing the presence or absence of genes in a particular cell sample, and analyze gene relations based on the presence or absence of genes in a common set of cell samples. These methods do not examine continuous expression data representing the regulatory effect of genes, even though such data contain important information on gene relations.
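As a minimal sketch of a 2×2 matrix calculation of this kind, the following Python code computes a one-sided hypergeometric tail probability (the form used in Fisher's exact test) from presence/absence counts. The choice of this particular test, the function name cooccurrence_p_value, and the sample counts are assumptions made for illustration; the cited methods may differ in their details.

```python
from math import comb

def cooccurrence_p_value(both, only_a, only_b, neither):
    """Probability of seeing at least `both` co-occurrences by chance.

    The four arguments are the cells of the 2x2 matrix: samples with both
    genes present, only gene A, only gene B, and neither gene.
    """
    n = both + only_a + only_b + neither   # total number of samples
    a_total = both + only_a                # samples in which gene A is present
    b_total = both + only_b                # samples in which gene B is present
    max_both = min(a_total, b_total)

    # Sum hypergeometric probabilities for `both` or more co-occurrences,
    # assuming the two genes are present independently of each other.
    favourable = sum(
        comb(a_total, k) * comb(n - a_total, b_total - k)
        for k in range(both, max_both + 1)
    )
    return favourable / comb(n, b_total)

# Hypothetical counts over 20 cell samples.
p_related = cooccurrence_p_value(both=8, only_a=2, only_b=1, neither=9)
p_unrelated = cooccurrence_p_value(both=5, only_a=5, only_b=5, neither=5)
print(p_related, p_unrelated)
```

A low value, as for the first set of counts, indicates that the two genes are unlikely to be independent. The calculation uses only the four counts, which is why it cannot take graded expression levels into account.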
As an improvement, Liang proposed a relevance network method that discretizes continuous gene expression data into binary states, i.e., on or off states of regulatory effect, as described in Liang S., Reveal, A General Reverse Engineering Algorithm for Inference of Genetic Network Architectures, Pacific Symposium on Biocomputing 3:18-29 (1998). This binary representation of genes does not capture the full richness of expression data, which may show that a gene is up-regulated, down-regulated, or unchanged.
In contrast, Butte et al. used a relevance network method that discretizes the continuous expression data representing gene regulatory effect into n sub-ranges, e.g., 10 sub-ranges, as described in Butte A. J. et al., Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements, in Altman, R., Dunker, K., Hunter L., Lauderdale K., Klein T. eds., Pacific Symposium on Biocomputing, at 418-429, (2000), Hawaii, World Scientific. Because this method uses narrow sub-ranges, it cannot effectively filter out the measurement noise associated with gene expression data, and may therefore yield inaccurate gene correlations.
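The sensitivity of narrow sub-ranges to noise can be illustrated with the sketch below. It assumes equal-width sub-ranges spanning the observed minimum and maximum, which may not match the cited method in every detail, and it uses hand-picked hypothetical values in which each measurement is shifted by 0.03. The function names discretize, mutual_information, and changed_states are likewise assumptions of this sketch.

```python
import math
from collections import Counter

def discretize(values, n_bins):
    """Map each value to the index of its equal-width sub-range (0 .. n_bins-1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def mutual_information(x_states, y_states):
    """Mutual information, in bits, between two discretized expression vectors."""
    n = len(x_states)
    count_x = Counter(x_states)
    count_y = Counter(y_states)
    count_xy = Counter(zip(x_states, y_states))
    return sum(
        (c / n) * math.log2(c * n / (count_x[i] * count_y[j]))
        for (i, j), c in count_xy.items()
    )

def changed_states(clean, noisy, n_bins):
    """Number of samples whose discrete state differs between two measurements."""
    pairs = zip(discretize(clean, n_bins), discretize(noisy, n_bins))
    return sum(1 for a, b in pairs if a != b)

# Hypothetical expression levels and the same levels with small measurement noise.
clean = [0.00, 0.08, 0.19, 0.31, 0.42, 0.58, 0.69, 0.81, 0.92, 1.00]
noisy = [0.00, 0.11, 0.22, 0.28, 0.39, 0.61, 0.72, 0.78, 0.89, 1.00]

for n_bins in (3, 10):
    print(n_bins, "sub-ranges:", changed_states(clean, noisy, n_bins), "states changed")
```

With these particular values, the small shifts move many samples into a neighboring sub-range when 10 sub-ranges are used, while none change state with 3 sub-ranges. Since mutual_information is computed from the discrete states, such changes carry over into the estimated relation between genes.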