The past few years have seen dramatic advances in genomics and other areas of “high-throughput” biology. These accelerated technologies culminated in last year's publication of the human genome (Venter et al., (2001) The sequence of the human genome, Science, 291: 1304-1351). The availability of the DNA sequence of the human genome promises to alleviate much of the human suffering caused by life-threatening diseases. Knowledge of an entire genome may lead to the discovery of new drug targets. Access to the DNA sequence of an individual promises to reduce drug side effects and to allow medicine to be tailored to the individual's genetic makeup. Both government agencies and drug companies have invested heavily in these technologies. In return, they expected to vastly reduce the cost and time of drug development, a process that cost on average over $500 million in the 1990s and usually spanned over a decade from the initial discovery of drug targets and leads, through validation and optimization, to clinical trials.
Currently, these expectations are far from reality because human biology is complex, and there has been no systematic approach to capturing this biological complexity. A new field of computational biology has been forged to make sense of the inordinate amount of genomics data, including DNA sequence data, gene expression data, and proteomics, metabolomics, and cellomics data. Many in the industry believe that the integration of these data alone would quickly lead to the correlation of phenotype (clinical manifestations) with genotype (variations in gene sequence). That goal is still far off, however, because the majority of these data are examined out of context. The basis of a disease cannot be understood without understanding, for example, the alternative splicing forms of the related genes, the proteins for which they code, the complex networks of protein interactions involved, the multiple levels of gene regulation and expression, the correlations between healthy and diseased tissue, the significance of clinical data, and the like. The complexity of human biology requires a systemic understanding of genomic data rather than a shotgun understanding. As a result, the field of systems biology arose and is rapidly becoming a leading approach to understanding human biology.
A number of public and commercial efforts have focused on characterizing various aspects of general biochemistry and metabolism. These efforts include databases such as KEGG (Kanehisa et al., (2002) The KEGG databases at GenomeNet, Nucleic Acids Res., 30: 42-46); BRENDA (Schomburg et al., (2002) BRENDA, Enzyme data and metabolic information, Nucleic Acids Res., 30: 47-49); SWISS-PROT (Bairoch and Apweiler, (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000, Nucleic Acids Res., 28: 45-48); EcoCyc (Karp et al., (2002) The EcoCyc Database, Nucleic Acids Res., 30: 56-58); and EMP/MPW (Selkov et al., (1998) MPW: the Metabolic Pathways Database, Nucleic Acids Res., 26: 43-45). None of these databases, however, focuses specifically on humans or on a single species.
The technology known as Metabolic Reconstruction was developed by Dr. Evgeni Selkov and co-workers at the Argonne National Laboratory to study an organism's metabolism by using its genome sequence (Selkov et al., (1997) A reconstruction of the metabolism of Methanococcus jannaschii from sequence data, Gene, 197: GC11-26).
Traditionally, it has not been considered feasible to study metabolism based on expressed sequence tag (EST) data. Such an approach, however, would be very useful for comparative analyses of complex eukaryotic genomes. First, generating a complete set of ESTs is at least an order of magnitude less expensive than whole-genome sequencing. Second, a great deal of processed EST data is freely available to the scientific community. Currently, only a few complete eukaryotic genomes are available to the public, but sufficient EST data exist for several dozen species. Third, and most important, ESTs represent genes that are expressed at specific times in specific tissues. In the present invention, EST data, rather than genomic sequences, were used to reconstruct various aspects of human metabolism.