Recent progress in sequencing technology has generated a vast amount of genomic data. According to the GOLD database, there are more than 300 genomic projects currently completed or under development. Seventy-nine complete or partially complete genomes are available through the public ERGO system. To handle this wealth of information, several powerful bioinformatics systems have been developed. The WIT Project was instituted to develop a framework for the comparative analysis of genomic sequence data, focusing largely on the development of metabolic models for sequenced organisms. The analysis of a genome involves three distinct but complementary efforts. The first is the determination of open reading frames (ORFs). The second, often called annotation, is the assignment of functions to genes. The third, referred to as reconstruction, is the creation of functional models for the metabolic and regulatory networks of the sequenced genome.
Metabolic reconstruction for bacterial and archaeal genomes has been carried out (E. Selkov et al., Proc. Natl. Acad. Sci. U.S.A. 2000 Mar. 28; 97(7):3509-14). Metabolic reconstruction for eukaryotic organisms, by comparison, remains a much more complicated problem. Despite significant progress in genome sequencing, the annotation of eukaryotic genomes is still difficult. Even finding the ORFs, a key component of gene identification, is still a very difficult task. A comprehensive understanding of the complicated structure of eukaryotic genomes will require the integration of sequencing information with genetic, biochemical, structural, and evolutionary data. It will require new bioinformatics tools and new algorithms and, most likely, years of research in both dry and wet labs.
At the same time, a good deal of information about the sequences and expression patterns of eukaryotic genes has accumulated in numerous databases of expressed sequence tags (ESTs). (See, for example, Unigene EST, dbEST, STACK, SAGE, DOTS, trEST, and XREFdb, in addition to a number of tissue-specific databases, such as PEDB.) A significant amount of human EST data has already been carefully analyzed, classified, annotated, and mapped to chromosomes. Currently, there are over 1,000,000 human ESTs available in public databases, representing 50-90% of all human genes (Electrophoresis, 1999, Feb. 20(2):223-9). It is generally believed, however, that EST sequences are inferior to genomic DNA sequences in both quality and representativeness.
The technology known as Metabolic Reconstruction was developed by Dr. Evgeni Selkov and co-workers at the Argonne National Laboratory to study an organism's metabolism by using its genome sequence (Selkov et al. (1997) A reconstruction of the metabolism of Methanococcus jannaschii from sequence data. Gene, 197, GC11-26).
Traditionally, it has not been considered feasible to study metabolism based on EST data. Such an approach, however, would be very useful for comparative analyses of complex eukaryotic genomes. First, generation of a complete set of ESTs is at least an order of magnitude less expensive than whole genome sequencing. Second, there is a great deal of processed EST data freely available to the scientific community. Only a few complete eukaryotic genomes are currently available to the public, but there are sufficient EST data for several dozen species. Third, and most important, ESTs represent genes that are expressed at specific times in specific tissues. In the present invention, expressed sequence tag (EST) data were used rather than genomic sequences.