1. Field of the Invention
The present invention relates to bioinformatics technologies. More specifically, the present invention relates to the technology of System Reconstruction. The present invention further relates to methods for elucidating metabolic pathways for the identification of novel therapeutic targets and biomarkers using network analysis. The initial seed networks are built from the lists of novel targets for diseases with the high-throughput experimental data being superimposed on the seed networks to identify specific targets.
2. Description of Related Art
The past few years have seen dramatic advances in genomics and other areas of high-throughput biology. The fruits of these accelerated technologies culminated in last year's publication of the human genome. The availability of the DNA sequence of the human genome promises to alleviate much of human suffering from life-threatening diseases. Knowledge of an entire genome may lead to the discovery of new drug targets. Access to the DNA sequence of an individual promises to reduce drug side effects and to allow tailoring medicine to the individual's genetic makeup. Both government agencies and drug companies have invested heavily in these technologies. In return, they expected to vastly reduce the cost and time of drug development, a process costing on average over $500 million in the 1990s and usually spanning over a decade from the initial discovery of drug targets and leads, through validation, optimization, and finally clinical trials.
Currently, these expectations are far from reality because human biology is complex, and there has been no systematic approach to capture this biological complexity. A new field of computational biology has been forged to make sense out of the inordinate amount of genomics data including DNA sequence data, gene expression data, proteomics, metabolomics, and cellomic data. It is believed by many in the industry that the integration of these data alone would quickly lead to the correlation of phenotype (clinical manifestations) with genotype (variations in gene sequence). That goal is still far off, however, as the majority of these data are examined out of context. The basis of a disease cannot be understood without understanding, for example, the alternative splicing forms of the related genes, the proteins for which they code, the complex networks of protein interactions involved, the multiple levels of gene regulation and expression, the correlations between healthy and diseased tissue, the significance of clinical data, and the like. The complexity of human biology requires a systemic understanding of genomic data rather than a shotgun understanding. As a result, the field of systems biology arose and is rapidly becoming a leading approach to understanding human biology.
Recent progress in sequencing technology has generated a vast amount of genomic data. According to the GOLD database, there are more than 300 genomic projects currently completed or under development (wit.integratedgenomics.com/GOLD/). Seventy-nine complete or partially complete genomes are available through the public ERGO system (igweb.integratedgenomics.com/lGwit/). In order to handle this wealth of information, several powerful bioinformatics systems have been developed. The WIT Project was instituted to develop a framework for the comparative analysis of genomic sequence data, focusing largely on the development of metabolic models for sequenced organisms. The analysis of the genomes involves several distinct, but complementary efforts. The first is a determination of open reading frames (ORFs). The second, often called annotation, is the assignment of functions to genes. The third is the creation of functional models for metabolic and regulatory networks of the sequenced genomes, referred to as reconstruction.
Metabolic reconstruction for bacterial and archaeobacterial genomes has been carried out. In contrast, metabolic reconstruction for eukaryotic organisms remains a much more complicated problem. Despite significant progress in genome sequencing, the annotation of eukaryotic genomes is still difficult. Even finding the ORFs, a key component of gene identification, is still a very difficult task. A comprehensive understanding of the complicated structure of eukaryotic genomes will require the integration of sequencing information with genetic, biochemical, structural, and evolutionary data. It will require developing new bioinformatics tools and discovering new algorithms, and, most likely, it will take years of research in both dry and wet labs.
Traditionally, it has not been considered feasible to study metabolism based on expressed sequence tag (EST) data. Such an approach, however, would be very useful for comparative analyses of complex eukaryotic genomes. First, generation of a complete set of ESTs is at least an order of magnitude less expensive than whole genome sequencing. Second, there is a great deal of processed EST data freely available to the scientific community. Currently, there are only a few complete eukaryotic genomes available to the public, but there are sufficient EST data for several dozen species. Third, and most important, ESTs represent genes that are expressed at specific times in specific tissues. In the present invention, expressed sequence tag data, rather than genomic sequences, were used to reconstruct various aspects of human metabolism.
Several databases exist for collecting EST sequence and expression patterns for eukaryotic genes (for example Unigene EST, dbEST, STACK, SAGE, DOTS, trEST, XREFdb, in addition to a number of tissue-specific databases, such as PEDB). A significant amount of human EST data has already been carefully analyzed, classified, annotated, and mapped to chromosomes. Currently, there are over 1,000,000 human ESTs available in public databases representing 50-90% of all human genes. It is generally believed, however, that EST sequences are inferior to genomic DNA sequences in terms of their quality and degree of representativeness.
Additionally, numerous public and commercial efforts have focused on characterizing various aspects of general biochemistry and metabolism. Some of the resulting databases include KEGG, BRENDA, SWISS-PROT, EcoCyc, and EMP/MPW. None of these databases, however, focuses specifically on humans, or on a single species.
The technology known as Metabolic Reconstruction was developed by Dr. Evgeni Selkov and co-workers at the Argonne National Laboratory. Metabolic Reconstruction was developed to study an organism's metabolism by using its genome sequence. A reconstruction of the metabolism of Methanococcus jannaschii from sequence data can be found in Gene, 197, GC11-26.
Cellular life can be represented and studied as the interactome, the dynamic network of biochemical reactions and signaling interactions between active proteins. Systemic network analysis is optimal for the integration and functional interpretation of high-throughput experimental data, which are abundant in drug discovery yet poorly understood. The composition and topology of complex networks are closely associated with vital cellular functions, which has important implications for life science research. Network theory has advanced quickly in recent years, and reliable databases of protein interactions for human and model organisms, together with comprehensive analytical tools, have become available. In this application, we present a specific application of network analysis: identification of novel drug targets by reverse engineering the networks that connect the existing targets for a specific disease, followed by superposition of experimental molecular data such as microarray gene expression, proteomics, and metabolomics.
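By way of illustration only, the two steps named above (connecting the existing targets for a disease into a seed network, then superimposing experimental data on it) can be sketched in Python as follows. The sketch rests on assumptions that are not part of the original text: the interaction network is treated as an undirected graph, the existing targets are connected by breadth-first shortest paths, and the experimental data are reduced to one hypothetical score per protein (for example a log fold change). The function names, the protein labels, and the threshold are all hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def build_graph(interactions):
    """Build an undirected adjacency map from (protein, protein) pairs."""
    graph = defaultdict(set)
    for a, b in interactions:
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def shortest_path(graph, start, goal):
    """Breadth-first search; return one shortest path as a list, or None."""
    if start == goal:
        return [start]
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Sorted iteration keeps the result deterministic.
        for neighbor in sorted(graph.get(node, ())):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor == goal:
                path = [goal]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


def seed_network(graph, seeds):
    """Collect the seeds plus every node on a shortest path between two seeds."""
    nodes = set(seeds)
    for a, b in combinations(sorted(seeds), 2):
        path = shortest_path(graph, a, b)
        if path is not None:
            nodes.update(path)
    return nodes


def rank_candidates(network_nodes, seeds, scores, threshold=1.0):
    """Overlay per-protein scores; keep non-seed nodes whose |score| >= threshold."""
    candidates = []
    for node in network_nodes:
        score = scores.get(node, 0.0)  # assumption: unmeasured proteins score 0
        if node not in seeds and abs(score) >= threshold:
            candidates.append((node, score))
    return sorted(candidates, key=lambda item: -abs(item[1]))


if __name__ == "__main__":
    # Hypothetical toy data: T1..T3 are known targets, A..C are other proteins.
    toy = build_graph([("T1", "A"), ("A", "T2"), ("T2", "B"), ("B", "T3"), ("A", "C")])
    known = {"T1", "T2", "T3"}
    network = seed_network(toy, known)
    print(rank_candidates(network, known, {"A": 2.5, "B": -0.3, "C": 4.0}))
```

In this sketch a protein is reported as a candidate only if it both lies on a path connecting known targets and shows a strong experimental signal; a protein with a strong signal that lies outside the seed network is ignored. Other path-finding or scoring schemes could be substituted without changing the overall shape of the procedure.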
Over the last several years, known as the post-genomics era, we have seen a paradigm shift in life science research due to the unprecedented scale-up of several laboratory techniques such as automated DNA sequencing, global gene expression measurements, and proteomics and metabolomics techniques. The high-throughput (HT) data, collectively referred to as OMICs, are ubiquitous throughout the drug discovery pipeline, from target identification and validation, to the development and testing of drug candidates, to clinical trials. However, OMICs data are poorly utilized owing to the lack of adequate methods for interpretation in the context of disease and biological function. Although bioinformatics has developed robust statistical solutions for evaluating significance and clustering data points, statistics alone do not explain the underlying biology.
The complexity of human biology requires a system-wide approach to data analysis, which can be defined as the integration of OMICs data using computational methods. The field of systems biology holds that identifying the parts list of all the genes and proteins is insufficient to understand the whole. Rather, it is the assembly of these parts (the general schema, the modules and elements) and the dynamics of changes in response to stimuli that are truly the key to understanding life, form, and function. The assembly of cellular machinery is most properly presented as the interactome, the network of interconnected signaling, regulatory, and biochemical networks with proteins as the nodes and physical protein-protein interactions as the edges. Across many fields of science, technology, and social life, the topology and dynamics of complex networks are studied by graph theory. The information about protein interactions has been collected from the vast published experimental data, which are annotated and assembled in interaction databases. The network data analysis tools that are now commercially available are robust enough for simultaneous processing of dozens of large data files, each containing thousands of features, such as whole-genome expression microarrays. Just recently, researchers in systems biology announced the interpretation of experimental OMICs datasets in the context of accumulated knowledge on human functional networks as the first step in studying complex systems. With this development, the building of the basic framework of databases and logistics can be considered completed. Network-centered data analysis is now well underway at the major pharmaceutical companies.
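As a minimal illustration of the graph representation described above, with proteins as nodes and physical protein-protein interactions as edges, the following Python sketch stores an interactome as an adjacency map and computes one basic topological measure, the degree of each node. The protein labels, the function names, and the degree cutoff used to call a node a hub are hypothetical and are given only as an example.

```python
from collections import defaultdict


def build_interactome(interactions):
    """Store undirected protein-protein interactions as an adjacency map."""
    graph = defaultdict(set)
    for a, b in interactions:
        if a == b:
            continue  # assumption: self-interactions are ignored in this sketch
        graph[a].add(b)
        graph[b].add(a)
    return dict(graph)


def degrees(graph):
    """Return the number of distinct interaction partners of each protein."""
    return {protein: len(partners) for protein, partners in graph.items()}


def hubs(graph, min_degree):
    """Return, in sorted order, the proteins with at least min_degree partners."""
    return sorted(p for p, d in degrees(graph).items() if d >= min_degree)


if __name__ == "__main__":
    # Hypothetical interaction list; the duplicate pair is counted once.
    pairs = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P2", "P1")]
    interactome = build_interactome(pairs)
    print(degrees(interactome))
    print(hubs(interactome, min_degree=3))
```

Using sets for the neighbor lists means that an interaction reported more than once, or in both directions, contributes a single edge, which is one simple way to merge records drawn from several interaction databases.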