The application of genomics to life science industries promises to change the way pharmaceutical, agricultural, and biotechnology companies operate, saving significant amounts of time and money in the development of new and efficacious products. The original core concept of genomics research was that obtaining the genomic sequence of an organism would lead directly to identification of every gene in the organism and an unambiguous determination of the function of each identified gene. This promise rests on two basic assumptions. First, a basic paradigm of molecular biology is that each gene encodes one protein having one function. Second, it is assumed that by performing homology-based sequence comparisons, scientists can identify the function of most genes based on the sequence information available from public databases. Unfortunately, both of these assumptions are flawed and, as a result, the genomics era has yet to provide an accelerated route from gene discovery to blockbuster product. An additional complicating factor in the study of biological systems is that protein function is often defined in the context of a given situation, i.e., through interactions with other proteins and within specific cellular and subcellular compartments.
The assumption of a linear relationship between gene and function is now recognized as overly simplistic, at best. A “cause-and-effect” relationship between a single gene, its product, and a phenotype (or disease state) is the exception, not the rule. Some highly successful biopharmaceutical products, including insulin and erythropoietin, operate through their ability to modulate such linear relationships. However, problems such as ligand redundancies and cell-type specificities complicate the development of a pharmaceutical or agricultural product. In addition, many systems operate through nonlinear dose dependencies. In other words, at one concentration a compound may have one effect (such as an anti-inflammatory effect), while at a different concentration in the same cell type the compound may have the opposite effect (such as a pro-inflammatory effect). Issues of ligand redundancy, cell-type specificity, and nonlinear dose dependency are difficult to reconcile in a product development environment, even in cases where gene function is known or predictable. Moreover, many diseases are polygenic, so not only must multiple gene products be identified, but alternative treatment compounds are likely required to address the role each gene product plays in a disease process. M. Khodadoust & T. Klein, 19 NATURE BIOTECH. 707 (2001).
For years it was assumed that gene function could be determined by obtaining a gene sequence and performing a homology-based comparison. The central dogma is that similar sequence equals similar structure, which in turn equals similar function. However, gene annotations found in public databases are far from infallible, and overreliance on them may misdirect research efforts. In many cases, only a very small percentage of any given genome is actually experimentally annotated. Homology sequence comparisons and blanket application of the central dogma supply the remaining annotation. While amino acid identity greater than 40 percent between two complete protein sequences implies structural similarity, it does not necessarily imply functional similarity. Additional sequence conservation in an active site region is required for accurate prediction of function. Wilson et al., 297 J. MOL. BIOL. 233-249 (2000). Proteins are typically organized into families based on the similarity of three-dimensional structures. In some cases, members of the same protein family may have no detectable sequence similarity, illustrating that structural similarities do not necessarily imply sequence similarities, and vice versa. Current annotation available from public sources is largely incomplete, and as a result, sequence comparison is not a viable approach to determining the relative roles of genes sequenced in genomics projects.
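The distinction between overall identity and active site conservation can be illustrated with a minimal sketch. It assumes the two protein sequences have already been aligned to equal length, with "-" marking gaps. The function names, the example columns, and the requirement of complete active site conservation are hypothetical choices made for illustration; only the 40 percent figure comes from the text above.

```python
def percent_identity(aligned_a, aligned_b, columns=None):
    """Percent of compared alignment columns that hold identical residues.

    aligned_a, aligned_b: pre-aligned sequences of equal length ('-' = gap).
    columns: optional iterable of 0-based column indices to restrict the
             comparison to (for example, hypothetical active site columns).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if columns is None:
        columns = range(len(aligned_a))
    compared = 0
    matches = 0
    for i in columns:
        a, b = aligned_a[i], aligned_b[i]
        if a == "-" and b == "-":
            continue  # a column that is a gap in both sequences is skipped
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def supports_function_transfer(aligned_a, aligned_b, active_site_columns,
                               overall_min=40.0, site_min=100.0):
    """Return True only if overall identity exceeds overall_min AND the
    active site columns are conserved to at least site_min percent.

    overall_min follows the 40 percent figure in the text; site_min is an
    assumed, illustrative cutoff, not a value taken from the cited work.
    """
    overall = percent_identity(aligned_a, aligned_b)
    site = percent_identity(aligned_a, aligned_b, active_site_columns)
    return overall > overall_min and site >= site_min
```

Under these assumptions, two sequences can share well over 40 percent identity across their full length and still fail the check when the residues at the designated active site columns differ, which is the situation in which a homology-based annotation would be misleading.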
To meet the challenge of understanding complex biological systems, scientists require the ability to analyze complex data sets. As noted above, the sequencing of entire genomes has not led to an industry pipeline bulging with new life sciences products, nor has it led to an understanding of the function of all the sequenced genes. Currently, fewer than 5 percent of genes with annotation available from a public database are sufficiently well annotated for the information to be used directly in the development of products. As a result, a number of research technologies, such as gene expression profiling, metabolite analysis, phenotypic profiling, proteomics, 3-D protein structural analysis, protein expression, identification of biochemical pathways or networks, genotyping (including polymorphisms), and scientific literature tools, are under development to help identify gene function. Each technology has its strengths and weaknesses, and no single existing technology is sufficient to identify the function of all genes.
Since no single technology is the answer to gene function identification, the challenge is to combine data from different technology types into meaningful data sets. Unfortunately, combining data from various sources is fraught with substantial technical problems in data organization and data analysis. Research technology systems organize data in different ways. Different research technologies use different analysis tools, which ask conceptually different questions. Analysis tools used in association with different technologies can provide dissimilar and even contradictory conclusions with respect to gene function and other data end points. It seems likely that for the majority of genes, the identification of function will only become possible if data from a variety of sources and technologies are organized as a single, logical data set. That is, the potential of multi-technology genomic research has not yet been realized because there is no common currency for integration and analysis of large quantities of heterogeneous data. Thus, there exists a need for a meaningful way to produce and analyze multi-technology-derived data to provide scientists with as yet untapped knowledge to aid in the development of new and efficacious agricultural, pharmaceutical, forensic, and nutraceutical products.
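The idea of organizing data from several technologies as a single, logical data set, and of surfacing contradictory conclusions, can be sketched as follows. This is only an illustration under stated assumptions: each technology is assumed to report one functional call per gene as a text label, a shared gene identifier is assumed to serve as the common key, and the function names are hypothetical. It is not a description of any particular integration system.

```python
from collections import defaultdict


def integrate(sources):
    """Merge per-technology results into one record per gene.

    sources: dict mapping a technology name to a dict of
             gene identifier -> functional call (a text label).
    Returns a dict mapping gene identifier -> {technology: call}.
    """
    merged = defaultdict(dict)
    for technology, calls in sources.items():
        for gene_id, call in calls.items():
            merged[gene_id][technology] = call
    return dict(merged)


def conflicting_genes(merged):
    """List genes for which the technologies disagree on the call."""
    return sorted(
        gene_id
        for gene_id, calls in merged.items()
        if len(set(calls.values())) > 1
    )
```

Keying every record on one shared identifier is the simplest possible "common currency". A practical system would also need to reconcile differing identifiers, units, and confidence levels across technologies, which this sketch does not attempt.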