Systems biology is an important and relatively new research field that is generating massive quantities of often disparate data. Interest in studying biology from a systems perspective has grown as the assumption of a linear relationship between gene and function has been recognized to be overly simplistic, at best. A “cause-and-effect” relationship between a single gene, its product, and a phenotype (or disease state) is the exception, not the rule. Some highly successful biopharmaceutical products, including insulin and erythropoietin, operate through their ability to modulate such linear relationships. However, problems such as ligand redundancies and cell-type specificities complicate the development of a pharmaceutical or agricultural product. In addition, many systems operate through nonlinear dose dependencies: at one concentration a compound may have one effect (such as an anti-inflammatory effect), while at a different concentration in the same cell type the compound may have the opposite effect (such as a pro-inflammatory effect). Issues of ligand redundancy, cell-type specificity, and nonlinear dose dependency are difficult to reconcile in a product development environment, even in cases where gene function is known or predictable. To further complicate matters, many diseases are polygenic, so not only must multiple gene products be identified, but alternate treatment compounds are likely required to address the role each gene product plays in a disease process. M. Khodadoust and T. Klein (2001) Nature Biotech. 19:707.
For years it was assumed that gene function could be determined by obtaining a gene sequence and performing a homology-based comparison. The central dogma of this approach is that similar sequence implies similar structure, which in turn implies similar function. However, gene annotations found in public databases are far from infallible, and over-reliance on them may misdirect research efforts. In many cases, only a very small percentage of any given genome is actually experimentally annotated; homology-based sequence comparisons and blanket application of the central dogma supply the remaining annotation. While amino acid identity greater than 40 percent between two complete protein sequences implies structural similarity, it does not necessarily imply functional similarity. Additional sequence conservation in an active site region is required for accurate prediction of function. Wilson et al. (2000) J. Mol. Biol. 297:233-249. Proteins are typically organized into families based on the similarity of three-dimensional structures. In some cases, members of the same protein family may have no detectable sequence similarity, illustrating that structural similarity does not necessarily imply sequence similarity, and vice versa. Current annotation available from public sources is largely incomplete, and as a result, sequence comparison is not a viable approach to determining the relative roles of genes sequenced in genomics projects.
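By way of illustration only, the homology-based reasoning described above may be sketched as follows. The sketch assumes two protein sequences that have already been aligned to equal length with "-" as the gap character, and it computes identity over ungapped columns, which is only one of several conventions in use. The function names, the `site_columns` parameter, and the toy sequences are hypothetical and are not drawn from any cited work; the 40 percent threshold is the figure stated above.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of ungapped alignment columns holding identical residues.

    Assumption: identity is measured over columns where neither sequence
    has a gap. Other denominators are also used in practice.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # skip columns containing a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def active_site_conserved(aligned_a, aligned_b, site_columns, gap="-"):
    """True only if every listed alignment column is an identical, ungapped residue."""
    return all(
        aligned_a[i] == aligned_b[i] and aligned_a[i] != gap
        for i in site_columns
    )


def assess_homology(aligned_a, aligned_b, site_columns, threshold=40.0):
    """Apply the two-step rule: overall identity first, then the active site."""
    identity = percent_identity(aligned_a, aligned_b)
    structure_likely_similar = identity > threshold
    function_transfer_supported = structure_likely_similar and active_site_conserved(
        aligned_a, aligned_b, site_columns
    )
    return {
        "identity": identity,
        "structure_likely_similar": structure_likely_similar,
        "function_transfer_supported": function_transfer_supported,
    }


# Hypothetical toy alignment: high overall identity, but the residue at
# column 4 (treated here as the active site) differs between the two.
result = assess_homology("MKTAYIAKQR", "MKTAHIAKQR", site_columns=[4])
print(result)
```

In this toy case the overall identity clears the threshold while the assumed active-site column is not conserved, so the sketch reports likely structural similarity but does not support transferring the functional annotation.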
To meet the challenge of understanding complex biological systems, scientists require the ability to analyze complex data sets. As noted above, the sequencing of entire genomes has not led to an industry pipeline bulging with new life sciences products, nor has it led to an understanding of the function of all the sequenced genes. Currently, fewer than 5 percent of genes with annotation available from a public database are sufficiently well annotated for the information to be used directly in the development of products. As a result, a number of research technologies, such as gene expression profiling, metabolite analysis, phenotypic profiling, proteomics, 3-D protein structural analysis, protein expression, identification of biochemical pathways or networks, genotyping (including polymorphisms), and scientific literature tools, are under development to help identify gene function. Each technology has its strengths and weaknesses, but all are necessary, as no single existing technology is sufficient to identify the function of all genes.
To meet the challenge of analyzing large, complex data sets, researchers require tools that organize, analyze and present data in a meaningful way. Especially helpful are pattern recognition tools, and tools that place data in a biological context. Presented herein are computational methods and systems for identifying, from data, network patterns in complex biological systems.