The present invention pertains to computational biology, post-genomic informatics, structural proteomics and functional proteomics. The invention uses evolutionary computation approaches to design and select simulation scenarios of protein-protein interactions for functional proteomic modeling.
Prior art patent applications that apply to the present invention mainly involve structural proteomics mapping, protein pathway discovery mapping and specific disease application protein mapping.
In Rzhetsky (molecular interaction network prediction), U.S. patent application publication number 20030068610, Palsson (operational reaction pathway identification), U.S. patent application number 20040072723, Heal (protein sequence interaction rule prediction), U.S. patent application number 20030059844, and Gustafsson (functional biomolecule identification), U.S. patent application number 20040072245, systems are developed to identify structural protein relationships. Unknown molecular interactions, protein sequence activity relationships and protein reaction pathways are mapped using computational methods involving data search space development, probabilistic analysis, comparison analysis or rule prediction. These approaches are limited to structural proteomics mapping.
Lett (image-based biological simulations), U.S. patent application number 20030018457, teaches a method to simulate structural protein image data in time series to modify model predictions. Ramnarayan (structural protein modeling of polymorphisms for drug design), U.S. patent application number 20030158672, compares healthy and mutant structural protein 3-D modeling for pharmacogenomics drug design. These patent applications model 3-D or time series data but are limited to isolated proteins' structures.
Liu (neurological disorder inhibitor), U.S. patent application number 20020006606, presents a model to inhibit JNK and MLK kinase activity to prevent neuronal cell death in neurodegenerative disease. This approach does not model the process of protein function in this specific disease application to show how the proposed therapy is effective.
Most of the research history involving the technologies of the present system—including structural protein prediction, protein pathway prediction, protein model generation, SNP identification, personalized medicine and evolutionary computation—is represented in the academic literature described below.
The development of proteomics is fairly recent. The massive data sets derived from the human genome present a vast treasure of information about proteins. Theorists from biology and chemistry have built models in which the genetic data are useful for understanding individual protein structures. Data about the structure of individual proteins are input into numerous protein databases. These databases include the Berkeley Structural Genomics Center, Joint Center for Structural Genomics, Oxford Protein Production Facility, Protein Structure Factory and Structural Proteomics in Europe. In addition to structural proteomic (SP) data collection resources, there are a number of protein interaction databases: the Biomolecular Interaction Network Database, the Database of Interacting Proteins, the General Repository for Interaction Datasets, the Human Protein Interaction Database and the Human Protein Reference Database. These databases generally input protein information collected by biomolecular researchers. The problem then emerges of how to organize this vast data reservoir in order to improve our understanding of protein processes.
Much research in bioinformatics is directed to the prediction of protein structures from raw protein data. The goal here is to model individual proteins in a 3-D way akin to capturing portraits of a range of individuals. This work is preliminary to understanding the operation and functioning of proteins in specific cellular pathways.
Professor Kim et al., at the University of California, Berkeley, have taken a step towards providing order to these protein data sets. Kim used computer analyses to calculate the relationships within a sampling of human proteins in order to develop a structural proteomic computer model. In this research, a 3-D representation of the protein fold space is presented, which is generally considered to be a sort of protein periodic table (PPT). These SP data are organized to plainly show the evolution of protein structures from simple to complex forms. In this preliminary work, however, Kim does not place the PPT model into a functional model in order to give operational meaning to the fundamental protein structure data. Simulations based on the PPT are thus restricted in terms of their useful functional information.
Paek et al. at the University of Seoul in the Republic of Korea have presented a multi-layered model to represent cell signalling pathways. Software, such as Vector PathBlazer (and others), is also available to map biological pathways and present protein-protein interaction analysis, though such software is generally limited because it relies on genomic and SP data sets. Using software tools for functional protein modeling, a new generation of biosystems modeling is available that will rapidly accelerate our understanding of genetic information. The HAPMAP is a database that collects information about haplotypes, which are combinations of single nucleotide polymorphisms (SNPs). This genetic mutation information is significant for identifying disease sources. However, the HAPMAP focuses on common haplotypes and not specific individuals' haplotypes and hence is not useful in the development of personalized medicine.
The goal of physicians and biological researchers is personalized medicine: taking information about an individual's disease, using experimental biological and computer techniques to trace its source to the genetic level, developing a combination of drugs to treat the disease, and refining the therapy in a customized way. Yet this goal of pharmacogenomics has been possible only since the human genome was deciphered. So far, only small advances have been made in which specific mutations in individuals with specific diseases, such as forms of cancer, have been traced to the genomic source. In these cases, customized combination drug therapies targeted to individual pathologies manage the disease.
The field of bioinformatics applies computational analysis to the biological sciences. One main research model for bioinformatics has been the application of artificial intelligence to biological systems. Koza and G. Fogel have done early research in this field. Koza's research on genetic programming, building on Holland's research in genetic algorithms, generally emulates biological processes of evolution by developing multiple generations of programs based on principles of mutation, sexual reproduction and natural selection in order to solve complex optimization problems. Guyon (pattern identification in biological systems), U.S. patent application number 20030172043, presents methods that use Support Vector Machines and Recursive Feature Elimination by optimizing training weights in a classifier for pattern identification. While this method applies evolutionary computation (EC) techniques to gene and SP classification, it does not produce functional proteomic (FP) activity patterns that are useful for understanding proteomic processes.
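The generational cycle of mutation, sexual reproduction and natural selection described above can be illustrated with a minimal sketch. This is a generic textbook-style genetic algorithm and is not drawn from Koza, Holland or any cited application. The function name `evolve_bitstring`, the bit-counting fitness function and all parameter values are assumptions chosen only for illustration.

```python
import random


def evolve_bitstring(length=20, pop_size=30, generations=60,
                     mutation_rate=0.02, seed=0):
    """Toy genetic algorithm (illustrative assumption, not a cited method).

    Individuals are bit lists; fitness is the number of 1 bits.
    Returns the best individual and the best fitness seen at the start
    of each generation plus the final one.
    """
    rng = random.Random(seed)
    fitness = sum  # hypothetical fitness: count of 1 bits

    pop = [[rng.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    history = []

    def select():
        # Natural selection: the fitter of two random individuals wins.
        a, b = rng.sample(pop, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        best = max(pop, key=fitness)
        history.append(fitness(best))
        next_pop = [best[:]]  # keep the current best unchanged (elitism)
        while len(next_pop) < pop_size:
            p1, p2 = select(), select()
            # Sexual reproduction: one-point crossover of two parents.
            cut = rng.randrange(1, length)
            child = p1[:cut] + p2[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g
                     for g in child]
            next_pop.append(child)
        pop = next_pop

    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best_individual, best_history = evolve_bitstring()
```

Because the best individual is copied unchanged into each new generation, the best fitness in this sketch never decreases from one generation to the next.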
Finally, the Santa Fe Institute (SFI) has accomplished sophisticated computational analyses of biological processes. SFI researchers have developed EC models for application to biological self-organizing systems in an effort to emulate these complex processes. By simulating genetic interactions, these researchers have developed a paradigm to understand the functional operation of complex evolutionary systems. However, this highly theoretical work has failed to provide useful systematic functional proteomic models or pharmacoproteomic models.
While the identification of the architecture of genes in the Human Genome Project (HGP) presents information on the construction of individual proteomic structures, much more needs to be done to advance our understanding of proteomic function. For example, if genetic diseases are caused by unique combinations of genetic mutations, the identification of these mutations is critical to understanding disease sources and finding solutions. Development of the HGP thus enables a shift in the emphasis in the biological sciences toward the personalized identification and cure of disease. The field of human genetics shifts its emphasis to proteomics, pharmacogenomics and pharmacoproteomics.
The use of advanced computational analysis is fundamental to the field of proteomics. While most proteomics research so far has focused on predicting 3-D representations of protein structures, much work is yet to be done on understanding the operation of protein interactions in cellular pathways. One application of evolutionary computation to functional proteomics, for instance, is to compute the values of training weights of protein interactions so as to accurately emulate optimal FP operations. These research streams are preliminary to our understanding of protein operations and leave much yet to be done.
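As a hedged illustration of computing interaction weights by evolutionary computation, the following sketch evolves a small weight vector so that a toy linear model reproduces observed activity values. The linear model, the synthetic observations, the hidden target weights and every name in the code (`predicted_activity`, `evolve_weights` and so on) are assumptions made for the example. They do not describe the method of the present invention or any real proteomic data.

```python
import random


def predicted_activity(weights, inputs):
    # Toy model (assumption): activity is a weighted sum of partner levels.
    return sum(w * x for w, x in zip(weights, inputs))


def error(weights, observations):
    # Mean squared difference between predicted and observed activity.
    return sum((predicted_activity(weights, x) - y) ** 2
               for x, y in observations) / len(observations)


def evolve_weights(observations, n_weights, pop_size=20,
                   generations=200, sigma=0.1, seed=0):
    """Simple truncation-selection evolution of real-valued weights."""
    rng = random.Random(seed)
    pop = [[rng.uniform(-1, 1) for _ in range(n_weights)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: the better half survives unchanged.
        pop.sort(key=lambda w: error(w, observations))
        parents = pop[: pop_size // 2]
        # Mutation: each child is a parent plus Gaussian noise.
        children = [[w + rng.gauss(0, sigma) for w in rng.choice(parents)]
                    for _ in range(pop_size - len(parents))]
        pop = parents + children
    return min(pop, key=lambda w: error(w, observations))


# Synthetic observations from hypothetical "true" interaction weights.
_data_rng = random.Random(1)
true_weights = [0.8, -0.3, 0.5]
observations = []
for _ in range(30):
    levels = [_data_rng.uniform(0, 1) for _ in true_weights]
    observations.append((levels, predicted_activity(true_weights, levels)))

initial_best = evolve_weights(observations, 3, generations=0)
evolved_best = evolve_weights(observations, 3, generations=200)
```

Since surviving parents are carried over unchanged, the best error in the population cannot increase between generations, so `evolved_best` fits the synthetic observations at least as well as `initial_best`. A realistic model of protein interactions would be nonlinear and far larger; the sketch shows only the shape of the search.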
Key Challenges
Now that the human genome has been sequenced, the next frontier for the biological sciences is post-genomic informatics and proteomics. Proteomics, the computational analysis of proteins, is divided into structural proteomics and functional proteomics. Structural proteomics seeks to understand the organizational properties of proteins from their twenty amino acid components, including geometrical and topological characteristics of protein configurations. Functional proteomics seeks to understand how proteins interact in a dynamic cellular environment.
Whereas genomics has been concerned with identifying the thirty-six thousand genes in the human genome, which consist of about three billion nucleic acid components, proteomics is concerned with a hundred times more information. Since cellular behavior arises from the interactions of hundreds of thousands of proteins, it is critical to understand interactions within this complex system if we are to understand the healthy and pathological operations of biology. By identifying the causes and organization of pathological proteomic interactions, researchers may be able not only to understand their genetic causes but also to design effective therapies.
There are several key questions raised by functional proteomics. How can functional maps of proteins be organized from limited information? How can genetic information be connected to proteomic function and pathology? How can the function of certain proteins be predicted based on analogous protein structures, functions and interactions? How can multivariate simulations be designed that posit various protein pathway scenarios? How can dynamic simulations of proteomic processes be designed that present a methodology to select optimal as well as suboptimal simulation scenarios? How can protein irregularities and pathologies be modeled? How can cellular dysfunctions be isolated in silico and the conditions reverse engineered to discover the genetic source? How can dysfunctional protein-protein interactions be simulated?
How can pharmacoproteomic therapies be designed based on simulations of an individual's unique pathology and genetic mutations? How can these functional proteomics modeling approaches be used to engineer complex chemical compounds that repair genetic damage manifested in protein malfunctions? How can systems be designed to create DNA-based therapies and multivariate scenarios to test new chemical compounds so as to minimize side effects and injurious drug interactions?
The present invention addresses the challenges expressed in these questions.
The challenge of functional proteomics is to develop methods to visualize protein activity, typically with imperfect information. To do this, it is necessary to develop models from which simulations can be generated. Once healthy protein structures are mapped and functional proteomic activities are simulated, it becomes possible to analyze dysfunctional protein interaction processes. With information resources like the HGP and the HAPMAP, genetic information and mutation information can inform FP models about these dysfunctional protein operations. Not only can we trace the source of genetic diseases, we can now understand their complex operations, and thus move closer to developing effective therapies to manage them. So far, a large knowledge gap remains between the massive genomic data sets that we already have, on the one hand, and the useful data for biological systems that need to be developed, on the other. The expedient application of novel computational and experimental techniques is proposed to solve these problems.
As knowledge of functional proteomics increases, we should be able to identify the optimal parameters of good health, which will lead to increased longevity, and also identify the biochemical processes that cause and treat disease. In particular, the ability of the human body to fight various types of cancer and viruses, as well as degeneration manifested in aging, may be contingent on a better understanding of functional proteomics. The present invention therefore seeks to identify novel methods to meet these challenges and demonstrate (1) protein function visualization, (2) protein pathology identification and (3) personalized drug discovery and testing.