Proteins are primary components of the complex, interconnected pathways of cellular function. Proteomics, the study of protein location, interaction, structure and function, aims to identify and characterize the proteins differentially expressed in normal versus diseased biological samples. Abnormalities in protein production or function have been connected to many mammalian conditions, diseases and disorders. Therefore, the ability to identify proteins that cause or contribute to disease processes and to correctly supply or modulate these proteins represents an opportunity for therapeutic intervention. For example, oncoproteins can cause cancer by interacting with and activating proteins responsible for cell division in a manner that results in unregulated cellular growth. Protein interactions are also central to a virus recognizing its cell surface receptor prior to infection. Exact identification of such proteins and their interactions not only leads to a broader understanding of protein biology, but also provides the specific molecules that can be used to develop effective therapeutics.
Biochemical and genetic methods have been used to study protein interactions. The biochemical methods are laborious and slow, involving numerous steps including isolation, purification, sequencing, and characterization of the proteins being tested for interaction. Small domains, regions within the proteins that fold independently, facilitate such interactions. Genetic approaches have gained in popularity because they allow more rapid detection of the domains involved in protein interactions.
The Human Genome Project (HGP) has produced vast amounts of sequence data for analysis. Numerous novel sequences, some of them conserved in diverse species from E. coli to Homo sapiens, have been identified. The amount of data to be sorted, characterized and mapped has led to the development of numerous databases that contain genomic and protein sequence information. A parallel explosion of information is resulting from biochemical studies of newly discovered proteins. Data on differential expression of these proteins have been widely collected for cells and tissues of many species under specific conditions. The mining of such data, mainly through homology searches, provides a broad view of how genes are expressed. In some cells and tissues, differential expression of mRNA is proportional to expression of the encoded proteins (Glavas et al. (2001) Proc Natl Acad Sci 98:6319–6324). Relating sequence data to protein pathways adds further levels of complexity, but also identifies the best points for intervention in the disease process.
A pathway is a collection of at least two proteins or molecules connected by their interactions in a cell or tissue. Two pathways in the same species are called homologous if their proteins and/or interactions are similar, whereas similar pathways that occur in different species are called orthologous. Although the storage of information on protein interactions is at an early stage, additional tools for analyzing this information, including methods for performing homologous and orthologous pathway searches and showing the relationships among pathways, are needed to handle the large amounts of information that will be stored in relational databases.
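By way of illustration only, the definitions above can be sketched in Python. The sketch assumes that proteins are labeled with identifiers that are comparable across species (for example, family names), that similarity is scored as the mean of the Jaccard overlaps of the protein sets and the interaction sets, and that a cutoff of 0.5 separates similar from dissimilar pathways. The names `Pathway`, `make_pathway`, `similarity`, and `classify`, the scoring rule, and the cutoff are hypothetical choices made for this example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Tuple

# An interaction is an unordered pair of protein labels.
Interaction = FrozenSet[str]


@dataclass(frozen=True)
class Pathway:
    """At least two proteins connected by their interactions."""
    name: str
    species: str
    interactions: FrozenSet[Interaction]

    def __post_init__(self) -> None:
        if len(self.proteins) < 2:
            raise ValueError("a pathway needs at least two proteins")

    @property
    def proteins(self) -> FrozenSet[str]:
        # Every protein that takes part in at least one interaction.
        return frozenset(p for pair in self.interactions for p in pair)


def make_pathway(name: str, species: str,
                 pairs: Iterable[Tuple[str, str]]) -> Pathway:
    """Build a Pathway from (protein, protein) tuples."""
    return Pathway(name, species, frozenset(frozenset(p) for p in pairs))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity(p: Pathway, q: Pathway) -> float:
    # Assumed scoring rule: mean of protein overlap and interaction overlap.
    return 0.5 * (jaccard(p.proteins, q.proteins)
                  + jaccard(p.interactions, q.interactions))


def classify(p: Pathway, q: Pathway, threshold: float = 0.5) -> str:
    """Label a pair of pathways following the definitions in the text."""
    if similarity(p, q) < threshold:
        return "unrelated"
    return "homologous" if p.species == q.species else "orthologous"


if __name__ == "__main__":
    # Protein labels below are illustrative only.
    human = make_pathway("cascade", "human",
                         [("RAS", "RAF"), ("RAF", "MEK"), ("MEK", "ERK")])
    mouse = make_pathway("cascade", "mouse",
                         [("RAS", "RAF"), ("RAF", "MEK"), ("MEK", "ERK")])
    print(classify(human, mouse))
```

Storing each interaction as an unordered pair keeps the comparison independent of the order in which the two partners are recorded. A practical search would replace exact label matching with a sequence homology score, which this sketch does not attempt.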
Bioinformatics capability and capacity will be needed to handle the sequence data produced from large sequencing efforts. Many large-scale sequence databases, including NCBI GenBank (Bethesda, Md.) and SwissProt (Geneva, Switzerland), have already been built to maintain current genomic and protein sequence information. Many sequence comparison packages such as LASERGENE software (DNASTAR, Madison, Wis.) and specialized tools such as BLAST2 (Altschul et al. (1997) Nucleic Acids Res 25:3389–3402) and PHRAP (Phillip Green, University of Washington, Seattle, Wash.) have been developed to study the sequences in these large databases. In addition, hidden Markov models (HMMs; Pearson and Lipman (1988) Proc Natl Acad Sci 85:2444–2448) have been developed to work with proteins or with a protein families database such as the PFAM database (Washington University, St. Louis, Mo.). For example, the SignalP program (Nielsen et al. (1997) Protein Engineering 10:1–6) is an HMM that searches a given protein sequence for a signal peptide and its cleavage site. The pathways and bioinformatics tools above are discussed inter alia in Kanehisa and Goto (2000; Nucleic Acids Res 28:27–30); Nakao et al. (1999; Genome Inform Ser Workshop Genome Inform 10:94–103); and U.S. Pat. No. 6,057,101, which are incorporated by reference herein.
Although sequence homology has been the core of bioinformatic searches for many years, the more recent use of statistical clustering analysis has provided a major advance in the study of genes, proteins, and their function. The present invention satisfies a need in the art by providing methods for establishing a pathways database that integrates biochemical or metabolic pathway data, sequence homology data, and expression data, and for using novel algorithms for optimization, dynamic programming, and constrained clustering with the pathways database to advance discovery of protein function and therapeutic intervention.