The present invention relates to methods and systems for predicting the function of proteins. In particular, the invention relates to materials, software, automated systems, and methods for implementing the same in order to predict the function(s) of a protein.
A central tenet of modern biology is that genetic information resides in a nucleic acid genome, and that the information embodied in such a genome (i.e., the genotype) directs cell function. This occurs through the expression of various genes in the genome of an organism and regulation of the expression of such genes. The expression of genes in a cell or organism defines the cell or organism's physical characteristics (i.e., its phenotype). This is accomplished through the translation of genes into proteins. Proteins (or polypeptides) are linear polymers of amino acids. The polymerization reaction, which produces a protein, results in the loss of one molecule of water from each amino acid, and hence proteins are often said to be composed of amino acid "residues." Natural protein molecules may contain as many as 20 different types of amino acid residues, each of which contains a distinctive side chain. The particular linear sequence of amino acid residues in a protein defines the primary sequence, or primary structure, of the protein. The primary structure of a protein can be determined with relative ease using known methods.
In order to more fully understand and determine potential therapeutics, antibiotics and biologics for various organisms, efforts have been made to sequence the genomes of a number of organisms. For example, the Human Genome Project began with the specific goal of obtaining the complete sequence of the human genome and determining the biochemical function(s) of each gene. To date, the project has resulted in sequencing a substantial portion of the human genome (Gibbs, 1995). At least twenty-one other genomes have already been sequenced, including, for example, M. genitalium (Fraser et al., 1995), M. jannaschii (Bult et al., 1996), H. influenzae (Fleischmann et al., 1995), E. coli (Blattner et al., 1997), and yeast (S. cerevisiae) (Mewes et al., 1997). Significant progress has also been made in sequencing the genomes of model organisms, such as mouse, C. elegans, Arabidopsis sp. and D. melanogaster. Several databases containing genomic information annotated with some functional information are maintained by different organizations, and are accessible via the internet. The raw nucleic acid sequences in a genome can be converted by one of a number of available algorithms to the amino acid sequences of proteins, which carry out the vast array of processes in a cell. Unfortunately, these raw protein sequence data do not immediately describe how the proteins function in the cell. Understanding the details of various cellular processes (e.g., metabolic pathways, signaling between molecules, cell division, etc.), and which proteins carry out which processes, is a central goal in modern cell biology.
Throughout evolution, the protein sequences in different organisms have been conserved to varying degrees. As a result, any given organism contains many proteins that are recognizably similar to proteins in other organisms. Such similar proteins, having arisen from the same ancestral protein, are called homologs.
To a degree, homology between proteins is useful in assigning biological functions to new protein sequences. The most direct approach for assigning functions to proteins is laborious laboratory experimentation. However, if a particular uncharacterized protein sequence is homologous to one that has already been studied experimentally, the function of the former can often be equated to the function of the latter.
Unfortunately, the ability to assign functions to proteins by homology is limited. Many protein sequences do not have experimentally characterized homologs in other organisms. Depending on the organism, between one-third and one-half of the proteins in a genome cannot be assigned functions by homology or other available computational methods. Accordingly, new methods for predicting the functions of proteins from genome sequences are needed.
Determining protein functions from genomic sequences is a central goal of bioinformatics. Genomic sequences do not contain explicit information on the function of the proteins that they encode, yet this information is critical in medical and agricultural biotechnology. The invention provides materials, software, automated systems, and methods that are useful for predicting protein function. Such information is useful, for example, for identifying new genes and identifying potential targets for pharmaceutical compounds.
In one embodiment, the invention provides a method to predict functional links (e.g., associations between proteins) based on the concept that proteins that function together in a pathway or structural complex can often be found in another organism fused together into a single protein. By identifying these patterns of relationship or gene fusion one can predict the interactions between unknown proteins based on the similar sequence information found in other related proteins (i.e., either functionally related or physically related). Through sequence comparison, one can identify a fused protein, termed herein the "Rosetta Stone" protein, which is similar over different regions to two distinct proteins that are not similar to each other. This establishes a functional link between two otherwise unrelated proteins. The inventors have discovered that proteins that can be associated together via the Rosetta Stone protein tend strongly to be functionally linked.
In another embodiment, the invention provides a computational method that detects proteins that participate in a common structural complex or metabolic pathway. Proteins within these groups are defined as "functionally-linked." Functionally-linked proteins evolve in a correlated fashion, and therefore they have homologs in the same subset of organisms. For instance, it is expected that flagellar proteins will be found in bacteria that possess flagella but not in other organisms. Simply put, if two proteins have homologs in the same subset of fully (or nearly fully) sequenced organisms but are absent in other organisms, they are likely to be functionally-linked. The present invention provides a method wherein this property is used to systematically map functional interactions between all the proteins coded by a genome. This method overcomes the problem that pairs of functionally linked proteins in general have no amino acid sequence similarity with each other and therefore cannot be linked by conventional sequence alignment techniques.
One embodiment provides a method of identifying multiple polypeptides as functionally-linked, the method including aligning a primary amino acid sequence of multiple distinct non-homologous polypeptides to the primary amino acid sequences of a plurality of proteins; and for any alignment found between the primary amino acid sequences of all of such multiple distinct non-homologous polypeptides and the primary amino acid sequence of at least one such protein, outputting an indication identifying the at least one such protein as an indication of a functional link between the multiple polypeptides.
In another embodiment, a computer program is provided for identifying a protein as functionally linked, the computer program comprising instructions for causing a computer system to align a primary amino acid sequence of multiple distinct non-homologous polypeptides to the primary amino acid sequences of a plurality of proteins; and for any alignment found between the primary amino acid sequences of all polypeptides and the primary amino acid sequence of at least one such protein, output an indication of an identity of such protein.
In yet another embodiment, the invention provides a method of identifying a plurality of polypeptides as having a functional link, the method including aligning a primary amino acid sequence of a protein to the primary amino acid sequences of each of a plurality of distinct non-homologous polypeptides; and for any alignment found between the primary amino acid sequence of the protein and the primary amino acid sequence of the plurality of distinct non-homologous polypeptides, wherein the primary amino acid sequence of the protein contains an amino acid sequence similar to at least two distinct non-homologous polypeptides, outputting an indication identifying any distinct non-homologous polypeptides as functionally-linked.
In another embodiment, the invention provides a computer program, stored on a computer-readable medium, for identifying a plurality of polypeptides as having a functional link, the computer program comprising instructions for causing a computer system to align a primary amino acid sequence of a protein to the primary amino acid sequences of each of a plurality of distinct non-homologous polypeptides; and for any alignment found between the primary amino acid sequence of the protein and the primary amino acid sequences of the plurality of distinct non-homologous polypeptides, wherein the primary amino acid sequence of the protein contains an amino acid sequence from at least two distinct non-homologous polypeptides, output an indication identifying any distinct non-homologous polypeptides as functionally-linked.
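The fusion-based embodiments above can be illustrated with a minimal sketch. It assumes that alignments have already been computed by some external tool and reduced to tuples of (query polypeptide, target protein, first aligned residue on the target, last aligned residue on the target). The function name `find_rosetta_links`, the tuple layout, and the `homologous_pairs` argument are illustrative assumptions and are not taken from the specification.

```python
from collections import defaultdict
from itertools import combinations


def find_rosetta_links(hits, homologous_pairs=frozenset(), max_overlap=0):
    """Return {frozenset({q1, q2}): set of candidate fusion proteins}.

    hits: iterable of (query_id, target_id, start, end), where start and end
          are inclusive residue positions of the aligned region on the target.
    homologous_pairs: set of frozenset({q1, q2}) pairs already known to be
          homologous to each other; such pairs are never reported as linked.
    max_overlap: largest number of shared target residues still treated as
          "different regions" of the target (assumed default: 0).
    """
    regions_by_target = defaultdict(list)
    for query, target, start, end in hits:
        if query != target:  # a protein trivially aligns to itself
            regions_by_target[target].append((query, start, end))

    links = defaultdict(set)
    for target, regions in regions_by_target.items():
        for (q1, s1, e1), (q2, s2, e2) in combinations(regions, 2):
            pair = frozenset((q1, q2))
            if q1 == q2 or pair in homologous_pairs:
                continue
            # Number of target residues covered by both aligned regions.
            overlap = min(e1, e2) - max(s1, s2) + 1
            if overlap <= max_overlap:
                links[pair].add(target)
    return dict(links)


# Hypothetical identifiers: A, B, C are query polypeptides; F, G are targets.
example_hits = [
    ("A", "F", 1, 100),
    ("B", "F", 120, 260),
    ("C", "F", 10, 90),
    ("A", "G", 1, 100),
]
example_links = find_rosetta_links(
    example_hits, homologous_pairs={frozenset(("A", "C"))}
)
```

In this sketch a target protein is reported for a pair of queries only when the two queries align to non-overlapping regions of it and are not themselves listed as homologs, which mirrors the requirement that the linked polypeptides be distinct and non-homologous.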
In yet another embodiment, the invention provides a method for identifying multiple proteins as having a functional link, comprising obtaining data, comprising a list of proteins from at least two genomes; comparing the list of proteins to form a protein phylogenetic profile for each protein or protein family, wherein the protein phylogenetic profile indicates the presence or absence of a protein belonging to a particular protein family in each of the at least two genomes based on homology of the proteins; and grouping the list of proteins based on similar profiles, wherein proteins with similar profiles are indicated to be functionally linked.
In yet still another embodiment, the invention provides a computer program, stored on a computer-readable medium, for identifying multiple polypeptides as having a functional link, the computer program comprising instructions for causing a computer system to obtain data, comprising a list of proteins from at least two genomes; compare the data to form a protein phylogenetic profile for each protein or protein family, wherein the protein phylogenetic profile indicates the presence or absence of a protein belonging to a particular protein family in each of the at least two genomes based on homology of the proteins; and group the list of proteins based on similar profiles, wherein proteins with similar profiles are indicated to be functionally linked.
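The phylogenetic profile embodiments above can be sketched as follows. The sketch assumes that homology searches have already established, for each protein, the set of genomes containing a homolog. It groups proteins only when their profiles are identical, which is the simplest possible reading of "similar profiles". The function names and the identifiers P1 to P3 and G1 to G3 are hypothetical.

```python
from collections import defaultdict


def build_profiles(presence, genomes):
    """Map each protein to a tuple of 0/1 flags, one flag per genome.

    presence: {protein_id: set of genome ids in which a homolog was found}
    genomes:  ordered list of genome ids defining the profile positions
    """
    return {
        protein: tuple(int(genome in found) for genome in genomes)
        for protein, found in presence.items()
    }


def group_by_profile(profiles):
    """Group proteins sharing an identical profile; singletons are dropped."""
    groups = defaultdict(list)
    for protein, profile in profiles.items():
        groups[profile].append(protein)
    return {
        profile: sorted(members)
        for profile, members in groups.items()
        if len(members) > 1
    }


# Hypothetical input: P1 and P2 have homologs in the same subset of genomes.
example_presence = {
    "P1": {"G1", "G2"},
    "P2": {"G1", "G2"},
    "P3": {"G1", "G2", "G3"},
}
example_genomes = ["G1", "G2", "G3"]
example_profiles = build_profiles(example_presence, example_genomes)
example_groups = group_by_profile(example_profiles)
```

A practical implementation would likely relax the identity requirement, for example by allowing profiles that differ in a small number of positions, and would treat profiles that are present in every genome as uninformative.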
In yet another embodiment, the invention provides a method for determining an evolutionary distance between two proteins, the distances being used as additional information, beyond mere presence or absence from a genome, in comparing the phylogenetic profiles of different proteins. The method includes aligning two sequences; determining an evolution probability process by constructing a conditional probability matrix $p(aa \to aa')$, where $aa$ and $aa'$ are any amino acids, said conditional probability matrix being constructed by converting an amino acid substitution matrix from a log odds matrix to said conditional probability matrix; accounting for an observed alignment of the constructed conditional probability matrix by taking the product of the conditional probabilities for each aligned pair during the alignment of the two sequences, represented by

$$P(p) = \prod_{n} p(aa_n \to aa_n')$$

and determining an evolutionary distance $\alpha$ from the powers equation $p' = p^{\alpha}(aa \to aa')$, maximizing for $P$. In a further embodiment, the conditional probability matrix is defined by a Markov process with substitution rates, over a fixed time interval.
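The distance estimate described above can be illustrated with a small pure-Python sketch. Several simplifications are assumed and are not part of the specification: the conditional probability matrix is supplied directly rather than derived from a log odds matrix, the alphabet has only two letters, the exponent is restricted to positive integers so that the matrix power can be computed by repeated multiplication, and aligned columns containing a gap are skipped. The function names are hypothetical.

```python
import math


def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [
        [sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def log_likelihood(matrix, index, pairs):
    """Log of the product of conditional probabilities over aligned pairs."""
    return sum(math.log(matrix[index[x]][index[y]]) for x, y in pairs)


def estimate_distance(p, alphabet, seq_a, seq_b, max_alpha=50):
    """Return (alpha, log-likelihood) for the integer alpha in 1..max_alpha
    whose matrix power p**alpha best explains the aligned sequences."""
    index = {letter: i for i, letter in enumerate(alphabet)}
    # Keep only aligned columns where both symbols are in the alphabet,
    # which drops gap characters such as '-'.
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x in index and y in index]

    best_alpha, best_ll = None, -math.inf
    power = p  # holds p**alpha, starting at alpha = 1
    for alpha in range(1, max_alpha + 1):
        ll = log_likelihood(power, index, pairs)
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
        power = mat_mul(power, p)
    return best_alpha, best_ll


# Toy two-letter matrix (assumed values): each row sums to 1.
toy_matrix = [[0.9, 0.1], [0.1, 0.9]]
alpha_identical, _ = estimate_distance(toy_matrix, "AB", "AAAA", "AAAA")
alpha_diverged, _ = estimate_distance(toy_matrix, "AB", "AAAA", "AAAB")
```

With these toy values, identical sequences are best explained by the smallest exponent, while a pair with one substitution in four positions is best explained by a larger one. A fuller treatment would allow a continuous exponent and a 20-letter amino acid matrix.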
In yet a further embodiment, the invention provides a method for determining functional links between at least two polypeptides, comprising aligning a primary amino acid sequence of multiple distinct non-homologous polypeptides to the primary amino acid sequences of a plurality of proteins; for any alignment found between the primary amino acid sequences of all of such multiple distinct non-homologous polypeptides and the primary amino acid sequence of at least one such protein, outputting an indication identifying the at least one such protein as an indication of a functional link between the multiple polypeptides; obtaining data, comprising a list of polypeptides from at least two genomes; comparing the list of polypeptides from at least two genomes to form a protein phylogenetic profile for each protein or protein family, wherein the protein phylogenetic profile indicates the presence or absence of a polypeptide belonging to a particular protein family in each of the at least two genomes based on homology of the polypeptides; grouping the list of polypeptides based on similar profiles, wherein a similar profile is indicative of a functional link between the polypeptides; and comparing the functional links identified above to determine common links.
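The final step of the combined embodiment above, comparing the links found by the two approaches to determine common links, can be sketched as a set intersection. The sketch assumes that each approach reports its links as pairs of polypeptide identifiers. The function names and identifiers are hypothetical.

```python
def normalize(links):
    """Convert (a, b) pairs to unordered pairs, dropping self-links."""
    return {frozenset(pair) for pair in links if len(set(pair)) == 2}


def common_links(fusion_links, profile_links):
    """Return the links reported by both the fusion and profile methods."""
    return normalize(fusion_links) & normalize(profile_links)


# Hypothetical link lists; pair order does not matter.
example_common = common_links(
    [("A", "B"), ("B", "C")],
    [("B", "A"), ("C", "D")],
)
```

Links supported by both methods may reasonably be treated as higher-confidence predictions than links supported by only one.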
In yet another embodiment, the invention further provides for displaying the functional links as networks of related proteins comprising placing all polypeptides in a diagram such that functionally linked proteins are closer together than all other proteins and identifying proteins that fall in a cluster in the diagram as a functionally related group.
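One simple way to identify groups of related proteins from a set of functional links is sketched below. It does not compute a diagram layout. Instead it swaps in a plainer technique, connected components of the link graph, as a stand-in for "proteins that fall in a cluster." The function name and identifiers are hypothetical.

```python
from collections import defaultdict


def link_clusters(links):
    """Return connected components of the link graph as sorted lists."""
    neighbors = defaultdict(set)
    for a, b in links:
        neighbors[a].add(b)
        neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first traversal collecting every protein reachable from start.
        component = set()
        stack = [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        clusters.append(sorted(component))
    return clusters


# Hypothetical links: A-B-C form one group, D-E another.
example_clusters = link_clusters([("A", "B"), ("B", "C"), ("D", "E")])
```

A layout-based display, as described above, would additionally position the proteins so that linked ones lie close together; the grouping step shown here only answers which proteins belong together.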
The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.