1. Field of the Invention
The present invention concerns methods and systems for predicting the function of proteins. In particular, the invention relates to materials, software, automated systems, and methods for implementing the same in order to predict the function(s) of a protein. Protein function prediction includes the use of functional site descriptors for a particular protein function.
2. Background of the Invention
The following description includes information that may be useful in understanding the present invention. It is not an admission that any of the information provided herein is prior art to the presently claimed invention, nor that any of the publications specifically or implicitly referenced are prior art to that invention.
A central tenet of modern biology is that heritable genetic information resides in a nucleic acid genome, and that the information embodied in such nucleic acids directs cell function. This occurs through the expression of various genes in the genome of an organism and regulation of the expression of such genes. The pattern of which subset of genes in an organism is expressed at a particular time in a particular cell defines the phenotype, and ultimately cell and tissue types. While the least genetically complex organisms, i.e., viruses, contain on the order of 10-50 genes and require components supplied by a cell of another organism in order to reproduce, the genomes of independent, living organisms (i.e., those having a genome that encodes for all the information required for the organism to survive and reproduce) that are the least genetically complex have more than 400 genes (for example, Mycoplasma genitalium). More complex, multicellular organisms (e.g., mice or humans) contain genomes believed to be comprised of tens of thousands or more genes, each of which codes for one or more different expression products.
Most organismal genomes are comprised of double-stranded DNA. Each strand of the genomic DNA is comprised of a long polymer of the four deoxyribonucleotide bases A (adenine), T (thymine), G (guanine), and C (cytosine). Double-stranded DNA is formed by the anti-parallel, non-covalent association between two DNA strands. This association is mediated by hydrogen bonding between nucleotide bases, with specific, complementary pairing of A with T and G with C. Each gene in the genomic DNA is expressed by transcription, wherein a single-stranded RNA copy of the gene is transcribed from the double-stranded DNA. The transcribed strand of RNA is complementary to the template strand of the DNA. RNA is composed of ribonucleotide (rather than deoxyribonucleotide) bases, three of which are similar to those found in DNA: A, G, and C. The fourth RNA ribonucleotide base, uracil (U), substitutes for T found in DNA and is complementary to the A base. Following transcription, the RNAs transcribed from many genes are translated into polypeptides. The particular sequence of the nucleotide bases normally determines what protein, and hence what function(s), a particular gene encodes.
Some genes are transcribed, but not translated; thus, the final gene products of these genes are RNA molecules (for example, ribosomal RNAs, small nuclear RNAs, transfer RNAs, and ribozymes (i.e., RNA molecules having endoribonuclease catalytic activity)). However, most RNAs serve as messengers (mRNAs), and these are translated into polypeptides. The particular sequence of the ribonucleotides incorporated into an RNA as it is synthesized is dictated by the gene found in the genomic DNA from which it was transcribed. In the translation of an mRNA, the particular nucleotide sequence determines the particular amino acid sequence of the polypeptide translated therefrom. Briefly, in a coding region of an mRNA (and in its corresponding gene), each nucleotide triplet, or “codon” (of which there are 4^3, or 64, possibilities) codes for one amino acid, except that three codons code for no amino acids (each being a “stop” translation codon). Thus, the sequence of codons (dictated by the nucleotide sequence of the corresponding gene) specifies the amino acid sequence of a particular protein, and it is the amino acid sequence that ultimately determines the three-dimensional structure of the protein. Significantly, three-dimensional structure dictates the particular biological function(s) of any biomolecule, including proteins.
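The codon arithmetic described above can be illustrated with a minimal Python sketch. It enumerates all nucleotide triplets, maps them to amino acids using the standard genetic code, and translates a short mRNA until a stop codon is reached. The function name `translate` and the example sequence are illustrative assumptions and form no part of the original disclosure.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon, with codons enumerated in
# U, C, A, G order at each position; "*" marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Build the codon -> amino acid lookup for all 4**3 = 64 triplets.
CODON_TABLE = {
    "".join(triplet): aa
    for triplet, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate an mRNA coding region (hypothetical helper) up to the first stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))                      # number of possible codons
    print(sorted(c for c, aa in CODON_TABLE.items() if aa == "*"))  # stop codons
    print(translate("AUGGCUUGGUAA"))             # made-up example sequence
```

The sketch assumes the reading frame begins at the first nucleotide supplied; locating the correct frame in a real transcript is a separate problem.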
The elegant simplicity of the foregoing schema is obscured by the complexity and size of the genomes found in living systems. For example, the haploid human genome comprises about 3×10^9 (three billion) nucleotides spread across 23 chromosomes. However, it is currently estimated that less than 5% of this encodes the approximately 80,000-100,000 different protein-coding genes believed to be encoded by the human genome. Because of its tremendous size, to date only a portion of the human genome has been sequenced and deposited in genome sequence databases, and the positions of many genes and their exact nucleotide sequences remain unknown. Moreover, the biological function(s) of the gene products encoded by many of the genes sequenced so far remain unknown. Similar situations exist with respect to the genomes of many other organisms.
Notwithstanding such complexities, numerous genome sequence efforts designed to determine the exact sequence of the nucleotides found in genomic DNA of various organisms are underway and significant progress has been made. For example, the Human Genome Project began with the specific goal of obtaining the complete sequence of the human genome and determining the biochemical function(s) of each gene. To date, the project has resulted in sequencing a substantial portion of the human genome, and is on track for its scheduled completion in the near future. At least twenty-one other genomes have already been sequenced, including, for example, M. genitalium, M. jannaschii, H. influenzae, E. coli, and yeast (S. cerevisiae). Significant progress has also been made in sequencing the genomes of model organisms, such as mouse, C. elegans, and D. melanogaster. Several databases containing genomic information annotated with some functional information are maintained by different organizations, and are accessible via the internet.
Such sequencing projects result in vast amounts of nucleotide sequence information, which is typically deposited in genome sequence databases. However, these raw data (much of it being known only at the cDNA level), being devoid of corresponding information about genes and protein structure or function, are in and of themselves of extremely limited use (Koonin, et al. (1998), Curr. Opin. Struct. Biol., vol. 8:355-363). Thus, the practical exploitation of the vast numbers of sequences in such genome sequence databases is crucially dependent on the ability to identify genes and, for example, the function(s) of gene-encoded proteins.
To maximize the utility of such nucleotide sequence information, it must be interpreted. For example, it is important to understand where each sequence is located in the genome, and what biological function(s), if any, the sequence encodes, i.e., what is the purpose of the sequence or, if transcribed (or transcribed and translated), the resulting product, in a biological system? For example, is the sequence a regulatory region or, if it is transcribed (or transcribed and translated), does the gene product bind to another molecule, regulate a cellular process, or catalyze a chemical reaction?
To answer these questions, significant effort has been directed towards understanding or describing the biological function(s) coded for in each nucleotide sequence. Predicting the function(s) of biomolecules encoded by genes, particularly proteins, is most often done by sequence comparison to known structures. The basis of this approach is the commonly accepted notion that similar sequences must have a common ancestor, and would therefore have similar structures and related functions. Accordingly, algorithms have been developed to analyze what a particular nucleotide sequence encodes, e.g., a regulatory region, an open reading frame (ORF), particularly for protein sequences, or a non-translated RNA. See, e.g., “Frames” (Genetics Computer Group, Madison, Wis.), which is used for identifying ORFs. For sequences predicted or determined to be ORFs, it is possible to determine the amino acid sequence of the protein encoded thereby using simple analytical tools well known in the art. For example, see “Translate” (Genetics Computer Group, Madison, Wis.). However, to date determination of the primary structure of a protein in and of itself provides little, if any, functional information about the protein or its corresponding gene.
A number of methods have been developed in an attempt to glean functional information about a deduced amino acid sequence. The most common computational methods include sequence alignment and analysis of local sequence motifs, although these methods are limited by the extent of sequence similarity between sequences of unknown and known function. Additionally, these methods increasingly fail as sequence identity decreases. Other recently developed computational methods include whole genome comparison (Himmelreich et al., 1997), and analysis of gene clustering (Himmelreich et al., 1997; Tamames et al., 1997). Others have developed experimental methods to analyze protein function on a genome-wide basis. These methods include, for example, “two hybrid screens” (Fromont-Racine et al., 1997) and genome-wide scanning of gene expression patterns (Ito and Sakaki, 1996).
Sequence alignment is the method most commonly used in attempts to identify protein function from amino acid sequence. In this method, the extent of amino acid sequence identity between an experimental sequence and one or more sequences whose function(s) is(are) known is computed. Alignment methods such as BLAST (Altschul et al., 1990), BLITZ (MPsrch) (Sturrock and Collins, 1993), and FASTA (Pearson and Lipman, 1988) are typically employed for this purpose. Assignment of function is based on the theory that significant sequence identity strongly predicts functional similarity (Fitch, 1970?).
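By way of illustration only, the identity computation described above can be sketched in Python. The sketch assumes the two sequences have already been aligned (for example, by one of the programs named above) with "-" marking gaps, and it defines percent identity over aligned, non-gap columns; other definitions are in use. The function name `percent_identity` and the example strings are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned columns in which neither sequence has a gap.

    Assumes both inputs come from the same pairwise alignment and use "-" for gaps.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            continue  # skip gapped columns
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


if __name__ == "__main__":
    # Made-up aligned fragments: 5 identical residues out of 6 compared columns.
    print(round(percent_identity("ACDE-FG", "ACDQWFG"), 1))
```

A score computed this way says nothing about statistical significance, which is why short, high-identity matches are treated with caution.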
However, because of the frequent lack of substantial sequence similarity among proteins, these methods often fail (Delseny et al., 1997; Dujon, 1996). Additionally, newly discovered amino acid or nucleotide sequences frequently do not match any known or available sequence. Indeed, many protein amino acid sequences (from 30-60% or more) that have been deduced from genome project-derived nucleotide sequence information represent novel protein families with unknown function, and for which no homologous sequence can be identified (Delseny et al., 1997; Dujon, 1996). Furthermore, such conventional sequence alignment methods cannot consistently detect functional and structural similarities, particularly when sequence identity is less than about 25-30% (Hobohm and Sander, 1995). In practice, roughly half of a given genome falls into one of these two categories: no homology, or less than about 25-30% homology, with a known sequence (Bork and Koonin (1998), Nature Genet., vol. 18: 313-318; E. V. Koonin (1997), Curr. Biol., vol. 7:R656-R659). It is also important to understand that matches with 50% or more identity over a 40-amino acid or smaller stretch of sequences often occur by chance, and if other information is lacking, relationships between such proteins are viewed with caution (Pearson, 1996).
In an attempt to overcome some of the problems associated with employing sequence alignments to help predict protein function, several groups have developed databases of short, local sequence patterns (or “motifs”) designed to help identify a given function or activity of a protein. These databases, notably “PROSITE” (Bairoch et al., 1997, Nucl. Acids Res., vol. 25:31-36), “Blocks” (Henikoff and Henikoff, 1994, Genomics, vol. 19:97-107), and “PRINTS” (Attwood and Beck, 1994, Nucl. Acids Res., vol. 22:3590-3596), use local sequence information (i.e., the sequence of several contiguous amino acid residues), as opposed to entire amino acid sequences, in order to try to identify sequence patterns that are specific for a given function.
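A local sequence motif of the kind stored in such databases can be thought of as a pattern over contiguous residues. The following Python sketch converts a small subset of PROSITE-style pattern syntax into a regular expression and scans a sequence with it. The converter handles only the elements shown (single residues, bracketed alternatives, braced exclusions, "x", and repeat counts); the pattern and the test sequence are illustrative assumptions, and the helper names `prosite_to_regex` and `find_motif` are hypothetical.

```python
import re

# One pattern element: x, a residue, [alternatives], or {exclusions},
# optionally followed by a repeat count such as (4) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simplified PROSITE-style pattern into a Python regex string."""
    parts = []
    for element in pattern.split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, lo, hi = m.groups()
        if core == "x":
            core = "."                        # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"    # any residue except those listed
        if lo:
            core += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(pattern, sequence):
    """Return (0-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    pattern = "[AG]-x(4)-G-K-[ST]"            # illustrative pattern only
    print(prosite_to_regex(pattern))
    print(find_motif(pattern, "MAGPSGSGKSTLV"))  # made-up sequence
```

Because the pattern is matched against the linear sequence alone, residues that are distant in sequence but adjacent in the folded protein cannot be captured this way.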
Function prediction based on local sequence signatures, however, is plagued by the deficiencies that also limit the use of sequence alignment algorithms to predict protein function. Specifically, as sequence diversity within protein families increases, conventional databases of local sequence signatures may no longer recognize experimental protein sequences as belonging to a functional family (Fetrow and Skolnick, 1998, J. Mol. Biol., vol. 281:949-968). In proteins that are distantly related in terms of evolution, it is expected that only those residues required for the specific biological function of a protein will be conserved. That conservation will include not only sequence conservation, but also three-dimensional structural conservation (Holm and Sander, 1994, Proteins, vol. 19:165-173). However, local sequence motifs cannot recognize conserved three-dimensional structure; motifs can only recognize local sequence. Consequently, local sequence motifs may fail to be accurate predictors of protein function because function derives from three-dimensional structure. In other words, local sequence motif analysis is limited where function is dependent upon non-local residues, i.e., amino acids disposed in different regions of a protein's primary structure.
Many functional sites in proteins are known to comprise non-local residues. However, these residues are brought into functional association as a result of the protein assuming its folded three-dimensional structure, where different regions of the protein (in terms of linear amino acid sequence) may come together. For example, the three-dimensional structure of urease (a protein involved in nucleotide metabolism) was recently compared to those of adenosine deaminase and phosphotriesterase (Holm and Sander, 1997b), proteins that are also involved in nucleotide metabolism. Previous one-dimensional sequence comparisons failed to detect any relationship between these proteins; however, comparison of their three-dimensional structures showed conservation of active site structure. This same active site geometry was later observed in other nucleotide metabolism enzymes which exhibited an even greater diversity of overall sequence and tertiary structure (Holm and Sander, 1997b). In another example, it was determined that critical cysteine residues in the catalytic domain of ribonucleotide reductases were conserved across kingdom boundaries (Tauer and Benner, 1997). However, sequence alignment analysis did not reveal this relatedness because of the non-local nature of the conserved catalytic cysteine residues.
Various efforts have been made to overcome these limitations by, for example, extending local sequence patterns to include structural information. The goal of including such added information is to improve the ability of local sequence patterns to both detect distantly related proteins and differentiate between true and false positives. See, e.g., Kasuya, A. and Thornton, J. M., J. Mol. Biol., vol. 286: 1673-1691 (1999). Others have postulated that the development of databases of 3D-templates, such as those that currently exist for local protein sequence motifs, may help to identify the functions of new protein structures as they are determined and pinpoint their functionally important regions. For example, Wallace, et al. (Protein Science, vol. 5:1001-1013 (1996)) reported the development of a 3D coordinate template for the Ser-His-Asp catalytic triad in serine proteases and triacylglycerol lipases. Initially, those authors selected a single “seed” catalytic triad from α-lytic proteinase 1lpr (see Bone, et al., Biochemistry, vol. 30:10388-10398 (1991)), and coordinate positions were determined for all of the Ser and Asp side chain atoms, as well as for the positions of the atoms in the reference His residue. Root mean square distances (RMSDs) were then determined for all Ser and Asp side chain atoms in a set of serine proteases whose structures were also then known at atomic resolution. This analysis revealed that the positioning of a single oxygen atom in each of the Asp and Ser side chains was highly conserved. Using these data, a 3D template was developed for serine protease activity using the identity of three amino acids, namely Ser, His, and Asp, and the 3D coordinate positions (to an RMSD cut-off of 2 Å) for the functional oxygen atoms in the Ser and Asp side chains and the non-hydrogen atoms of the His side chain. The 3D template was then applied to a test set of high resolution protein structures drawn from the PDB database.
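The RMSD comparison at the heart of such a 3D template can be sketched as follows. This is a simplified illustration, not the procedure of Wallace et al.: it assumes the candidate atoms have already been superimposed onto the template's coordinate frame and are listed in the same order as the template atoms. The coordinates are made up, and the function names `rmsd` and `matches_template` are hypothetical; only the 2 Å cut-off is taken from the description above.

```python
import math


def rmsd(coords_a, coords_b):
    """Root mean square distance between two equally ordered lists of (x, y, z) points.

    Assumes the two point sets are already in the same coordinate frame.
    """
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    total = sum(
        (ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
        for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b)
    )
    return math.sqrt(total / len(coords_a))


def matches_template(template, candidate, cutoff=2.0):
    """True if the candidate atoms lie within the RMSD cut-off (in angstroms) of the template."""
    return rmsd(template, candidate) <= cutoff


if __name__ == "__main__":
    template = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]   # made-up positions
    near = [(x + 1.0, y, z) for x, y, z in template]                 # every atom shifted by 1 A
    far = [(x + 3.0, y, z) for x, y, z in template]                  # every atom shifted by 3 A
    print(matches_template(template, near), matches_template(template, far))
```

The need for an atom-by-atom correspondence in this comparison shows why such templates demand accurately placed side chain atoms.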
A major shortcoming of the foregoing 3D-template approach (see also Barth, et al. (1993) Drug Design and Discovery, vol. 10:297-317; Gregory, et al. (1993), Protein Eng., vol. 6, no. 1:29-35; Artymiuk, et al. (1994), J. Mol. Biol., vol. 243:327-344; and Fischer, et al. (1994), Protein Sci., vol. 3:769-778), however, is that it requires detailed knowledge of atomic positions (particularly for side chain atoms) in both the template structures and the test protein structure. This makes these 3D templates applicable only to high-resolution protein structures determined by x-ray crystallography or NMR spectroscopy. Less than atomic resolution structures and inexact models produced by current protein structure prediction algorithms cannot be analyzed by these methods.
In sum, conventional sequence-based function prediction methods fall short in the prediction of protein function from nucleotide or amino acid sequence information, in part because the technology frequently relies only on sequence information. Current structure-based methods said to have some utility for function prediction also fail in the analysis of sequences of unknown function, including genome sequences, because high-resolution structures, and their accompanying high level of atomic detail, are required. As such, there remains a need for better methods for predicting protein structure and function.
The inventions described and claimed herein solve these needs by providing novel methods and systems for predicting protein function from sequence. Various methods described and claimed herein use sequence and structure information and apply this information to protein structures, particularly inexact models of protein structure, that can be computationally derived from protein or nucleic acid sequences. Using their methods, the inventors have discovered that it is not necessary to accurately predict the overall three-dimensional structure of a particular protein of interest in order to predict its function. Instead, prediction of biological function using the methods described and claimed herein requires only an approximation of the three-dimensional orientation of two or more amino acid residues in a region responsible for the particular function of the protein under investigation. As such, this invention overcomes the problems and limitations of the methods previously utilized in an attempt to identify protein function from either sequence or structure. As those in the art will appreciate, such methods can routinely be adapted with respect to various protein functional sites in order to predict protein function. A more detailed description of the invention is provided below.
3. Definitions
The following terms have the following meanings when used herein and in the appended claims. Terms not specifically defined herein have their art recognized meaning.
As used herein, an “amino acid” is a molecule (see FIG. 1) having the structure wherein a central carbon atom (the alpha (α)-carbon atom) is linked to a hydrogen atom, a carboxylic acid group (the carbon atom of which is referred to herein as a “carboxyl carbon atom”), an amino group (the nitrogen atom of which is referred to herein as an “amino nitrogen atom”), and a side chain group, R. When incorporated into a peptide, polypeptide, or protein, an amino acid loses one or more atoms of its amino and carboxylic groups in the dehydration reaction that links one amino acid to another. As a result, when incorporated into a protein, an amino acid is referred to as an “amino acid residue.” In the case of naturally occurring proteins, an amino acid residue's R group differentiates the 20 amino acids from which proteins are synthesized, although one or more amino acid residues in a protein may be derivatized or modified following incorporation into protein in biological systems (e.g., by glycosylation and/or by the formation of cystine through the oxidation of the thiol side chains of two non-adjacent cysteine amino acid residues, resulting in a disulfide covalent bond that frequently plays an important role in stabilizing the folded conformation of a protein, etc.). As those in the art will appreciate, non-naturally occurring amino acids can also be incorporated into proteins, particularly those produced by synthetic methods, including solid state and other automated synthesis methods.
Examples of such amino acids include, without limitation, α-amino isobutyric acid, 4-amino butyric acid, L-amino butyric acid, 6-amino hexanoic acid, 2-amino isobutyric acid, 3-amino propionic acid, ornithine, norleucine, norvaline, hydroxyproline, sarcosine, citrulline, cysteic acid, t-butylglycine, t-butylalanine, phenylglycine, cyclohexylalanine, β-alanine, fluoro-amino acids, designer amino acids (e.g., β-methyl amino acids, α-methyl amino acids, Nα-methyl amino acids) and amino acid analogs in general. In addition, when an α-carbon atom has four different groups (as is the case with the 20 amino acids used by biological systems to synthesize proteins, except for glycine, which has two hydrogen atoms bonded to the α-carbon atom), two different enantiomeric forms of each amino acid exist, designated D and L. In mammals, only L-amino acids are incorporated into naturally occurring polypeptides. Of course, the instant invention envisions proteins incorporating one or more D- and L-amino acids, as well as proteins comprised of just D- or L-amino acid residues.
Herein, the following abbreviations may be used for the following amino acids (and residues thereof): alanine (Ala, A); arginine (Arg, R); asparagine (Asn, N); aspartic acid (Asp, D); cysteine (Cys, C); glycine (Gly, G); glutamic acid (Glu, E); glutamine (Gln, Q); histidine (His, H); isoleucine (Ile, I); leucine (Leu, L); lysine (Lys, K); methionine (Met, M); phenylalanine (Phe, F); proline (Pro, P); serine (Ser, S); threonine (Thr, T); tryptophan (Trp, W); tyrosine (Tyr, Y); and valine (Val, V). Non-polar (hydrophobic) amino acids include alanine, leucine, isoleucine, valine, proline, phenylalanine, tryptophan, and methionine. Neutral amino acids include glycine, serine, threonine, cysteine, tyrosine, asparagine, and glutamine. Positively charged (basic) amino acids include arginine, lysine and histidine. Negatively charged (acidic) amino acids include aspartic acid and glutamic acid.
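For software implementations, the abbreviations and groupings listed above map naturally onto simple lookup tables. The following Python sketch encodes them exactly as stated in the preceding paragraph; the names `THREE_TO_ONE`, `CLASSES`, and `classify` are hypothetical and chosen for illustration only.

```python
# Three-letter to one-letter abbreviations for the 20 amino acids listed above.
THREE_TO_ONE = {
    "Ala": "A", "Arg": "R", "Asn": "N", "Asp": "D", "Cys": "C",
    "Gly": "G", "Glu": "E", "Gln": "Q", "His": "H", "Ile": "I",
    "Leu": "L", "Lys": "K", "Met": "M", "Phe": "F", "Pro": "P",
    "Ser": "S", "Thr": "T", "Trp": "W", "Tyr": "Y", "Val": "V",
}

# Groupings as given in the text, keyed by one-letter code.
CLASSES = {
    "non-polar": set("ALIVPFWM"),
    "neutral": set("GSTCYNQ"),
    "basic": set("RKH"),
    "acidic": set("DE"),
}


def classify(one_letter):
    """Return the class name for a one-letter amino acid code."""
    for name, members in CLASSES.items():
        if one_letter in members:
            return name
    raise KeyError(f"unknown amino acid code: {one_letter!r}")


if __name__ == "__main__":
    for three in ("Lys", "Asp", "Trp", "Ser"):
        print(three, THREE_TO_ONE[three], classify(THREE_TO_ONE[three]))
```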
As used herein, a “β-carbon atom” refers to the carbon atom (if present) in the R group of the side chain of an amino acid (or amino acid residue) that is covalently bonded to the α-carbon atom of that amino acid (or residue). See FIG. 1. For purposes of this invention, glycine is the only naturally occurring amino acid found in mammalian proteins that does not contain a β-carbon atom.
A “biomolecule” refers to any molecule (including synthetic molecules) produced by a cell, found within a cell or organism, or which can be introduced into a cell or organism, or any derivative of such a molecule, and any other molecule capable of performing or having a biological function. Representative examples of biomolecules include nucleic acids and proteins. A “synthetic” biomolecule is one that has been prepared, in whole or part, through the use of one or more synthetic chemical reactions.
“Protein” refers to any polymer of two or more individual amino acids (whether or not naturally occurring) linked via a peptide bond, which occurs when the carboxyl carbon atom of the carboxylic acid group bonded to the α-carbon of one amino acid (or amino acid residue) becomes covalently bound to the amino nitrogen atom of the amino group bonded to the α-carbon of an adjacent amino acid. See FIG. 1. These peptide bond linkages, and the atoms comprising them (i.e., α-carbon atoms, carboxyl carbon atoms (and their substituent oxygen atoms), and amino nitrogen atoms (and their substituent hydrogen atoms)) form the “polypeptide backbone” of the protein. In simplest terms, the polypeptide backbone shall be understood to refer to the amino nitrogen atoms, α-carbon atoms, and carboxyl carbon atoms of the protein, although two or more of these atoms (with or without their substituent atoms) may also be represented as a pseudoatom. Indeed, any representation of a polypeptide backbone that can be used in a functional site descriptor as described herein will be understood to be included within the meaning of the term “polypeptide backbone.”
The term “protein” is understood to include the terms “polypeptide” and “peptide” (which, at times, may be used interchangeably herein) within its meaning. In addition, proteins comprising multiple polypeptide subunits (e.g., DNA polymerase III, RNA polymerase II) or other components (for example, an RNA molecule, as occurs in telomerase) will also be understood to be included within the meaning of “protein” as used herein. Similarly, fragments of proteins and polypeptides are also within the scope of the invention and may be referred to herein as “proteins.”
In biological systems (be they in vivo or in vitro, including cell-free, systems), the particular amino acid sequence of a given protein (i.e., the polypeptide's “primary structure,” when written from the amino-terminus to carboxy-terminus) is determined by the nucleotide sequence of the coding portion of an mRNA, which is in turn specified by genetic information, typically genomic DNA (which, for purposes of this invention, is understood to include organelle DNA, for example, mitochondrial DNA and chloroplast DNA). Of course, any type of nucleic acid which constitutes the genome of a particular organism (e.g., double-stranded DNA in the case of most animals and plants, single or double-stranded RNA in the case of some viruses, etc.) is understood to code for the gene product(s) of the particular organism. Messenger RNA is translated on a ribosome, which catalyzes the polymerization of a free amino acid, the particular identity of which is specified by the particular codon (with respect to mRNA, three adjacent A, G, C, or U ribonucleotides in the mRNA's coding region) of the mRNA then being translated, to a nascent polypeptide. Recombinant DNA techniques have enabled the large-scale synthesis of polypeptides (e.g., human insulin, human growth hormone, erythropoietin, granulocyte colony stimulating factor, etc.) having the same primary sequence as when produced naturally in living organisms. In addition, such technology has allowed the synthesis of analogs of these and other proteins, which analogs may contain one or more amino acid deletions, insertions, and/or substitutions as compared to the native proteins. Recombinant DNA technology also enables the synthesis of entirely novel proteins.
In non-biological systems (e.g., those employing solid state synthesis), the primary structure of a protein (which also includes disulfide (cystine) bond locations) can be determined by the user. As a result, polypeptides having a primary structure that duplicates that of a biologically produced protein can be achieved, as can analogs of such proteins. In addition, completely novel polypeptides can also be synthesized, as can proteins incorporating non-naturally occurring amino acids.
In a protein, the peptide bonds between adjacent amino acid residues are resonance hybrids of two different electron isomeric structures, wherein a bond between a carbonyl carbon (the carbon atom of the carboxylic acid group of one amino acid after its incorporation into a protein) and a nitrogen atom of the amino group of the α-carbon of the next amino acid places the carbonyl carbon approximately 1.33 Å away from the nitrogen atom of the next amino acid, a distance about midway between the distances that would be expected for a double bond (about 1.25 Å) and a single bond (about 1.45 Å). This partial double bond character prevents free rotation of the carbonyl carbon and amino nitrogen about the bond therebetween under physiological conditions. As a result, the atoms bonded to the carbonyl carbon and amino nitrogen reside in the same plane, and provide discrete regions of structural rigidity, and hence conformational predictability, in proteins.
Beyond the peptide bond, each amino acid residue contributes two additional single covalent bonds to the polypeptide chain. While the peptide bond limits rotational freedom of the carbonyl carbon and the amino nitrogen of adjacent amino acids, the single bonds of each residue (between the α-carbon and amino nitrogen (the phi (φ) bond) and between the α-carbon and carbonyl carbon (the psi (ψ) bond) of each amino acid), have greater rotational freedom. For example, the rotational angles for φ and ψ bonds for certain common regular secondary structures are listed in the following table:
Similarly, the single bond between an α-carbon and its attached R-group provides limited rotational freedom. Collectively, such structural flexibility enables a number of possible conformations to be assumed at a given region within a polypeptide. As discussed in greater detail below, the particular conformation actually assumed depends on thermodynamic considerations, with the lowest energy conformation being preferred.
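A rotational angle about a bond of the kind discussed above is a torsion (dihedral) angle defined by four consecutive atoms. The following Python sketch computes such an angle from four Cartesian positions using a standard atan2-based formula. The function name `dihedral` and the test coordinates are illustrative assumptions; the coordinates are idealized points, not measured backbone atoms.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Torsion angle in degrees about the p1-p2 bond, in the range (-180, 180]."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)          # normal to the plane containing p0, p1, p2
    n2 = _cross(b2, b3)          # normal to the plane containing p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


if __name__ == "__main__":
    # Idealized geometries: eclipsed (cis), perpendicular, and opposed (trans).
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (1, 0, 1)))
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1)))
    print(dihedral((1, 0, 0), (0, 0, 0), (0, 0, 1), (-1, 0, 1)))
```

Applied to consecutive backbone atoms of a residue and its neighbors, the same function would yield that residue's backbone rotational angles.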
In addition to primary structure, proteins also have secondary, tertiary, and, in multisubunit proteins, quaternary structure. Secondary structure refers to local conformation of the polypeptide chain, with reference to the covalently linked atoms of the peptide bonds and α-carbon linkages that string the amino acids of the protein together. Side chain groups are not typically included in such descriptions. Representative examples of secondary structures include α helices, parallel and anti-parallel β structures, and structural motifs such as helix-turn-helix, β-α-β, the leucine zipper, the zinc finger, the β-barrel, and the immunoglobulin fold. Movement of such domains relative to each other often relates to biological function and, in proteins having more than one function, different binding or effector sites can be located in different domains. Tertiary structure concerns the total three-dimensional structure of a protein, including the spatial relationships of amino acid side chains and the geometric relationship of different regions of the protein. Quaternary structure relates to the structure and non-covalent association of different polypeptide subunits in a multisubunit protein.
A “functional site” refers to any site in a protein that has a function. Representative examples include active sites (i.e., those sites in catalytic proteins where catalysis occurs), protein-protein interaction sites, sites for chemical modification (e.g., glycosylation and phosphorylation sites), and ligand binding sites. Ligand binding sites include, but are not limited to, metal binding sites, co-factor binding sites, antigen binding sites, substrate channels and tunnels, and substrate binding sites. In an enzyme, a ligand binding site that is a substrate binding site may also be an active site.
A “pseudoatom” refers to a position in three dimensional space (represented typically by an x, y, and z coordinate set) that represents the average (or weighted average) position of two or more atoms in a protein or amino acid. Representative examples of a pseudoatom include an amino acid side chain center of mass and the center of mass (or, alternatively, the average position) of an α-carbon atom and the carboxyl atom bonded thereto.
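By way of illustration only, the pseudoatom definition above can be sketched in Python as follows. The function names, the coordinate values, and the atomic masses supplied by the caller are hypothetical and are not taken from this specification; the sketch simply computes an unweighted average position and a mass-weighted center of mass for two or more atoms.

```python
from typing import Sequence, Tuple

Coord = Tuple[float, float, float]


def average_position(coords: Sequence[Coord]) -> Coord:
    """Unweighted average (x, y, z) position of two or more atoms."""
    if len(coords) < 2:
        raise ValueError("a pseudoatom represents two or more atoms")
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def center_of_mass(coords: Sequence[Coord], masses: Sequence[float]) -> Coord:
    """Mass-weighted average position (one weight per atom)."""
    if len(coords) != len(masses):
        raise ValueError("one mass is required per atom")
    if len(coords) < 2:
        raise ValueError("a pseudoatom represents two or more atoms")
    total = sum(masses)
    return tuple(
        sum(m * c[i] for c, m in zip(coords, masses)) / total for i in range(3)
    )


# Hypothetical usage: two atoms, the second three times heavier than the first.
pseudoatom = center_of_mass([(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)], [1.0, 3.0])
```

A weighted average is used for the center of mass because heavier atoms pull the pseudoatom position toward themselves; the unweighted form corresponds to the "average position" alternative mentioned above.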
A “reduced model” refers to a three-dimensional structural model of a protein wherein fewer than all heavy atoms (e.g., carbon, oxygen, nitrogen, and sulfur atoms) of the protein are represented. For example, a reduced model might consist of just the α-carbon atoms of the protein, with each amino acid connected to the subsequent amino acid by a virtual bond. Other examples of reduced protein models include those in which only the α-carbon atoms and side chain centers of mass of each amino acid are represented, or where only the polypeptide backbone is represented.
A “geometric constraint” refers to a spatial representation of an atom or group of atoms (e.g., an amino acid, the R-group of an amino acid, the center of mass of an R-group of an amino acid, a pseudoatom, etc.). Accordingly, such a constraint can be represented by coordinates in three dimensions, for example, as having a certain position, or range of positions, along x, y, and z coordinates (i.e., a “coordinate set”). Alternatively, a geometric constraint can be represented as a distance, or range of distances, between a particular atom (or group of atoms, etc.) and one or more other atoms (or groups of atoms, etc.). Geometric constraints can also be represented by various types of angles, including the angle of bonds (particularly covalent bonds, e.g., φ bonds and ψ bonds) between atoms in an amino acid residue, between atoms in different amino acid residues, and between atoms in an amino acid residue of a protein and another molecule, e.g., a ligand, with ranges for each angle being preferred.
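As a non-limiting sketch of how distance and angle constraints of the kind just described might be evaluated, the following Python computes an interatomic distance, the angle formed at a central atom by two neighboring atoms, and a simple range test. The function names and the example coordinates are assumptions made for illustration; the sketch covers a plain three-point angle and does not compute φ or ψ angles.

```python
import math


def distance(a, b):
    """Euclidean distance between two (x, y, z) positions."""
    return math.dist(a, b)


def angle_deg(a, b, c):
    """Angle in degrees at vertex b formed by the points a-b-c."""
    v1 = [a[i] - b[i] for i in range(3)]
    v2 = [c[i] - b[i] for i in range(3)]
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    # Clamp to [-1, 1] to guard against floating-point round-off.
    cos_theta = max(-1.0, min(1.0, dot / (n1 * n2)))
    return math.degrees(math.acos(cos_theta))


def within(value, low, high):
    """True when a measured value falls inside a constraint range."""
    return low <= value <= high


# Hypothetical usage: a distance range of 4.5 to 5.5 and an angle range of 80 to 100 degrees.
d = distance((0.0, 0.0, 0.0), (3.0, 4.0, 0.0))
theta = angle_deg((1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
satisfied = within(d, 4.5, 5.5) and within(theta, 80.0, 100.0)
```

Expressing each constraint as a range rather than a single value mirrors the stated preference for ranges of distances and angles.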
A “conformational constraint” refers to the presence of a particular protein conformation, for example, an α-helix, parallel and antiparallel β strands, leucine zipper, zinc finger, etc. In addition, conformational constraints can include amino acid sequence information without additional structural information. As an example, “-C-X-X-C-” is a conformational constraint indicating that two cysteine residues must be separated by two other amino acid residues, the identities of each of which are irrelevant in the context of this particular constraint.
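For illustration, a sequence-only constraint such as the C-X-X-C example above can be tested with a regular expression. The sketch below assumes that the protein sequence is supplied as a string of upper-case one-letter amino acid codes; the function name and the example sequences are hypothetical.

```python
import re

# Lookahead form so that overlapping occurrences are also reported.
CXXC = re.compile(r"(?=C[A-Z]{2}C)")


def find_cxxc(sequence: str) -> list:
    """Return 0-based start positions of every C-X-X-C occurrence."""
    return [m.start() for m in CXXC.finditer(sequence)]


# Hypothetical usage on a short made-up sequence.
positions = find_cxxc("MACGPCKL")
```

Adding 1 to each returned position gives residue numbering counted from the amino terminus.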
An “identity constraint” refers to a constraint of a functional site descriptor that indicates the identity of an amino acid residue at a particular location in a protein (determined by counting the number of amino acid residues in the protein from its amino terminus up to and including the residue in question). As those in the art will appreciate, comparison between related proteins may reveal that the identity of a particular amino acid residue at a given amino acid position in a protein is not entirely conserved, i.e., different amino acid residues may be present at a particular amino acid position in related proteins. In such instances or, alternatively, when an artisan desires to relax the constraint, two or more alternative amino acid residue identities can be provided for a particular identity constraint of a functional site descriptor. Of course, in such cases the invention also envisions different functional site descriptors for the particular biological function that differ by employing different amino acid residue identities (or sets of identities) for the corresponding position. For example, where it is determined by sequence alignment that related proteins have one of two amino acid residues at a particular position in the functional site, a single functional site descriptor therefor may specify the two alternatives. Alternatively, two different functional site descriptors may be generated that differ only with respect to the identity constraint at that position. Similar strategies can be employed with regard to other constraints used in a functional site descriptor according to the invention.
To “relax” a constraint refers to the inclusion of a user-defined variance therein. The degree of relaxation will depend on the particular constraint and its application. As those in the art will appreciate, functional site descriptors for the same biological function can be developed wherein different degrees of relaxation for one or more constraints are what differentiate one such descriptor from another.
Protein structures useful in the practice of the invention can be of different quality. The highest quality structures are obtained by experimental structure determination methods based on x-ray crystallography and NMR spectroscopy. In x-ray crystallography, “high resolution” structures are those wherein atomic positions are determined at a resolution of about 2 Å or less, and enable the determination of the three-dimensional positioning of each atom (or each non-hydrogen atom) of a protein. “Medium resolution” structures are those wherein atomic positioning is determined at about the 2-4 Å level, while “low resolution” structures are those wherein the atomic positioning is determined in about the 4-8 Å range. Herein, protein structures that have been determined by x-ray crystallography or NMR may be referred to as “experimental structures,” as compared to those determined by computational methods, i.e., derived from the application of one or more computer algorithms to a primary amino acid sequence to predict protein structure.
As alluded to above, protein structures can also be determined entirely by computational methods, including, but not limited to, homology modeling, threading, and ab initio methods. Often, models produced by such computational methods are “reduced” models, i.e., the predicted structures (or “models”) do not include all non-hydrogen atoms in the protein. Indeed, many reduced models only predict structures that show the polypeptide backbone of the protein, and such models are preferred in the practice of the invention. Of course, it is understood that once a protein structure based on a reduced model has been generated, all or a portion of it may be further refined to include additional predicted detail, up to including all atom positions.
Computational methods usually produce lower quality structures than experimental methods, and the models produced by computational methods are often called “inexact models.” While not necessary in order to practice the instant methods, the precision of these predicted models can be determined using a benchmark set of proteins whose structures are already known. The predicted model for each biomolecule may then be compared to a corresponding experimentally determined structure. The difference between the predicted model and the experimentally determined structure is quantified via a measure called “root mean square deviation” (RMSD). A model having an RMSD of about 2.0 Å or less as compared to a corresponding experimentally determined structure is considered “high quality”. Frequently, predicted models have an RMSD of about 2.0 Å to about 6.0 Å when compared to one or more experimentally determined structures, and are called “inexact models”. As those in the art will appreciate, RMSDs can also be determined for one or more atomic positions when two or more experimental structures have been generated for the same protein.
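By way of example only, the RMSD comparison described above can be sketched in Python as follows. The sketch assumes that the predicted model and the experimental structure list the same atoms in the same order and have already been superimposed; it performs no optimal superposition. The function names are hypothetical, and the 2.0 Å and 6.0 Å cutoffs are the approximate values recited above, treated here as hard thresholds purely for illustration.

```python
import math


def rmsd(model, reference):
    """Root mean square deviation between paired (x, y, z) positions, in the input units."""
    if not model or len(model) != len(reference):
        raise ValueError("both structures must list the same, non-empty set of atoms")
    squared = sum(
        (p[i] - q[i]) ** 2 for p, q in zip(model, reference) for i in range(3)
    )
    return math.sqrt(squared / len(model))


def quality_label(rmsd_angstroms):
    """Map an RMSD in angstroms onto the approximate categories described in the text."""
    if rmsd_angstroms <= 2.0:
        return "high quality"
    if rmsd_angstroms <= 6.0:
        return "inexact"
    return "outside the ranges discussed"


# Hypothetical usage: every atom of the model is displaced by 5 units.
value = rmsd([(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0), (3.0, 4.0, 0.0)])
label = quality_label(value)
```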
The object of this invention is to enable one or more functions of a protein to be predicted from structural information, for example, from computationally derived models of protein structure (including inexact models) produced from deduced primary amino acid sequences, for example, as may be derived from the nucleotide sequence of a novel gene obtained in the course of genome sequencing projects.
The present invention comprises a number of objects, aspects, and embodiments.
One aspect of the invention concerns functional site descriptors (FSDs) that define spatial configurations for protein functional sites that correspond with particular biological functions. It is known that function derives from structure. A functional site descriptor according to the invention provides a three-dimensional representation of a protein functional site. In some embodiments, the functional site represented by an FSD is a ligand binding domain (e.g., a domain that binds a ligand, for example, a substrate, a co-factor, or an antigen), while in other embodiments, the functional site is a protein-protein interaction site or domain. In certain preferred embodiments, the functional site is an enzymatic active site. Particularly preferred embodiments concern functional sites other than a divalent metal ion binding site.
A functional site descriptor typically comprises a set of geometric constraints for one or more atoms in each of two or more amino acid residues comprising a functional site of a protein. Preferably, at least one of said two or more amino acid residues is also identified as a particular amino acid residue or set of amino acid residues. In preferred embodiments, the said one or more atoms is selected from the group consisting of amide nitrogens, α-carbons, carbonyl carbons, and carbonyl oxygens within a polypeptide backbone, β-carbons of amino acid residues, and pseudoatoms. In particularly preferred embodiments, at least one of said one or more atoms is an amide nitrogen, an α-carbon, a β-carbon, or a carbonyl oxygen within a polypeptide backbone.
In certain embodiments, a functional site descriptor represents 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 amino acid residues (or sets of residues) that comprise the corresponding functional site. While an FSD may include one or more identity constraints with respect to any amino acid, such constraints preferably make reference to naturally occurring amino acids, particularly naturally occurring L amino acids, including those selected from the group consisting of Ala, Arg, Asn, Asp, Cys, Gln, Glu, Gly, His, Ile, Leu, Lys, Met, Phe, Pro, Ser, Thr, Trp, Tyr, and Val.
The geometric constraints of an FSD preferably are selected from the group consisting of an atomic position specified by a set of three dimensional coordinates, an interatomic distance (or range of interatomic distances), and an interatomic bond angle (or range of interatomic bond angles). When a geometric constraint refers to atomic position, reference is typically made to a set of three dimensional coordinates. Such constraints preferably relate to RMSDs, particularly those that allow the atomic position to vary within a preselected RMSD, for example, by an amount of less than about 3 Å, less than about 2.5 Å, less than about 2.0 Å, less than about 1.5 Å, or less than about 1.0 Å.
Other geometric constraints concern interatomic distances, preferably interatomic distance ranges, or interatomic bond angles, preferably interatomic bond angle ranges.
In some embodiments, an FSD can also include one or more conformational constraints that refer to the presence of a particular secondary structure, for example, a helix, or location, for example, near the amino or carboxy terminus of a protein.
In preferred embodiments, an FSD refers to at least one atom from each of 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, or 15 amino acid residues that comprise the corresponding functional site. In many embodiments, all of the atoms for which geometric constraints are provided comprise a part of the polypeptide backbone and are selected from the group consisting of xcex1-carbons, amide nitrogens, carbonyl carbons, and carbonyl oxygens. Of course, one or more of such atoms can be a pseudoatom. Representative examples of pseudoatoms are centers of mass, such as may be derived from at least two atoms, such as two or more atoms from one amino acid residue or two or more atoms from at least two amino acid residues of the protein.
Particularly preferred FSDs are those comprising multiple geometric constraints. Representative examples of such FSDs are a three atom functional site descriptor, a four atom functional site descriptor, a five atom functional site descriptor, a six atom functional site descriptor, a seven atom functional site descriptor, an eight atom functional site descriptor, a nine atom functional site descriptor, a ten atom functional site descriptor, an eleven atom functional site descriptor, a twelve atom functional site descriptor, a thirteen atom functional site descriptor, a fourteen atom functional site descriptor, and a fifteen atom functional site descriptor.
Preferably, FSDs according to the invention are implemented in electronic form.
Certain embodiments of the invention also concern libraries of FSDs, in electronic or other form. Preferably, such a library comprises at least two functional site descriptors for at least one of the biological functions represented by the library.
Another aspect of the invention concerns methods of identifying a protein as having a particular biological function. Such methods may also be referred to as function screening methods. Typically, such methods comprise applying a functional site descriptor according to the invention to a structure of a protein and determining whether the protein has the biological function. This determination is made if application of the functional site descriptor reveals that a portion of the structure of the protein matches, or satisfies, the constraints of the functional site descriptor.
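As a minimal, non-limiting sketch of such a screening method, the following Python applies a simplified descriptor to a reduced protein structure. The data layout is an assumption made for illustration: the structure is a list of (residue name, α-carbon coordinate) pairs, the descriptor holds one set of permitted residue identities per site plus distance ranges between pairs of sites, and the search is an exhaustive enumeration that is practical only for small descriptors. None of the names below come from this specification.

```python
import math
from itertools import permutations


def apply_fsd(structure, identities, distance_ranges):
    """Return every assignment of residues to descriptor sites that satisfies all constraints.

    structure       -- list of (residue_name, (x, y, z)), one entry per residue
    identities      -- list of sets of permitted residue names, one per descriptor site
    distance_ranges -- {(site_i, site_j): (low, high)} distance ranges between sites
    """
    hits = []
    for combo in permutations(range(len(structure)), len(identities)):
        # Identity constraints: each chosen residue must be one of the permitted types.
        if any(structure[r][0] not in allowed for r, allowed in zip(combo, identities)):
            continue
        # Geometric constraints: each inter-site distance must fall within its range.
        if all(
            low <= math.dist(structure[combo[i]][1], structure[combo[j]][1]) <= high
            for (i, j), (low, high) in distance_ranges.items()
        ):
            hits.append(combo)
    return hits


# Hypothetical usage: two sites, the second permitting either Cys or Ser.
structure = [("CYS", (0.0, 0.0, 0.0)), ("GLY", (3.8, 0.0, 0.0)), ("CYS", (5.0, 0.0, 0.0))]
identities = [{"CYS"}, {"CYS", "SER"}]
distance_ranges = {(0, 1): (4.5, 5.5)}
hits = apply_fsd(structure, identities, distance_ranges)
has_function = bool(hits)
```

In this sketch the protein is predicted to have the function when at least one assignment satisfies every constraint, which corresponds to a portion of the structure matching the descriptor.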
In some embodiments of such methods, the structure(s) to which one or more FSDs is(are) applied is(are) of high resolution. High resolution structures can be obtained by a variety of methods, including x-ray crystallography and nuclear magnetic resonance.
Preferred embodiments involve application of one or more FSDs to predicted protein structures, especially inexact, three dimensional structural protein models. Such models can be generated by a variety of techniques, including by application of an ab initio folding program, a threading program, or a homology modeling program.
FSDs can be applied to protein structures derived from any organism, be they prokaryotic or eukaryotic. Prokaryotic organisms the proteins of which may be screened according to the instant methods include bacteria. Eukaryotic organisms include plants and animals, particularly those of medical or agricultural import. A representative class is mammals, including bovine, canine, equine, feline, ovine, porcine, and primate animals, as well as humans. The methods may also be applied to study viral protein function.
In certain embodiments, the methods of the invention are practiced using a plurality of functional site descriptors and/or a plurality of protein structures, of the same or different proteins, preferably applying the descriptors to a plurality of structures for a plurality of proteins.
Another aspect of the invention concerns methods of making FSDs for functional sites of proteins (other than divalent metal ion binding sites), which FSDs can then be applied to inexact, three dimensional structural protein models.
Yet another aspect concerns computer program products comprising a computer useable medium having computer program logic recorded thereon for creating a functional site descriptor for use in predicting a biological function of a protein. Such computer program logic preferably comprises computer program code logic configured to perform a series of operations, including determining a set of geometric constraints for a functional site associated with a biological function of a protein; modifying one or more geometric constraints of said set of geometric constraints to produce a modified set of geometric constraints; comparing said modified set of geometric constraints to a data set of functional sites correlated with said biological function to determine whether said modified set of geometric constraints compares favorably with said data set of functional sites correlated with said biological function and, if so; comparing said modified set of geometric constraints to a data set of functional sites not correlated with said biological function to determine whether said modified set of geometric constraints compares favorably with said data set of functional sites not correlated with said biological function and, if so; repeating said modifying and comparing operations to modify one or more of said geometric constraints of said set of geometric constraints to an extent that said modified set of geometric constraints compares favorably with said data set of functional sites correlated with said biological function without encompassing a predetermined amount of data sets not correlated with said biological function.
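One possible reading of the modify-and-compare operations recited above is sketched below in Python, strictly as an illustration. The sketch makes several simplifying assumptions that are not stated in this specification: the descriptor is reduced to a single distance range, each data set is a list of distances measured in known sites, "modifying" means widening the range by a fixed step, and "a predetermined amount" is expressed as a maximum fraction of non-functional sites that may fall inside the range. All names and default values are hypothetical.

```python
def relax_range(initial, positives, negatives, step=0.25,
                max_false_fraction=0.05, max_rounds=100):
    """Widen a (low, high) range until it covers the functional sites,
    stopping early if too many non-functional sites are encompassed."""
    low, high = initial
    best = (low, high)
    for _ in range(max_rounds):
        covered = sum(low <= d <= high for d in positives)
        false_hits = sum(low <= d <= high for d in negatives)
        # Compare against sites NOT correlated with the function.
        if negatives and false_hits / len(negatives) > max_false_fraction:
            break  # this widening went too far; keep the previous range
        best = (low, high)
        # Compare against sites correlated with the function.
        if covered == len(positives):
            break  # every known functional site is now matched
        # Modify the constraint and repeat.
        low, high = low - step, high + step
    return best


# Hypothetical usage with made-up distances.
relaxed = relax_range((5.0, 5.0), positives=[4.5, 5.0, 5.5], negatives=[7.0, 8.0])
```

The loop keeps the last range that stayed within the false-match limit, so the returned descriptor is the most relaxed one tried that still discriminates between the two data sets under these assumptions.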
In preferred embodiments, the operation of determining a set of geometric constraints of a functional site correlated with a biological function of a protein comprises receiving said set of geometric constraints from at least one of the group of a data set of predetermined geometric constraints or from user input. When modifying one or more geometric constraints of said set of geometric constraints to produce a modified set of geometric constraints, a predetermined variance can be associated with one or more of the geometric constraints to adjust the same.
In preferred embodiments, the operation of modifying one or more geometric constraints of said set of geometric constraints to produce a modified set of geometric constraints comprises computing an average value for a geometric constraint within the set of geometric constraints by determining values for said geometric constraint from two different proteins having functional sites that correlate with said biological function, and calculating said average value; computing a standard deviation with respect to such geometric constraint; and applying a multiplier to said computed standard deviation to generate said modified geometry.
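The averaging operation just described can be illustrated with the short Python sketch below. It assumes that the geometric constraint is a single scalar (for example, an interatomic distance) measured in two or more proteins known to share the function, and that the modified constraint is the range given by the mean plus or minus a multiple of the standard deviation. The use of the sample standard deviation, the multiplier value, and the function name are assumptions made for illustration only.

```python
import statistics


def constraint_range(values, multiplier=2.0):
    """Return (low, high) = mean -/+ multiplier * standard deviation."""
    if len(values) < 2:
        raise ValueError("values from at least two proteins are required")
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (an assumption)
    return (mean - multiplier * sd, mean + multiplier * sd)


# Hypothetical usage: the same distance measured in three related proteins.
low, high = constraint_range([4.0, 5.0, 6.0], multiplier=2.0)
```

A larger multiplier yields a more relaxed constraint, while a smaller one yields a more stringent descriptor.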
Other features and advantages of the invention will be apparent from the following description of the preferred embodiments thereof, and from the claims.