Specific protein interactions are critical events in most biological processes. A clear understanding of the way proteins interact, of their three-dimensional structure, and of the types of molecules which might block or enhance interaction is a critical aspect of the science of drug discovery in the pharmaceutical industry.
Proteins are made up of strings of amino acids and each amino acid in a string is coded for by a triplet of nucleotides present in DNA sequences (Stryer 1997). The linear sequence of DNA code is read and translated by a cell's synthetic machinery to produce a linear sequence of amino acids which then folds to form a complex three-dimensional protein.
The mechanisms which govern protein folding are multi-factorial, being the summation of a series of interactions between biophysical phenomena and other protein molecules (Stryer 1997). Virtually all molecules signal by non-covalent attachment to another molecule (“binding”). Despite the conceptual simplicity and tremendous importance of molecular recognition, the forces and energetics that govern it are poorly understood. This is because the two primary binding forces (electrostatics and van der Waals interactions) are weak, and roughly of the same order of magnitude. Moreover, binding at any interface is complicated by the presence of solvent (water), solutes (metal ions and salt molecules), and dynamics within the protein, all of which can inhibit or enhance the binding reaction.
In general it is held that the primary structure of a protein determines its tertiary structure. A large volume of work supports this view and many sources of software are available to scientists for producing models of protein structures (Sansom 1998). In addition, a considerable effort is underway to build on this principle and generate a definitive database demonstrating the relationships between primary and tertiary protein structures. This endeavour is likened to the human genome project and is estimated to have a similar cost (Gaasterland 1998).
Despite this assembly of background knowledge it is clear that there are considerable limitations in our abilities to predict protein structures and that these become very apparent when computational methods are applied during drug discovery programs. For many experienced practitioners the use of ‘docking’ programmes (which seek to examine protein-ligand interactions in detail) is ‘disappointing’ (Sansom 1998).
Consider this example. A typical growth factor has a molecular weight of 15,000 to 30,000 daltons, whereas a typical small molecule drug has a molecular weight of 300 to 700 daltons. Moreover, X-ray crystal structures of small molecule-protein complexes (such as biotin-avidin) or enzyme-substrate complexes show that the small molecules usually bind in crevices, not to flat areas of the protein. Thus, relative to enzymes and receptors, protein-protein targets are non-traditional and the pharmaceutical community has had very limited success in developing drugs that bind to them using currently available approaches to lead discovery. High throughput screening technologies in which large (combinatorial) libraries of synthetic compounds are screened against one or more target proteins have failed to produce a significant number of lead compounds.
It is possible that a large portion of the difficulties experienced in attempting to apply such computer programs to drug discovery results from an over-reliance on the consensus dogma that primary structure predicts tertiary structure.
This consensus view of the determinants of protein structure has been re-evaluated in the light of experiments with colicin E1 (Goldstein 1998). This scientific work demonstrated that ‘modules of secondary structure that make up a given protein are not rigidly constrained in a single set of interactions that lead to a unique three-dimensional structure’ (Goldstein 1998).
The data generated in such studies also present further issues for large structural projects such as that described by Gaasterland (1998). Proteins are identified and their function ascribed by homology searches for particular structural elements associated with a given function (e.g. transmembrane domains, enzyme cleavage sites, β-barrel fold etc.). In effect there exists a circular logic to the way in which protein structures are explored and described, and this hampers our understanding of their true biological significance since we are only searching for those things we already know.
‘Given these considerations, structural genomists might consider assigning a high priority to understanding the extent to which protein-protein and other molecular interactions determine native folding patterns before their databases get too full’ (Goldstein 1998).
The binding of large proteinaceous signaling molecules (such as hormones) to cellular receptors regulates a substantial portion of cellular processes and functions. These protein-protein interactions are distinct from the interaction of substrates with enzymes or of small molecule ligands with seven-transmembrane receptors. Protein-protein interactions occur over relatively large surface areas, as opposed to the interactions of small molecule ligands with serpentine receptors, or enzymes with their substrates, which usually occur in focused “pockets” or “clefts.”
Many major diseases result from the inactivity or hyperactivity of large protein signaling molecules. For example, diabetes mellitus results from the absence or ineffectiveness of insulin, and dwarfism from the lack of growth hormone. Thus, simple replacement therapy with recombinant forms of insulin or growth hormone heralded the beginnings of the biotechnology industry. However, nearly all drugs that target protein-protein interactions or that mimic large protein signaling molecules are also large proteins. Protein drugs are expensive to manufacture, difficult to formulate, and must be given by injection or topical administration.
It is generally believed that because the binding interfaces between proteins are very large, traditional approaches to drug screening or design have not been successful. In fact, for most protein-protein interactions, only small subsets of the overall intermolecular surfaces are important in defining binding affinity.
‘One strongly suspects that the many crevices, canyons, depressions and gaps, that punctuate any protein surface are places that interact with numerous micro- and macro-molecular ligands inside the cell or in the extra-cellular spaces, the identity of which is not known’ (Goldstein 1998).
Despite these complexities, recent evidence suggests that protein-protein interfaces are tractable targets for drug design when coupled with suitable functional analysis and more robust molecular diversity methods. For example, the interface between hGH and its receptor buries ~1300 square Angstroms of surface area and involves 30 contact side chains across the interface. However, alanine-scanning mutagenesis shows that only eight side chains at the center of the interface (covering an area of about 350 square Angstroms) are crucial for affinity. Such “hot spots” have been found in numerous other protein-protein complexes by alanine-scanning, and their existence is likely to be a general phenomenon.
The problem therefore is to define the small subset of regions that define the binding or functionality of the protein.
This matters commercially because a more efficient way of identifying these regions would greatly accelerate the process of drug development.
These complexities are not insoluble problems and newer theoretical methods should not be ignored in the drug design process. Nonetheless, for the near future there are no good algorithms that allow one to predict protein binding affinities quickly, reliably, and with high precision (Sunesis website 17/9/99).
The invention provides a method and a software tool for processing sequence data and a method and a software tool for protein structure analysis, and the data forming the product of each method, as defined in the appended independent claims to which reference should be made. Preferred or advantageous features of the invention are set out in dependent subclaims.
The invention provides a method and a software tool for use in analysing and manipulating sequence data (e.g. both DNA and protein) such as is found in large databases (see Table 1). Advantageously it may enable the conducting of systematic searches to identify the sequences which code for key intermolecular surfaces or “hot spots” on specific protein targets.
This technology may advantageously have significant uses in applying informatics to sequence databases in order to identify lead molecules for important pharmaceutical targets.
DNA is composed of two helical strands of nucleotides (see FIG. 10). The concepts governing the genetic code and the fact that DNA codes for protein sequences are well known (Stryer 1997). The ‘sense’ strand codes for the protein, and as such, attracts all the attention of molecular biologists and protein chemists alike. The purpose of the other ‘anti-sense’ strand is more elusive. To most, its function is relegated to that of a molecular ‘support’ for the ‘sense’ strand, which is used when DNA is replicated (Stryer 1997) but is of little immediate functional significance for the day to day activities of cellular processes.
Some research suggests a role for the antisense strand of DNA beyond that assigned to it in the basic conceptual model of replication. In particular, it had been noticed that there appeared to be a potential functional relationship between sense and anti-sense strands in viruses. Mekler (1969) observed that several minus-stranded virus complexes contained protein components translated from the mRNA complementary to the RNA of the viral gene. Mekler postulated that the significance of this finding was that because this viral protein interacts strongly with the RNA from which the mRNA was generated, a peptide chain may associate specifically with the coding strand of its own gene. It was later thought that this may provide a rationale for the ability of a protein to regulate the transcription of its own gene.
Mekler's original theory was supported by studies on antigen processing pathways. Specifically, an antibody-synthesizing RNA complex was found to bind to its antigen with high affinity (Fishman and Adler, 1967). Mekler contended that these results demonstrated the ability of a protein antigen to regulate its own synthesis by binding to the mRNA encoding the antibody (Mekler, 1969). As the binding between the active centre of the antibody and the antigenic determinant is well known to be based on associations of polypeptide chains, he proposed that two interacting polypeptides may be encoded in complementary strands of DNA (FIG. 11). Mekler also analysed the proposed interacting regions of pancreatic ribonuclease A and recorded that reading the complementary RNA of one of the interacting chains in the 5′-3′ direction yielded the sequence of the other interactant. From these observations he suggested that there existed a specific code of interaction between amino acid side chains encoded by complementary codons at the RNA level (Table 2).
Collectively, these observations represented the first predictions of a sense-complementary peptide-binding complex.
One key feature of Mekler's theory was that, due to the degeneracy of the genetic code, one amino acid may be complementarily related to as many as four others, allowing for a large variety of possible interacting sequences (Table 2).
In 1981, Mekler revised his original theory and described a ‘general stereochemical genetic code’ (Mekler and Idlis, 1981) in which it was reported that the complementary pairings detailed in the above table formed three distinct groupings (FIG. 11).
Mekler noted that, in general, amino acids with non-polar side chains were related by complementary code to amino acids with polar side chains. He did not provide an explanation for this. Further theoretical considerations on the possibility of complementary-sense peptide recognition were independently developed by Biro (1981), Root-Bernstein (1982) and Blalock and Smith (1984). Biro (1981) conducted a computational comparison of DNA sequences encoding protein ligand-receptor segments and showed that there were many complementary regions between them, giving rise to complementarily related polypeptides.
Blalock and Smith (1984) observed that the hydropathic character of an amino acid residue is related to the identity of the middle base of the triplet codon from which it is translated. Specifically, a triplet codon with thymine (T) as its middle base codes for a hydrophobic residue whilst one with adenine (A) as its middle base codes for a hydrophilic residue. Triplet codons with cytosine (C) or guanine (G) as the middle base encode residues which are relatively neutral and have similar hydropathy scores. Hydropathy is an index of the affinity of an amino acid for a polar environment: hydrophilic residues yield more negative scores, whilst hydrophobic residues exhibit more positive scores. Kyte and Doolittle (1982) conceived the most widely used scale of this type. The observed relationship between the middle base of a triplet codon and residue hydropathy entails that peptides encoded by complementary DNA will exhibit complementary, or inverted, hydropathic profiles.
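The relationship between middle base and hydropathy can be checked with a short sketch. The Python below is illustrative only: it assumes the standard genetic code and the Kyte and Doolittle (1982) index, and the name hydropathy_by_middle_base is hypothetical, not part of any method described here.

```python
# Illustrative sketch: group residues by the middle base of their codons.
# Assumes the standard genetic code; codons ordered T, C, A, G per position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Kyte-Doolittle hydropathy index (positive = hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_by_middle_base():
    """Map each middle base to the sorted scores of the residues it encodes."""
    groups = {base: set() for base in BASES}
    for codon, residue in CODON_TABLE.items():
        if residue != "*":  # skip stop codons
            groups[codon[1]].add(residue)
    return {base: sorted(KD[r] for r in residues)
            for base, residues in groups.items()}


if __name__ == "__main__":
    for base, scores in hydropathy_by_middle_base().items():
        print(base, scores)
```

On this scale every residue with middle base T scores above zero and every residue with middle base A scores below zero, while the C and G groups contain residues of both signs.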
It was proposed that because two peptide sequences encoded in complementary DNA strands display inverted hydropathic profiles, they may form amphipathic secondary structures, and bind to one another (Bost et al., 1985).
Complementary peptides have been reported to form binding complexes with their ‘sense’ peptide counterparts (Root-Bernstein and Holsworthy, 1998). Evidence of such an interaction has now been reported for over forty different systems from many different authors (Table 3).
The reports listed cite experiments showing specific interactions between complementary peptide pairs. As such they demonstrate a variety of ways in which these peptide ligands may be utilised.
The scope of this analysis for explaining the interactions between proteins was further developed by Blalock to propose a Molecular Recognition Theory (MRT) (Bost and Blalock 1985, Blalock 1995, FIG. 13). This theory suggests that a ‘molecular recognition’ code of interaction exists between peptides encoded by complementary strands of DNA based on the observation that such peptides will exhibit inverted hydropathic profiles.
Blalock suggested that it is the linear pattern of amino acid hydropathy scores in a sequence (rather than the combination of specific residue identities) that defines the secondary structure environment. Furthermore, he suggested that sequences with inverted hydropathic profiles are complementary in shape by virtue of inverse forces determining their steric relationships.
As a corollary to his original work, Blalock contended that as well as reading a complementary codon in the usual 5′-3′ direction, reading a complementary codon in the 3′-5′ direction would also yield amino acid sequences that displayed opposite hydropathic profiles (Bost et al., 1985). This follows from the observation that the middle base of a triplet codon determines the hydropathy index of the residue it codes for, and thus reading a codon in the reverse direction may change the identity, but not the hydropathic nature, of the coded amino acid (Table 4).
Statistical studies at the DNA level must take into account the degeneracy of the genetic code as it allows for the existence of larger inter- or intramolecular complementary sequences without maintaining complementarity at the DNA level. In this vein, recent work by Baranyi et al. (1995) details a new protein structural motif called the Antisense Homology Box (AHB). Following an analysis of a protein sequence data bank for possible intramolecular complementary pairs, it was noted that there are many more regions of peptide complementarity within the structures than statistically expected.
The reported frequency of these motifs is, on average, one per fifty residues. AHB areas have already been shown to be able to act as molecular recognition sites by studies involving function inhibition with peptide complements. Specifically, the endothelin peptide (ET-1) was inhibited by a 14 residue fragment of the endothelin A receptor in a smooth muscle relaxation assay (Baranyi et al., 1996), whilst complementary encoded regions of the C5a receptor antagonize C5a anaphylatoxin (Baranyi et al., 1996). These studies suggest that many interactions in nature may result from contacts between complementary related polypeptides.
Several investigations have been directed at gaining an understanding of how hydropathic profiles and binding constants between complementary peptides are connected. The most comprehensive of these was carried out by Fassina et al. (1989) who studied the relationship between a complementary peptide designed on a computer to maximize complementary hydropathy against a thirteen-residue section of a glycoprotein. The study demonstrates a positive correlation between binding constants, as determined by an affinity binding column assay, and the degree of hydropathic complementarity, implying that a peptide's hydropathic character is inextricably linked to the binding mechanism.
This interesting result suggests that binding between two complementary related peptides is determined solely by the hydropathicity. Importantly, it also suggests that the steric nature of the side chain alone does not directly influence the ability of peptides to recognise each other, for in general, residues with similar hydropathic character display a wide distribution of side chain shapes and sizes.
The generation of a complementary peptide is straightforward in cases where the DNA sequence information is available. The complementary base sequence is read in either the 5′-3′ or 3′-5′ direction and translated to the peptide sequence according to the genetic code. In the absence of knowledge of the nucleotide sequence of the sense peptide, many possible permutations of complementary sequences exist, in accordance with the degeneracy of the genetic code (as shown in Tables 2 and 4).
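As a minimal sketch of the case where the DNA sequence is known, the Python below derives the complementary peptide for both reading directions. It assumes the standard genetic code and an in-frame sense sequence; the function names translate and complementary_peptides are illustrative and do not come from the described software.

```python
# Illustrative sketch: complementary peptides from a known sense DNA sequence.
# Assumes the standard genetic code; codons ordered T, C, A, G per position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate an in-frame DNA string codon by codon; '*' marks a stop."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def complementary_peptides(sense_dna):
    """Return the complementary peptide for both reading directions.

    '5-3': complementary strand read in its own 5'-3' direction
           (the reverse complement of the sense strand).
    '3-5': complementary strand read 3'-5', codon by codon alongside
           the sense strand (plain complement, no reversal).
    """
    complement = sense_dna.upper().translate(COMPLEMENT)
    return {"5-3": translate(complement[::-1]), "3-5": translate(complement)}


if __name__ == "__main__":
    sense = "CTGAAA"  # encodes Leu-Lys
    print(translate(sense), complementary_peptides(sense))
```

Note that the 5′-3′ peptide runs antiparallel to the sense peptide, so its first residue corresponds to the last sense codon.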
Several approaches to define complementary sequences in such instances have been proposed:
One such approach makes a series of educated guesses based on preferred codon usage tables (Aota et al. 1988), which allow one to assess the probability that a particular codon is used for each amino acid in a given sequence.
Another approach, where applicable, is to assign as the complementary residue the amino acid which is the most frequent out of all the theoretical complementary residues.
Thus, in a situation where the DNA sequence is unknown, the possible complementary residues for a leucine residue (six codons, with the complement read 5′-3′ under the standard genetic code) are glutamine (2 possible codons), stop (2 possible codons), glutamic acid (1 possible codon) and lysine (1 possible codon). In this case glutamine, the most frequent amino acid among the candidates, would be chosen on the basis of statistical weight. Information such as this, along with the use of codon usage tables, leads to a consensus approach to limiting the number of possible combinations of complementary sequences. Bost and Blalock (1989), Omichinski et al. (1989) and Shai et al. (1989) have employed methods of this type.
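The frequency-based choice can be sketched as follows. This is an illustrative Python example assuming the standard genetic code and 5′-3′ reading of the complementary codon; the function names are hypothetical, and codon usage weighting is deliberately left out. For leucine's six codons the tally is two glutamine, two stop, one glutamic acid and one lysine.

```python
# Illustrative sketch: pick the most frequent complementary residue when
# the DNA sequence is unknown. Assumes the standard genetic code.
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_residue_counts(residue):
    """Count residues encoded by the reverse complement of each codon."""
    counts = Counter()
    for codon, encoded in CODON_TABLE.items():
        if encoded == residue:
            partner_codon = codon.translate(COMPLEMENT)[::-1]
            counts[CODON_TABLE[partner_codon]] += 1
    return counts


def most_frequent_complement(residue):
    """Return the most frequent complementary amino acid, ignoring stops.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = complementary_residue_counts(residue)
    counts.pop("*", None)
    return max(sorted(counts), key=counts.get)


if __name__ == "__main__":
    print(dict(complementary_residue_counts("L")))
    print(most_frequent_complement("L"))
```

A fuller treatment would weight each codon by a codon usage table for the organism in question rather than counting codons equally.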
A number of studies have demonstrated the value of this type of approach to designing peptides with real functional utility.
Although some very high affinities have been reported for these peptides (Kd ~10^-9 M), most are of moderate affinity (Kd ~10^-3 to 10^-7 M). Their potential applications therefore would depend on the affinity attained in a particular system. Lower affinity complementary peptides may be useful for diagnostic tests or for purification of ligands. Higher affinity peptides may serve a purpose in the development of therapeutics, for example a complementary peptide to a coat protein of a virus may interfere with the virus-host interaction at the molecular level, thus providing a strategy to manage this type of disorder.
Although the importance of inverted hydropathy in protein-protein interactions has long been recognized (Blalock and Smith, 1984) there has been little activity to apply this method on a large scale to investigate the complementary peptide partners of many proteins. One such attempt is recorded in the literature. “In the design of computer-based mining tools, no attention has been paid to a unique feature in the genetic code that determines the basic physico-chemical character of the encoded amino acids” (Kohler and Blalock, 1998). They proposed a method to scan DNA sequence banks using the hydropathic binary code, U.S. Pat. No. 5,523,208. The method described differs from the current invention as outlined below.
The current invention finds regions of potentially interacting amino acid sequences by using the relationships outlined in Tables 2 and 4. U.S. Pat. No. 5,523,208 determines regions of potentially interacting peptides by an altogether different method, that of hydropathy scoring. The results of analyses are thus completely different.
The processes (algorithms) by which sequences are analysed in the current invention differ from those described in U.S. Pat. No. 5,523,208. In particular, the current invention describes different algorithms for the analysis of complementary regions between proteins, or within proteins.
The current problems associated with design of complementary peptides are:
A lack of understanding of the forces of recognition between complementary peptides
An absence of software tools to facilitate searching and selecting complementary peptide pairs from within a protein database.
A lack of understanding of statistical relevance/distribution of naturally encoded complementary peptides and how this corresponds to functional relevance.
Based on these shortfalls, embodiments of the invention provide the following technological advances in this field:
A mini library approach to define forces of recognition between human Interleukin (IL) 1β and its complementary peptides;
A high throughput computer system to analyse an entire database for intra/inter-molecular complementary regions; and
A novel (computational) method of analyzing X-ray crystal files for potential discontinuous complementary binding sites.
Studies into preferred complementary peptide pairings between IL-1β and its complementary ligand reveal the importance of both the genetic code and complementary hydropathy for recognition. Specifically, in our example, the complementary peptide with the highest affinity is the one specified by the genetic code for the corresponding region of the protein. An important observation is that this complementary peptide maps spatially and by residue hydropathic character to the interacting portion of the IL-1R receptor, as elucidated by the X-ray crystal structure (Brookhaven reference pdb2itb.ent).
Using these novel observations as guiding principles for analysis, we have developed a computational analysis system to evaluate the statistical and functional relevance of intra/inter-molecular complementary sequences.
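A search for complementary regions between two sequences can be pictured with the sketch below. It is not the claimed algorithm: it is a minimal Python illustration that assumes the residue pairings follow from the standard genetic code with the complement read 5′-3′, and that the two frames pair in antiparallel orientation. The names complementary_pairs and scan_frames, and the parameters frame and min_percent, are hypothetical.

```python
# Illustrative sketch: scan two protein sequences for complementary frames.
# Pairings are derived from the standard genetic code (an assumption),
# not copied from the tables referred to in the text.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complementary_pairs():
    """Residue pairs whose codons are reverse complements of each other."""
    pairs = set()
    for codon, residue in CODON_TABLE.items():
        partner = CODON_TABLE[codon.translate(COMPLEMENT)[::-1]]
        if residue != "*" and partner != "*":
            pairs.add((residue, partner))
    return pairs


PAIRS = complementary_pairs()


def scan_frames(seq_a, seq_b, frame, min_percent):
    """Return (i, j, percent) for every pair of frames meeting the threshold.

    seq_a[i:i+frame] is compared with seq_b[j:j+frame] read backwards,
    i.e. the two frames are assumed to pair antiparallel.
    """
    hits = []
    for i in range(len(seq_a) - frame + 1):
        window_a = seq_a[i:i + frame]
        for j in range(len(seq_b) - frame + 1):
            window_b = seq_b[j:j + frame][::-1]
            matches = sum((x, y) in PAIRS for x, y in zip(window_a, window_b))
            percent = 100.0 * matches / frame
            if percent >= min_percent:
                hits.append((i, j, percent))
    return hits


if __name__ == "__main__":
    print(scan_frames("ALKA", "GFQG", frame=4, min_percent=100))
```

Passing the same sequence as both arguments gives an intramolecular scan; looping the function over all sequence pairs in a database would give the intermolecular case, at quadratic cost in this naive form.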
This invention provides significant benefits for those interested in:
The analysis and acquisition of peptide sequences to be used in the understanding of protein-protein interactions.
The development of peptides or small molecules which could be used to manipulate these interactions.
The advantages of this invention over previous work in this field include:
Using a valid statistical model. Previously, complementary mappings within protein structures have been statistically validated by assuming that the occurrence of individual amino acids is equally weighted at 1/20 (Baranyi, 1995). Our statistical model takes into account the natural occurrence of amino acids and thus generates probabilities dependent on sequence rather than content per se.
Facilitation of batch searching of an entire database. Previously, investigations into the significance of naturally encoded complementary related sequences have been limited to small sample sizes with non-automated methods. The invention allows for analysis of an entire database at a time, overcoming the sampling problem, and providing for the first time an overview or ‘map’ of complementary peptide sequences within known protein sequences.
The ability to map complementary sequences as a function of frame size and percentage antisense amino acid content. Previously, no consideration has been given to the significance of the frame length of complementary sequences. Our invention produces a statistical map as a function of frame size and percentage complementary residue content such that the statistical importance of how nature selects these frames may be evaluated.
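The difference between equal weighting at 1/20 and weighting by natural occurrence can be illustrated with a short sketch. The Python below is one possible reading, not the model of the invention: it assumes positions are independent, derives pairings from the standard genetic code, and estimates residue frequencies from whatever background sequences the caller supplies. All function names are hypothetical.

```python
# Illustrative sketch: chance probability that a random frame complements
# a given frame, under supplied residue frequencies (independence assumed).
from collections import Counter
from math import prod

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")
PAIRS = {
    (residue, CODON_TABLE[codon.translate(COMPLEMENT)[::-1]])
    for codon, residue in CODON_TABLE.items()
    if residue != "*" and CODON_TABLE[codon.translate(COMPLEMENT)[::-1]] != "*"
}


def residue_frequencies(sequences):
    """Estimate residue frequencies from a collection of protein sequences."""
    counts = Counter("".join(sequences))
    total = sum(counts.values())
    return {residue: n / total for residue, n in counts.items()}


def partner_probability(residue, freqs):
    """Probability that a randomly drawn residue complements `residue`."""
    return sum(f for other, f in freqs.items() if (residue, other) in PAIRS)


def frame_probability(frame, freqs):
    """Probability that a random frame complements `frame` at every position."""
    return prod(partner_probability(residue, freqs) for residue in frame)


if __name__ == "__main__":
    uniform = {residue: 1 / 20 for residue in "ACDEFGHIKLMNPQRSTVWY"}
    print(frame_probability("LK", uniform))
```

Because the per-position probability depends on which residue occupies that position, two frames of the same length generally receive different chance probabilities, which is what makes the estimate sequence dependent. Repeating the calculation over a range of frame sizes and match thresholds would give the kind of statistical map described above.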