Proteins are central to life due to their crucial involvement in a variety of biological processes, such as enzyme catalysis of biochemical reactions, control of nucleic acid transcription and replication, hormonal regulation, signal transduction cascades and antigen recognition during immune responses.
In many cases, one or more structural regions of a protein are responsible for a particular function, hereinafter referred to as “functional regions”. These regions may constitute the active site of a protein enzyme, the nucleic acid binding domain of a transcription factor, a region of a protein cytokine crucial to binding the specific receptor for that cytokine, or antigen-binding regions of antigen receptors.
A functional region of a protein usually comprises one or more amino acids which are required for that particular function, that is, they are essential for that function.
In many cases, although these required amino acid residues are topographically proximal to each other, they may be well separated with respect to primary amino acid sequence, that is, they are non-contiguous. In addition, where there is more than one functional region of a protein, these regions may also be topographically proximal, but well separated in terms of primary amino acid sequence. In some cases, however, where there is more than one functional region involved in a particular function, these functional regions may also be topographically well separated. This is a particularly important point with regard to the functional regions of cytokines.
“Cytokine” as used herein includes and encompasses soluble protein molecules which have a cognate cell surface receptor, and which are involved in initiating, controlling and otherwise regulating a variety of processes relevant to cell growth, death and differentiation. Cytokines are typically exemplified by interferons (e.g. IFN-γ), interleukins (for example IL-2, IL-4 and IL-6), growth and differentiation factors [e.g. granulocyte colony stimulating factor (G-CSF) and erythropoietin (EPO)] and others such as growth hormone (GH), prolactin, TGF-β, tumour necrosis factor (TNF) and insulin. Each of these molecules is capable of binding a specific receptor and thereby eliciting a particular biological response or set of responses.
The fact that a particular function of a protein can be attributed to one or more functional regions of that protein has formed the basis for strategies aimed at modifying a protein by adding or subtracting functional regions to modify the function of that protein.
In this regard, the design and engineering of cytokine mimetics has become an area of major importance, as many cytokine-cytokine receptor interactions are central to the regulation of a variety of biological processes. It is envisaged that new mimetics will therefore become important new therapeutic agents that either mimic or inhibit the biological response to cytokine-cytokine receptor interactions.
A “mimetic” is a molecule which elicits a biological response either similar to, or more powerful than, that of another molecule (an “agonist”), or inhibits the action of the other molecule (an “antagonist”). The other molecule may be a cytokine, for example.
With regard to designing and engineering mimetics based on cytokines, a problem frequently encountered with many engineered mimetics has been that they exhibit short biological half-lives and hence minimal bioavailability and efficacy. In this regard, it has been proposed that small cysteine-rich proteins might be useful as protein “scaffolds” for engineering mimetics, due to their stability (Vita et al., 1995, Proc. Natl. Acad. Sci. USA 92 6404). These small cysteine-rich proteins comprise a disulfide-bonded core and exposed amino acid side chains at the protein surface (Neilsen et al., 1996, J. Mol. Biol. 263 297). However, the full potential of these proteins has not been realized, for two reasons. First, typical prior art strategies for protein engineering have largely been limited to transferring or exchanging contiguous groups of amino acids within individual secondary structural elements, such as loops, helices or β-sheets. Second, no design strategies exist for selecting the most appropriate disulfide-rich candidate.
Examples of such an approach would include: the exchange of secondary structural regions between RNase and angiogenin, either to confer RNase activity on angiogenin (Harper et al., 1989, Biochemistry 28 1875) or angiogenic activity on RNase (Raines et al., 1995, J. Biol. Chem. 270 17180); the insertion of elastase inhibition activity into IL-1β by transfer of the protease inhibitor loop of elastase to the IL-1β scaffold (Wolfson et al., 1993, Biochemistry 32 5327); the insertion of a 10 amino acid calcium-binding loop of thermolysin into Bacillus subtilis neutral protease (Toma et al., 1991, Biochemistry 30 97); the insertion of a β-sheet from a snake toxin to replace the β-sheet of charybdotoxin (Drakopolou et al., 1996, J. Biol. Chem. 271 11979); and the incorporation of a β-sheet from carbonic anhydrase into the β-sheet of charybdotoxin (Pierret et al., 1995, J. Med. Chem. 35 2145).
Of growing importance in protein engineering has been the use of computer-based technology combined with the elucidation of the 3D structures of small molecules and macromolecules. 3D molecular structures are being generated at an increasing rate, such as by X-ray crystallography and NMR techniques. These 3D structures can be stored in generally accessible, searchable databases, such as the BROOKHAVEN database.
For the purposes of this specification, a database will comprise a collection of “entries”, each entry corresponding to a representation of an aspect of 3D structure of a framework protein. A framework protein is simply any protein for which a 3D structure exists, either by experimental elucidation or by predictive means such as computer modelling. A framework protein is potentially useful as a scaffold which can be structurally modified for the purposes of imparting a particular function thereto.
A “query” refers herein to a representation of an aspect of 3D structure of a protein which exhibits a function of interest. The representation of 3D structure would be in a form suitable for searching a database with the intention of identifying a “hit”. A hit is an entry identified according to the particular query and the algorithm used to perform the search.
An important advance in database searching has been made by representing 3D structures in terms of the relationship between atoms located in “distance space”, rather than “Cartesian space” (Jakes & Willett, 1986, J. Mol. Graphics 4 12; Ho & Marshall, 1993, J. Comp. Aided. Mol. Des. 7 3). A location in Cartesian space is defined by three coordinates (x, y, z) which each correspond to a position along three respective axes (X, Y, Z), each axis being oriented at right angles to the other two.
A location in distance space, however, is defined by distances between atoms, expressed in the form of a distance matrix, which details the distance between atoms. Distance matrices are therefore coordinate independent, and comparisons between distance matrices can be made without restriction to a particular frame of reference, such as is required using Cartesian coordinates.
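By way of illustration only, the coordinate independence of a distance matrix can be sketched as follows. The atom coordinates, the function names and the particular rotation and translation are hypothetical and are chosen solely for the example.

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

def rotate_z_then_shift(point, angle_rad, shift):
    """Rotate a point about the Z axis, then translate it by `shift`."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y + shift[0], s * x + c * y + shift[1], z + shift[2])

def matrices_match(m1, m2, tol=1e-9):
    """True if every pair of corresponding matrix elements agrees within tol."""
    return all(abs(a - b) <= tol
               for row1, row2 in zip(m1, m2)
               for a, b in zip(row1, row2))

# Hypothetical atom positions (arbitrary units), used only for illustration.
atoms = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0), (1.0, 2.0, 1.2)]

# The same arrangement expressed in a different frame of reference.
moved = [rotate_z_then_shift(p, math.radians(40.0), (5.0, -3.0, 2.0))
         for p in atoms]

same_coordinates = all(math.isclose(a, b)
                       for p, q in zip(atoms, moved)
                       for a, b in zip(p, q))
same_distances = matrices_match(distance_matrix(atoms), distance_matrix(moved))
print(same_coordinates, same_distances)
```

In this sketch the Cartesian coordinates of the two copies differ, whereas their distance matrices agree, so the two copies can be compared without first being brought into a common frame of reference.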
It is important to emphasise that an arrangement of atoms and its mirror image are described by identical distance matrices. A root mean squared (RMS) difference can be used to alleviate this ambiguity.
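The mirror-image ambiguity can be illustrated with a short sketch. It should be noted that the sketch does not compute an RMS difference; instead it uses the signed volume of a tetrahedron, a simpler substituted check, to show that two arrangements with identical distance matrices remain distinguishable in Cartesian space. The coordinates and function names are hypothetical.

```python
import math

def distance_matrix(points):
    """Return the symmetric matrix of pairwise Euclidean distances."""
    return [[math.dist(p, q) for q in points] for p in points]

def signed_volume(a, b, c, d):
    """Six times the signed volume of tetrahedron a, b, c, d.

    The magnitude is unchanged by reflection, but the sign flips.
    """
    u = [b[i] - a[i] for i in range(3)]
    v = [c[i] - a[i] for i in range(3)]
    w = [d[i] - a[i] for i in range(3)]
    return (u[0] * (v[1] * w[2] - v[2] * w[1])
            - u[1] * (v[0] * w[2] - v[2] * w[0])
            + u[2] * (v[0] * w[1] - v[1] * w[0]))

# Hypothetical four-atom arrangement and its reflection through the XY plane.
original = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 1.2, 0.0), (0.3, 0.4, 1.1)]
mirror = [(x, y, -z) for x, y, z in original]

identical_matrices = distance_matrix(original) == distance_matrix(mirror)
volume_original = signed_volume(*original)
volume_mirror = signed_volume(*mirror)
print(identical_matrices, volume_original, volume_mirror)
```

The two distance matrices are identical, while the signed volumes have equal magnitude and opposite sign. A comparison made in Cartesian space, such as an RMS difference, can therefore separate a match from its mirror image where a distance matrix alone cannot.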
With regard to the 3D structure of proteins, a simplification of protein structure can be provided by reducing a 3D structure to “Cα–Cβ vectors” as discussed in McKie et al., 1995, Peptides: Chemistry, Structure & Biology p 354–355. A Cα–Cβ vector occupies a location in 3D space, the location being defined by the orientation of the covalent bond between the α carbon and β carbon atoms of an amino acid (Lauri & Bartlett, 1994, J. Comp. Aid. Mol. Des. 8 51). It will be appreciated that each of the 20 naturally-occurring constituent amino acids of a protein (except glycine) possesses a Cα–Cβ vector due to the covalent bond between the “central” α carbon and the β carbon of the constituent side chain.
For those proteins containing Gly in the database, it is possible to mutate this to Ala to generate the required Cα–Cβ vector for database searching.
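One way in which a Cα–Cβ vector might be generated for a glycine residue is to place a virtual β carbon from the backbone N, Cα and C atoms, which is geometrically equivalent to building the alanine side chain. The sketch below assumes a widely used set of ideal-geometry coefficients for an L-amino acid. These coefficients, the backbone coordinates and the function names are assumptions made for illustration and are not taken from the cited references.

```python
import math

def sub(a, b):
    """Component-wise difference a - b of two 3D points."""
    return tuple(a[i] - b[i] for i in range(3))

def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def virtual_cbeta(n, ca, c):
    """Place an idealized C-beta from backbone N, C-alpha and C coordinates.

    The three coefficients are assumed ideal-geometry constants.
    """
    b = sub(ca, n)
    c_vec = sub(c, ca)
    a = cross(b, c_vec)
    return tuple(-0.58273431 * a[i] + 0.56802827 * b[i]
                 - 0.54067466 * c_vec[i] + ca[i] for i in range(3))

def ca_cb_vector(n, ca, c):
    """Return (origin, direction) of the C-alpha to virtual C-beta vector."""
    cb = virtual_cbeta(n, ca, c)
    return ca, sub(cb, ca)

# Hypothetical idealized backbone coordinates (in angstroms) for one glycine.
gly_n = (-0.525, 1.363, 0.000)
gly_ca = (0.000, 0.000, 0.000)
gly_c = (1.526, 0.000, 0.000)

origin, direction = ca_cb_vector(gly_n, gly_ca, gly_c)
bond_length = math.hypot(*direction)
print(origin, direction, bond_length)
```

For the idealized backbone used here, the resulting vector has a length close to that of a typical carbon-carbon single bond, which provides a simple sanity check on the placement.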
The usefulness of Cα–Cβ vectors is that they provide a simplification of 3D structure. Therefore, only the amino acid side-chains of a functional region of a protein need be represented by the Cα–Cβ vector map, thereby excluding the substantial portion of the protein(s) not directly involved in that particular function. For the purposes of database searching, Cα–Cβ vectors are ideal, as they constitute the basic 3D structural information needed.
After identification of Cα–Cβ vectors corresponding to a protein or a functional region thereof, the parameters that characterize each vector must be stored in a database in such a way that retrieval in response to a query can be made quickly. A number of options are available for suitable representation of Cα–Cβ vectors, whether as a database entry or as a query:
(A) as a distance matrix;
(B) as a dihedral angle (δ) formed between respective Cα–Cβ vectors;
(C) as angles α1 and α2 formed between respective Cα–Cβ vectors.
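The representations listed above can be sketched for a single pair of Cα–Cβ vectors. The sketch assumes one possible convention: d is the Cα to Cα distance, α1 and α2 are the angles between each vector and the line joining the two α carbons, and δ is the dihedral angle Cβ1–Cα1–Cα2–Cβ2. This convention, the coordinates and the function names are assumptions for illustration, and the definitions in the cited reference may differ in detail.

```python
import math

def sub(a, b):
    return tuple(a[i] - b[i] for i in range(3))

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def angle_deg(u, v):
    """Angle between two vectors in degrees, clamped against rounding error."""
    cosine = dot(u, v) / (math.hypot(*u) * math.hypot(*v))
    return math.degrees(math.acos(max(-1.0, min(1.0, cosine))))

def dihedral_deg(p0, p1, p2, p3):
    """Signed dihedral angle p0-p1-p2-p3 in degrees (sign is convention-dependent)."""
    b0, b1, b2 = sub(p0, p1), sub(p2, p1), sub(p3, p2)
    length = math.hypot(*b1)
    axis = tuple(x / length for x in b1)
    # Project the two outer bonds onto the plane perpendicular to the axis.
    v = tuple(b0[i] - dot(b0, axis) * axis[i] for i in range(3))
    w = tuple(b2[i] - dot(b2, axis) * axis[i] for i in range(3))
    return math.degrees(math.atan2(dot(cross(axis, v), w), dot(v, w)))

def pair_descriptor(ca1, cb1, ca2, cb2):
    """Describe the relative geometry of two C-alpha to C-beta vectors."""
    return {
        "d": math.dist(ca1, ca2),
        "alpha1": angle_deg(sub(cb1, ca1), sub(ca2, ca1)),
        "alpha2": angle_deg(sub(cb2, ca2), sub(ca1, ca2)),
        "delta": dihedral_deg(cb1, ca1, ca2, cb2),
    }

# Hypothetical pair of vectors, 5 units apart and mutually perpendicular.
descriptor = pair_descriptor((0.0, 0.0, 0.0), (0.0, 1.53, 0.0),
                             (5.0, 0.0, 0.0), (5.0, 0.0, 1.53))
print(descriptor)
```

Each of the four values is independent of the frame of reference, so a descriptor computed for a query can be compared directly with descriptors stored for database entries.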
A simple explanation of these representations is provided in Lauri & Bartlett, 1994, supra, which is hereinafter incorporated by reference. The key to successful database searching is speed and efficiency. Thus, computer search algorithms have been developed which use a strategy whereby the vast majority of entries in the database are eliminated in a preliminary screening step.
These algorithms are demanding of computer resources, and therefore a search is normally effected in two stages:
(1) a screening search to eliminate entries that cannot possibly constitute a hit; and
(2) an atom-by-atom comparison of a query with each entry not eliminated in (1), to identify one or more hits.
The search in (1) could screen entries based on geometric attributes of the query (Lesk, 1979, Commun. ACM 22 219), interatomic distances and atom types (Jakes & Willett, 1986, supra), aromaticity, hybridization, connectivity, charge, position of lone pair electrons, or centre of mass of ring structures (Sheridan et al., 1989, Proc. Natl. Acad. Sci. 86 8165). This screening process would eliminate entries that have no chance of meeting the 3D constraints of the query.
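The two-stage strategy can be sketched as follows, using interatomic distances alone as the screening attribute. This is a minimal illustration under stated assumptions and is not the method of any of the cited programs: the function names, the tolerance and the three database entries are hypothetical, and stage (2) is implemented as an exhaustive assignment that is practical only for very small queries.

```python
import math
from itertools import combinations, permutations

def pair_distances(points):
    """All pairwise distances within a set of points."""
    return [math.dist(p, q) for p, q in combinations(points, 2)]

def passes_screen(query, entry, tol):
    """Stage 1: every query distance must occur somewhere in the entry.

    This is a necessary condition only, so it never rejects a true hit.
    """
    entry_d = pair_distances(entry)
    return all(any(abs(dq - de) <= tol for de in entry_d)
               for dq in pair_distances(query))

def find_match(query, entry, tol):
    """Stage 2: try each ordered assignment of entry atoms to query atoms."""
    n = len(query)
    for chosen in permutations(range(len(entry)), n):
        if all(abs(math.dist(query[i], query[j])
                   - math.dist(entry[chosen[i]], entry[chosen[j]])) <= tol
               for i, j in combinations(range(n), 2)):
            return chosen
    return None

def search(query, database, tol=0.2):
    """Return {entry name: matched entry atom indices} for every hit."""
    hits = {}
    for name, entry in database.items():
        if not passes_screen(query, entry, tol):
            continue  # eliminated cheaply, without an atom-by-atom comparison
        match = find_match(query, entry, tol)
        if match is not None:
            hits[name] = match
    return hits

# Hypothetical query (a 3-4-5 triangle) and a three-entry database.
query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
database = {
    "entry_a": [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0),
                (10.0, 14.0, 10.0), (20.0, 20.0, 20.0)],
    "entry_b": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    "entry_c": [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0),
                (7.0, 0.0, 0.0), (12.0, 0.0, 0.0)],
}
hits = search(query, database)
print(hits)
```

In this example, entry_b is eliminated by the screen, entry_c passes the screen because it contains each required distance individually but fails the atom-by-atom comparison, and only entry_a is reported as a hit.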
This strategy, although quick, requires that for an entry to register as a hit, it must comprise every specified query component. As the number of query components increases, the number of near misses increases and the likelihood of finding a hit decreases.
A more useful search strategy which assesses the relative merits of each near miss as well as each hit has recently been provided by the search program FOUNDATION (Ho & Marshall, 1993, supra). FOUNDATION uses a clique-detection algorithm (various algorithms are reviewed and compared in Brint & Willett, 1987, J. Mol. Graphics 5 49 and Brint & Willett, 1987, Chem. Inf. Comput. Sci. 27 152) which searches a 3D database of entries for a user-defined query consisting of the coordinates of various atoms and/or bonds of a 3D structural feature. FOUNDATION identifies all possible entries that contain any combination of a user-specified minimum number of matching atoms and/or bonds as hits.
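The general idea of clique detection applied to partial matching can be sketched as follows. This is a generic illustration using the basic Bron-Kerbosch procedure on a correspondence graph, and it is not a reproduction of the FOUNDATION program. The function names, the tolerance and the coordinates are hypothetical.

```python
import math
from itertools import combinations

def correspondence_graph(query, entry, tol):
    """Nodes pair a query atom with an entry atom; edges join compatible pairs.

    Two nodes (i, j) and (k, l) are compatible when they involve different
    atoms and the query distance i-k agrees with the entry distance j-l.
    """
    nodes = [(i, j) for i in range(len(query)) for j in range(len(entry))]
    adjacency = {node: set() for node in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i == k or j == l:
            continue
        if abs(math.dist(query[i], query[k])
               - math.dist(entry[j], entry[l])) <= tol:
            adjacency[(i, j)].add((k, l))
            adjacency[(k, l)].add((i, j))
    return adjacency

def maximal_cliques(adjacency):
    """Enumerate maximal cliques with the basic Bron-Kerbosch recursion."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(current)
            return
        for node in list(candidates):
            expand(current | {node},
                   candidates & adjacency[node],
                   excluded & adjacency[node])
            candidates = candidates - {node}
            excluded = excluded | {node}

    expand(set(), set(adjacency), set())
    return cliques

def partial_matches(query, entry, min_match, tol=0.2):
    """Report every maximal match of at least min_match query atoms."""
    adjacency = correspondence_graph(query, entry, tol)
    return [sorted(clique) for clique in maximal_cliques(adjacency)
            if len(clique) >= min_match]

# Hypothetical four-atom query; the entry contains only three of the atoms.
query = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0), (0.0, 0.0, 6.0)]
entry = [(10.0, 10.0, 10.0), (13.0, 10.0, 10.0), (10.0, 14.0, 10.0),
         (30.0, 30.0, 30.0), (-20.0, 5.0, 40.0)]

near_misses = partial_matches(query, entry, min_match=3)
full_hits = partial_matches(query, entry, min_match=4)
print(near_misses, full_hits)
```

Here the entry would be rejected by a search requiring all four query atoms, but is retained as a three-atom match when the user-specified minimum is lowered to three. Because the comparison is made on distances alone, such a match does not distinguish an arrangement from its mirror image.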
Despite the usefulness of 3D database searching as a means of identifying structurally related proteins, this approach has not been well utilized with respect to engineering proteins with a desired function.