Protein insolubility constitutes a significant problem in basic and applied bioscience, in many situations limiting the rate of progress in these areas. Protein folding and solubility have been the subject of considerable theoretical and empirical research. However, there still exists no general method for improving intrinsic protein folding and solubility. Such a method would greatly facilitate protein structure-function studies, drug design, de novo peptide and protein design and associated structure-function studies, industrial process optimization using bioreactors and microorganisms, and many disciplines in which a process or application depends on the ability to tailor or improve the solubility of proteins, screen or modify the solubility of large numbers of unique proteins about which little or no structure-function information is available, or adapt the solubility of proteins to new environments when the structure and function of the protein(s) are poorly understood or unknown.
Overexpression of cloned genes using an expression host, for example E. coli, is the principal method of obtaining proteins for most applications. Unfortunately, many such cloned foreign proteins are poorly folded, insoluble and/or unstable when overexpressed. There are two sets of approaches currently in use which deal with such proteins. One set of approaches modifies the environment of the protein in vivo and/or in vitro. For example, proteins may be expressed as fusions with more soluble proteins, or directed to specific cellular locations. Chaperones may be coexpressed to assist folding pathways. Insoluble proteins may be purified from inclusion bodies using denaturants and the protein subsequently refolded in the absence of the denaturant. Modified growth media and/or growth conditions can sometimes improve the folding and solubility of a foreign protein. However, these methods are frequently cumbersome, unreliable, ineffective, or lack generality. A second set of approaches changes the sequence of the expressed protein. Rational approaches employ site-directed mutation of key residues to improve protein stability and solubility. Alternatively, a smaller, more soluble fragment of the protein may be expressed. These approaches require a priori knowledge about the structure of the protein, knowledge which is generally unavailable when the protein is insoluble. Furthermore, rational design approaches are best applied when the problem involves only a small number of amino-acid changes. Finally, even when the structure is known, the changes required to improve solubility may be unclear. Thus, many thousands of possible combinations of mutations may have to be investigated, leading to what is essentially an “irrational” or random mutagenesis approach. Such an approach requires a method for rapidly determining the solubility of each version.
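The scale of such a search can be illustrated by simple counting. The following Python sketch assumes a hypothetical 250-residue protein and counts the sequence variants that carry exactly k substitutions, allowing 19 alternative amino acids at each substituted position. The function name and the protein length are illustrative assumptions and are not taken from any of the cited work.

```python
from math import comb


def count_variants(length: int, k: int, alternatives: int = 19) -> int:
    """Count sequences that differ from a parent at exactly k positions.

    length       -- number of residues in the parent protein (assumed value)
    k            -- number of positions substituted simultaneously
    alternatives -- other amino acids available at each position (20 - 1)
    """
    # Choose which k positions change, then choose a replacement for each.
    return comb(length, k) * alternatives ** k


if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"{k} substitution(s): {count_variants(250, k)} variants")
```

Under these assumptions, single substitutions alone give several thousand variants and double substitutions give more than ten million, which is why a rapid solubility readout is needed for each version.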
Random or “irrational” mutagenesis redesign of protein solubility carries the possibility that the native function of the protein may be destroyed or modified by the inadvertent mutation of residues which are important for function, but not necessarily related to solubility. However, protein solubility is strongly influenced by interaction with the environment through surface amino acid residues, while catalytic activities and/or small substrate recognition often involve partially buried or cleft residues distant from the surface residues. Thus, in many situations, rational mutation of proteins has demonstrated that the solubility of a protein can be modified without destroying the native function of the protein. Modification of the function of a protein without affecting its solubility has also been frequently observed. Furthermore, spontaneous mutants bearing only 1 or 2 point mutations, which converted a previously insoluble protein into a soluble one, have been serendipitously isolated. This suggests that the solubility of a protein can be optimized with a low level of mutation and that protein function can be maintained independently of enhancements or modifications to solubility. Furthermore, a screen for function may be applied concomitantly after each round of solubility selection during the directed evolution process.
In the absence of a screen for function, for example when the function is unknown, the final version of the protein can be backcrossed against the wild type in vitro to remove nonessential mutations. This approach has been successfully applied by Stemmer in “Rapid Evolution Of A Protein In Vitro By DNA Shuffling,” by W. P. C. Stemmer, Nature 370, 389 (1994), and in “DNA Shuffling By Random Fragmentation And Reassembly: In Vitro Recombination For Molecular Evolution,” by W. P. C. Stemmer, Proc. Natl. Acad. Sci. USA 91, 10747 (1994) to problems in which the function of a protein had been optimized and it was desired to remove nonessential mutations accumulated during directed evolution. The development of highly specialized protein variants by directed, in vitro evolution, which exerts unidirectional selection pressure on organisms, is further discussed in: “Searching Sequence Space: Using Recombination To Search More Efficiently And Thoroughly Instead Of Making Bigger Combinatorial Libraries,” by Willem P. C. Stemmer, Biotechnology 13, 549 (1995); in “Directed Evolution: Creating Biocatalysts For The Future,” by Frances H. Arnold, Chemical Engineering Science 51, 5091 (1996); in “Directed Evolution Of A Fucosidase From A Galactosidase By DNA Shuffling And Screening,” by Ji-Hu Zhang et al., Proc. Natl. Acad. Sci. USA 94, 4504 (1997); in “Functional And Nonfunctional Mutations Distinguished By Random Combination Of Homologous Genes,” by Huimin Zhao and Frances H. Arnold, Proc. Natl. Acad. Sci. USA 94, 7007 (1997); and in “Strategies For The In Vitro Evolution of Protein Function: Enzyme Evolution By Random Recombination of Improved Sequences,” by Jeff Moore et al., J. Mol. Biol. 272, 336-346 (1997). Therein, efficient strategies for engineering new proteins by multiple generations of random mutagenesis and recombination coupled with screening for improved variants are described.
To use directed evolution to improve the folding of a protein of interest, the protein can be fused to a folding reporter. When poorly folding proteins are fused to such folding reporters, they adversely affect the function of the reporter proteins to which they are fused, by trapping them in aggregated or unfolded non-functional states. When the folding reporter has an easily identifiable phenotype, such as antibiotic resistance [1-6], fluorescence [7] or color complementation [8], it is relatively straightforward to use that phenotype to identify or select bacteria expressing protein fragments which are soluble and well folded following directed evolution. This approach has been applied to the selection of mutated versions of naturally insoluble or poorly expressed proteins [4,7,9].
GFP and its numerous related fluorescent proteins are now in widespread use as protein tagging agents (for review, see Verkhusha et al., 2003, GFP-like fluorescent proteins and chromoproteins of the class Anthozoa. In: Protein Structures: Kaleidoscope of Structural Properties and Functions, Ch. 18, pp. 405-439, Research Signpost, Kerala, India). GFP-like proteins are an expanding family of homologous, 25-30 kDa polypeptides sharing a conserved 11 beta-strand “barrel” structure. The GFP-like protein family currently comprises well over 100 members, cloned from various Anthozoa and Hydrozoa species, and includes red, yellow and green fluorescent proteins and a variety of non-fluorescent chromoproteins. A wide variety of fluorescent protein labeling assays and kits are commercially available, encompassing a broad spectrum of GFP spectral variants and GFP-like fluorescent proteins, including DsRed and other red fluorescent proteins (Clontech, Palo Alto, Calif.; Amersham, Piscataway, N.J.).
Wild-type green fluorescent protein (GFP), cloned from Aequorea victoria, normally misfolds and is poorly fluorescent when overexpressed in the heterologous host E. coli. It is found predominantly in the inclusion body fraction of cell lysates. The misfolding is incompletely understood, but is thought to result from the increased expression level or rate in E. coli, or the inadequacy of the bacterial chaperone and related folding machinery under conditions of overexpression. The folding yield also decreases dramatically at higher temperatures (37° C. vs. 27° C.). This wild-type GFP is a very poor folder, as it is extremely sensitive to the expression environment.
DNA shuffling has been used to obtain a GFP mutant having a whole cell fluorescence 45 times greater than the standard, commercially available plasmid GFP. See, e.g., “Improved Green Fluorescent Protein By Molecular Evolution Using DNA Shuffling,” by Andreas Crameri et al., Nature Biotechnology 14, 315 (1996). The screening process optimizes the function of GFP (green fluorescence), and thus uses a functional screen. The screening process coincidentally optimizes the solubility of the GFP, in that the GFP is only fluorescent when properly folded, this being the basis for the use of GFP as a folding reporter.
It has been demonstrated that improving the apparent functionality of a protein can sometimes increase the concomitant solubility of the protein, as in: “Redesigning enzyme topology by directed evolution,” by G. Macbeath, P. Kast, and D. Hilvert, Science 279, 1958-1961 (1998); “Expression of an antibody fragment at high levels in the bacterial cytoplasm,” by P. Martineau, P. Jones, and G. Winter, J. Mol. Biol. 280, 117-127 (1998); “Antibody scFv fragments without disulfide bonds made by molecular evolution,” by K. Proba, A. Worn, A. Honegger, and A. Pluckthun, J. Mol. Biol. 275, 245-253 (1998); and “Functional Expression of Horseradish Peroxidase in E. coli by Directed Evolution,” by Lin Zhanglin, Todd Thorsen, and Frances H. Arnold, Biotechnol. Prog. 15, 467-471 (1999). In each case, the driving force for the directed evolution was the functionality of the protein of interest. For example, if the protein was an enzyme, the assay for improved function was the turnover of a chromogenic analog of the enzyme's natural substrate; if the protein was an antibody, it was the recognition of the target antigen by the antibody.
For cytoplasmic expression of antibodies, the recognition was linked to cell survival (binding of the antibody to a selectable protein marker, which was an antigen for the antibody of interest, provided selection for functional antibodies); in the case of phage-displayed antibodies without disulfide bonds, the recognition was transduced to successful binding of the displayed phage to the target antigen of the displayed antibody in a biopanning protocol. An apparent increase in the amount of protein expressed in the soluble fraction relative to the unselected target proteins was noted upon expression of the proteins in E. coli. The apparent increase in activity of desirable mutants during the evolution was due at least in part to an increase in the number of correctly folded (and hence functional) protein molecules, and not exclusively to an increase in the specific activity of a given protein molecule. However, the driving force for the selection or screening process during the directed evolution depended on the functionality of (and a functional assay for) the protein of interest.
Many proteins have no easily detectable functional assay, and thus identification of proteins with improved folding yield by an increase in apparent activity due to a larger number of correctly folded molecules is not a general method for improving folding by directed evolution. Furthermore, even when functional assays are available, apparent increases in activity can also be due to increases in the specific activity (activity of an individual protein molecule) even when the total number of correctly folded molecules remains the same. Thus, increases in apparent activity do not necessarily translate to increases in the solubility of proteins. Furthermore, functional assays are protein-specific, and thus must be developed on a case-by-case basis for each new protein. Functional assays therefore lack the generality needed to identify proteins which are soluble, or to find genetic variants (mutants and fragments) of proteins with improved solubility, in a high-throughput manner for proteomics or functional genomics, wherein large numbers of different proteins about which little or no functional/structural information is known are to be solubly expressed.
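The ambiguity of apparent activity as a solubility readout can be shown with a small numerical sketch. The Python example below treats apparent activity as the number of correctly folded molecules multiplied by the specific activity per molecule. The variant names and all numerical values are hypothetical and chosen only for illustration.

```python
def apparent_activity(folded_molecules: int, specific_activity: float) -> float:
    """Bulk activity observed in an assay (arbitrary units).

    folded_molecules  -- count of correctly folded, soluble molecules
    specific_activity -- activity contributed by each folded molecule
    """
    return folded_molecules * specific_activity


# Two hypothetical variants (illustrative numbers, not measured data).
# Variant A: few folded molecules, each highly active.
variant_a = apparent_activity(folded_molecules=100, specific_activity=4.0)
# Variant B: four times as many folded molecules, each less active.
variant_b = apparent_activity(folded_molecules=400, specific_activity=1.0)

if __name__ == "__main__":
    # Both variants score the same in a functional screen, although
    # variant B has four times as many correctly folded molecules.
    print(variant_a, variant_b)
```

Under these assumed values, a functional screen cannot distinguish the two variants, whereas a readout that depends only on the number of folded molecules would.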
A number of different methods have been developed to create thermostable proteins, most of which involve the creation of libraries and the identification of improved proteins by selection or screening. Conceptually, the most straightforward way to identify proteins with improved thermostability has been to apply a thermal challenge to a collection of individual clones and test the remaining functionality of the clones, repeating this process, if necessary, to combine useful mutations [10,11]. A similar method, which does not rely on such extensive screening requirements, involves direct selection of clones growing at elevated temperature within thermophilic bacteria. However, to date, this method has only been applied to the selection of thermophilic antibiotic resistance proteins [13,14], and as laboratory organisms typically do not grow at elevated temperatures, it has been difficult to generalize. As a result, considerable effort has been put into the development of alternative approaches which involve selection or screening for biophysical or biological properties which can serve as surrogates for, and are often correlated with, thermostability.
One of the first examples of this approach is the PROSIDE (protein stability increased by directed evolution) [15-22] approach, in which resistance to protease digestion is used as the surrogate property for protein stability, with filamentous phage infectivity being the selection modality. Proteins under test are expressed between two domains in g3p (the phage receptor for bacteria): if they are cleaved by protease, the filamentous phage loses the N-terminal g3p domain and consequently its ability to infect; if the protein is protease resistant, infectivity is maintained. This has been successfully used to increase the stability of the beta1 domain of protein G [17], the cold shock protein of B. subtilis [19] and ribonuclease T1 [15]. In another approach involving directed evolution, Shusta et al. showed that the display levels of heterologous proteins on the surface of yeast correlated with expression levels and thermal stability [23], although exceptions to this have been recently described [24].
Consensus engineering [25,26] is an approach to increasing protein stability which relies not on directed evolution, but on the informational content of aligned sequences. By modifying a sequence so that it more closely resembles a consensus derived from the alignment of numerous proteins of a particular family, it has been found that significant increases in stability can be obtained. This has been applied to antibodies and antibody fragments [26-34], GroEL minichaperones [35-36], p53 [37], WW [38] and SH3 domains [39]. More recently, consensus engineering has been applied to the creation of novel proteins, rather than the stepwise modification of pre-existing ones to resemble a consensus. Perhaps the most striking success was the application to phytases [40-42], in which a final protein with a Tm of 90.4° C. was obtained: 52° C. greater than the best component parental sequence [43]. Similar stability was obtained with a consensus ankyrin sequence based on the alignment of 2000 different ankyrins [44-46]. We recently applied this method to the creation of a consensus green protein (CGP) [47]. Although we obtained a functional fluorescent protein, its Tm was 5° C. less than the monomeric Azami Green [48] used to identify the sequences comprising the consensus. However, in this case no effort was made to examine the effects of individual mutations, and it is likely that some of the consensus mutations were destabilizing, as had been previously shown for the phytase [40-43].
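The core idea of deriving a consensus from aligned sequences can be sketched in a few lines of Python. The example below is a minimal illustration using a simple per-column majority vote over a pre-aligned set of sequences; it is not the procedure used in the cited studies, which may apply sequence weighting, frequency thresholds or other refinements. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def consensus_sequence(aligned: list[str], gap: str = "-") -> str:
    """Return the most frequent residue in each alignment column.

    Gaps are ignored when counting; a column made only of gaps yields a gap,
    so the consensus keeps the alignment coordinates. Ties go to the residue
    seen first in the column (Counter preserves insertion order).
    """
    if not aligned:
        raise ValueError("alignment is empty")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        consensus.append(counts.most_common(1)[0][0] if counts else gap)
    return "".join(consensus)


def consensus_mutations(parent: str, consensus: str, gap: str = "-"):
    """List (parent residue, 1-based alignment column, consensus residue)
    for every column where the aligned parent differs from the consensus."""
    return [
        (p, i, c)
        for i, (p, c) in enumerate(zip(parent, consensus), start=1)
        if p != c and gap not in (p, c)
    ]


if __name__ == "__main__":
    # Toy alignment, invented for illustration only.
    family = ["MKTAY", "MKSAY", "MRTAY", "MKT-F"]
    cons = consensus_sequence(family)
    print(cons)
    print(consensus_mutations("MRTAY", cons))
```

Each tuple returned by `consensus_mutations` is a candidate substitution that would move the parent toward the consensus. As noted above for the phytase and CGP cases, individual consensus substitutions can be destabilizing, so such a list is a starting point for testing rather than a guaranteed set of improvements.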
Other methods used to increase protein stability, relying heavily on structural information, include “helix capping” [49-52] or optimization [53-55], the introduction of salt bridges or their replacement by hydrophobic interactions [56-62], the introduction of clusters of aromatic-aromatic interactions [63-65] and rigidification strategies, in which disulfide bonds or glycine to alanine, or Xaa to proline changes are introduced [66-68]. However, most of these have been carried out on model structures, and none has been widely adopted.
Thermostabilization of proteins is regarded as important in a number of biotechnological and pharmaceutical applications. Within the context of industrial enzymes, thermostability leads to longer enzyme survival times, as well as more efficient reactions at higher temperatures and diminished microbial contamination, all of which result in diminished costs. In the pharmaceutical arena, thermostability of protein therapeutics leads to longer half-lives and more effective drugs [69-71]. Thermostability has also been regarded as important in the use of proteins as scaffolds to generate libraries of specific binders. It has been reasoned that if a starting scaffold is more stable, it will be more tolerant of the destabilizing effects of mutations, or insertions, used to mediate binding. This has been shown for affinity reagents based on ankyrins [72], and has also been applied to the creation of phage antibody libraries [30]. Finally, proteins of increased thermostability are more resistant to mutations than the protein from which they are derived, promoting evolvability by providing greater permissivity to mutations leading to novel functions [73,74].