Protein insolubility constitutes a significant problem in basic and applied bioscience, in many situations limiting the rate of progress in these areas. Protein folding and solubility have been the subject of considerable theoretical and empirical research. However, there still exists no general method for improving intrinsic protein solubility. Such a method would greatly facilitate protein structure-function studies, drug design, de novo peptide and protein design and associated structure-function studies, industrial process optimization using bioreactors and microorganisms, and many disciplines in which a process or application depends on the ability to tailor or improve the solubility of proteins, screen or modify the solubility of large numbers of unique proteins about which little or no structure-function information is available, or adapt the solubility of proteins to new environments when the structure and function of the protein(s) are poorly understood or unknown.
Overexpression of cloned genes using an expression host, for example E. coli, is the principal method of obtaining proteins for most applications. Unfortunately, many such cloned foreign proteins are insoluble or unstable when overexpressed. There are two sets of approaches currently in use which deal with such insoluble proteins. One set of approaches modifies the environment of the protein in vivo and/or in vitro. For example, proteins may be expressed as fusions with more soluble proteins, or directed to specific cellular locations. Chaperones may be coexpressed to assist folding pathways. Insoluble proteins may be purified from inclusion bodies using denaturants and the protein subsequently refolded in the absence of the denaturant. Modified growth media and/or growth conditions can sometimes improve the folding and solubility of a foreign protein. However, these methods are frequently cumbersome, unreliable, ineffective, or lack generality. A second set of approaches changes the sequence of the expressed protein. Rational approaches employ site-directed mutation of key residues to improve protein stability and solubility. Alternatively, a smaller, more soluble fragment of the protein may be expressed. These approaches require a priori knowledge about the structure of the protein, knowledge which is generally unavailable when the protein is insoluble. Furthermore, rational design approaches are best applied when the problem involves only a small number of amino-acid changes. Finally, even when the structure is known, the changes required to improve solubility may be unclear. Thus, many thousands of possible combinations of mutations may have to be investigated, leading to what is essentially an “irrational” or random mutagenesis approach. Such an approach requires a method for rapidly determining the solubility of each version.
Random or “irrational” mutagenesis redesign of protein solubility carries the possibility that the native function of the protein may be destroyed or modified by the inadvertent mutation of residues which are important for function, but not necessarily related to solubility. However, protein solubility is strongly influenced by interaction with the environment through surface amino acid residues, while catalytic activities and/or small substrate recognition often involve partially buried or cleft residues distant from the surface residues. Thus, in many situations, rational mutation of proteins has demonstrated that the solubility of a protein can be modified without destroying the native function of the protein. Modification of the function of a protein without affecting its solubility has also been frequently observed. Furthermore, spontaneous mutants of proteins bearing only 1 or 2 point mutations have been serendipitously isolated which have converted a previously insoluble protein into a soluble one. This suggests that the solubility of a protein can be optimized with a low level of mutation and that protein function can be maintained independently of enhancements or modifications to solubility. Furthermore, a screen for function may be applied after each round of solubility selection during the directed evolution process.
In the absence of a screen for function, for example when the function is unknown, the final version of the protein can be backcrossed against the wild type in vitro to remove nonessential mutations. This approach has been successfully applied by Stemmer in “Rapid Evolution Of A Protein In Vitro By DNA Shuffling,” by W. P. C. Stemmer, Nature 370, 389 (1994), and in “DNA Shuffling By Random Fragmentation And Reassembly: In Vitro Recombination For Molecular Evolution,” by W. P. C. Stemmer, Proc. Natl. Acad. Sci. USA 91, 10747 (1994) to problems in which the function of a protein had been optimized and it was desired to remove nonessential mutations accumulated during directed evolution. The development of highly specialized protein variants by directed, in vitro evolution, which exerts unidirectional selection pressure on organisms, is further discussed in: “Searching Sequence Space: Using Recombination To Search More Efficiently And Thoroughly Instead Of Making Bigger Combinatorial Libraries,” by Willem P. C. Stemmer, Biotechnology 13, 549 (1995); in “Directed Evolution: Creating Biocatalysts For The Future,” by Frances H. Arnold, Chemical Engineering Science 51, 5091 (1996); in “Directed Evolution Of A Fucosidase From A Galactosidase By DNA Shuffling And Screening,” by Ji-Hu Zhang et al., Proc. Natl. Acad. Sci. USA 94, 4504 (1997); in “Functional And Nonfunctional Mutations Distinguished By Random Combination Of Homologous Genes,” by Huimin Zhao and Frances H. Arnold, Proc. Natl. Acad. Sci. USA 94, 7007 (1997); and in “Strategies For The In Vitro Evolution of Protein Function: Enzyme Evolution By Random Recombination of Improved Sequences,” by Jeff Moore et al., J. Mol. Biol. 272, 336-346 (1997). Therein, efficient strategies for engineering new proteins by multiple generations of random mutagenesis and recombination coupled with screening for improved variants are described.
However, there are no teachings concerning the use of directed evolutionary processes to improve solubility of proteins; rather, the mutagenesis was directed to improvement of protein function. It should be mentioned, however, that in order for the protein to function properly in any environment, it must at least be correctly folded.
Finally, for structural determination it is often not necessary or even desirable to have a fully functional version of the protein. If the mutational rate is low (ensured by molecular backcrossing), it is likely that the structure of the wild-type and solubility optimized versions of a protein will be similar. As long as the protein is soluble, and a structure can be obtained, it should then be possible to redesign the solubility of the protein using rational methods, if desired.
Wild type green fluorescent protein (GFP), cloned from Aequorea victoria, normally misfolds and is poorly fluorescent when overexpressed in the heterologous host E. coli. It is found predominantly in the inclusion body fraction of cell lysates. The misfolding is incompletely understood, but is thought to result from the increased expression level or rate in E. coli, or the inadequacy of the bacterial chaperone and related folding machinery under conditions of overexpression. The folding yield also decreases dramatically at higher temperatures (37° C. vs. 27° C.). This wild type GFP is a very poor folder, as it is extremely sensitive to the expression environment.
Green fluorescent protein has become a widely used reporter of gene expression and regulation. DNA shuffling has been used to obtain a mutant having a whole cell fluorescence 45-times greater than the standard, commercially available plasmid GFP. See, e.g., “Improved Green Fluorescent Protein By Molecular Evolution Using DNA Shuffling,” by Andreas Crameri et al., Nature Biotechnology 14, 315 (1996). The screening process optimizes the function of GFP (green fluorescence), and thus uses a functional screen. Although the screening process coincidentally optimizes the solubility of the GFP, in that the GFP is only fluorescent when properly folded, there is no mention of using soluble GFP as a tag to monitor solubility of other proteins; that is, the function of the protein and not its solubility is being modified. In “Wavelength Mutations And Post-translational Auto-oxidation Of Green Fluorescent Protein,” by Roger Heim et al., Proc. Natl. Acad. Sci. USA 91, 12501 (1994), GFP was mutagenized and screened for variants with altered absorption or emission spectra. The authors mention that in place of proteins labeled with fluorescent tags to detect location and sometimes their conformational changes both in vitro and in intact cells, a possible strategy would be to concatenate the gene for the nonfluorescent protein of interest with the gene for a naturally fluorescent protein and express the fusion product. However, the focus of this paper is the extension of the usefulness of GFP by enabling visualization of differential gene expression and protein localization and measurement of protein association by fluorescence resonance energy transfer, by making available two visibly distinct colors. There is no mention of the use of the gene construct for solubility determinations. The paper further discusses the expression of GFP in E. coli under the control of a T7 promoter, and that the bacteria contained inclusion bodies consisting of protein indistinguishable from jellyfish or soluble recombinant protein on denaturing gels, but that this material was completely nonfluorescent, lacked the visible absorbance bands of the chromophore, and did not become fluorescent when solubilized and subjected to protocols that renature GFP, as opposed to the soluble GFP in the bacteria which undergoes correct folding and, therefore, fluoresces.
Chun Wu et al. in “Novel Green Fluorescent Protein (GFP) Baculovirus Expression Vectors,” Gene 190, 157 (1997), describe the construction of baculovirus expression vectors which contain GFP as a reporter gene. The authors follow the production and purification of a protein of interest by in-frame cloning of the gene that expresses the protein in insect cells with the GFP open reading frame, thereby permitting visualization of the produced GFP-fusion protein using UV light. However, the purified GFP-XylE fusion protein was found to be insoluble after harvest. The authors did not correlate the level of fluorescence of the cells expressing the GFP-XylE fusion protein with the solubility of the XylE protein expressed alone. Therefore, this reference does not teach the use of the fusion protein fluorescence as an indicator of the solubility of the specific protein XylE or of the solubility of other proteins.
In “Application Of A Chimeric Green Fluorescent Protein To Study Protein-Protein Interactions,” by N. Garamszegi et al., Biotechniques 23, 864 (1997), the authors discuss the fusion between GFP and human calmodulin-like protein (CLP) and show that this protein retains fluorescence and the known characteristics of CLP. That is, the GFP portion remains responsible for efficient fluorescent signals with little or no influence on the properties of the fused protein of interest. The authors maintain that the exhibited GFP fluorescence provides information concerning the maintenance of the GFP structural integrity in the chimeric protein, but does not provide information about the integrity of the entire fusion protein and, in particular, does not allow any statements concerning the maintenance of CLP function or integrity. From these statements, it is clear that this paper does not contemplate the use of the GFP as a solubility reporter for the CLP.
It has been demonstrated that improving the apparent functionality of a protein can sometimes concomitantly increase the solubility of the protein, as in: “Redesigning enzyme topology by directed evolution,” by G. Macbeath, P. Kast, and D. Hilvert, Science 279, 1958-1961 (1998); “Expression of an antibody fragment at high levels in the bacterial cytoplasm,” by P. Martineau, P. Jones, and G. Winter, J. Mol. Biol. 280, 117-127 (1998); “Antibody scFv fragments without disulfide bonds made by molecular evolution,” by K. Proba, A. Worn, A. Honegger, and A. Pluckthun, J. Mol. Biol. 275, 245-253 (1998); and “Functional Expression of Horseradish Peroxidase in E. coli by Directed Evolution,” by Lin Zhanglin, Todd Thorsen, and Frances H. Arnold, Biotechnol. Prog. 15, 467-471 (1999). In each case, the driving force for the directed evolution was the functionality of the protein of interest. For example, if the protein was an enzyme, the assay for improved function was the turnover of a chromogenic analog of the enzyme's natural substrate; if the protein was an antibody, it was the recognition of the target antigen by the antibody.
For cytoplasmic expression of antibodies, the recognition was linked to cell survival (binding of the antibody to a selectable protein marker, which was an antigen for the antibody of interest, provided selection for functional antibodies); in the case of phage-displayed antibodies without disulfide bonds, the recognition was transduced to successful binding of the displayed phage to the target antigen of the displayed antibody in a biopanning protocol. An apparent increase in the amount of protein expressed in the soluble fraction relative to the unselected target proteins was noted upon expression of the proteins in E. coli. The apparent increase in activity of desirable mutants during the evolution was due at least in part to an increase in the number of correctly folded (and hence functional) protein molecules, and not exclusively to an increase in the specific activity of a given protein molecule. However, the driving force for the selection or screening process during the directed evolution depended on the functionality of, and a functional assay for, the protein of interest.
Many proteins have no easily detectable functional assay, and thus identification of proteins with improved folding yield by an increase in apparent activity due to a larger number of correctly folded molecules is not a general method for improving folding by directed evolution. Furthermore, even when functional assays are available, apparent increases in activity can also be due to increases in the specific activity (activity of an individual protein molecule) even when the total number of correctly folded molecules remains the same. Thus, increases in apparent activity do not necessarily translate to increases in the solubility of proteins. Furthermore, functional assays are protein-specific, and thus must be developed on a case-by-case basis for each new protein. Functional assays therefore lack the generality needed to identify proteins which are soluble, or to find genetic variants (mutants and fragments) of proteins with improved solubility, in a high-throughput manner for proteomics or functional genomics, wherein large numbers of different proteins about which little or no functional or structural information is available are to be solubly expressed.
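The ambiguity described in this paragraph can be written as a simple relation. This is only an illustrative sketch: the symbols below are assumptions introduced here and do not appear in the cited references. Let $A_{\mathrm{app}}$ denote the apparent (measured) activity, $a_{\mathrm{spec}}$ the specific activity of one correctly folded molecule, and $N_{\mathrm{folded}}$ the number of correctly folded molecules.

```latex
% Apparent activity as the product of per-molecule activity and folded-molecule count
A_{\mathrm{app}} = a_{\mathrm{spec}} \, N_{\mathrm{folded}}

% Comparing a mutant (primed quantities) with its parent:
\frac{A'_{\mathrm{app}}}{A_{\mathrm{app}}}
  = \frac{a'_{\mathrm{spec}}}{a_{\mathrm{spec}}}
    \cdot \frac{N'_{\mathrm{folded}}}{N_{\mathrm{folded}}}
```

Under this simplified model, a two-fold rise in apparent activity is equally consistent with a doubling of specific activity at unchanged folding yield and with a doubling of folding yield at unchanged specific activity, so the activity measurement alone cannot distinguish the two cases.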
Stemmer and coworkers applied directed evolution to screen for mutants or variants of GFP that exhibited increased fluorescence and folding yield in E. coli (see, e.g., Crameri et al., Nat. Biotechnol. 14:315-319, 1996). They identified a mutant that exhibited increased folding ability. This version of GFP, termed cycle-3 or GFP3, contains the mutations F99S, M153T and V163A. GFP3 is relatively insensitive to the expression environment and folds well in a wide variety of hosts, including E. coli. GFP3 folds equally well at 27° C. and 37° C. Thus, the GFP3 mutations also appear to eliminate potential temperature sensitive folding intermediates that occur during folding of wild type GFP.
GFP3 can be made to misfold by expression as a fusion protein with another poorly folded polypeptide. GFP3 has been used to report on the “folding robustness” of N-terminally fused proteins during expression in E. coli (Waldo et al., Nat. Biotechnol. 17:691-695, 1999). If a test protein, Xi, misfolds and is insoluble when expressed in E. coli, cells expressing the corresponding fusion protein Xi-L-GFP3 (where L is a small flexible linker) are poorly fluorescent, indicating the high probability of failure of the GFP3 to fold and become fluorescent. On the other hand, when a protein Xs folds well and is highly soluble when expressed in E. coli, cells expressing the corresponding fusion protein Xs-L-GFP3 are highly fluorescent, indicating the successful folding of the GFP3 domain. These observations suggest the presence of latent folding defects in the folding trajectory of GFP3 and that poorly folded fused polypeptides effectively ‘bait’ the GFP3 to misfold.
This aspect of GFP3 folding has been used to evolve soluble versions of proteins that normally misfold and aggregate when expressed in E. coli. This methodology is described, for example, in WO 01/23602. In these methods, the sequence of the reporter, e.g., the GFP3 domain, remains constant and a poorly folded upstream domain is mutated. Better folded variants of domain X are identified by increased fluorescence.
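The selection step in such a screen can be pictured with a minimal Python sketch. It is not taken from WO 01/23602 or any cited reference: the function name, the clone labels, and the fluorescence readings are all hypothetical, and the only idea borrowed from the text is that variants of domain X are ranked by the fluorescence of the corresponding X-L-GFP3 fusion relative to the unmutated parent.

```python
from typing import Dict, List


def select_brighter_variants(
    fluorescence: Dict[str, float], parent: str, top_n: int = 3
) -> List[str]:
    """Return up to top_n variants brighter than the parent, brightest first.

    fluorescence maps a clone label to its whole-cell fluorescence
    (arbitrary units). The reporter sequence is assumed identical in
    every clone, so differences are attributed to the upstream domain.
    """
    baseline = fluorescence[parent]
    # Keep only clones whose signal exceeds that of the parent fusion.
    improved = [
        name
        for name, signal in fluorescence.items()
        if name != parent and signal > baseline
    ]
    # Rank the improved clones from brightest to dimmest.
    improved.sort(key=lambda name: fluorescence[name], reverse=True)
    return improved[:top_n]


# Hypothetical readings for a parent fusion and four mutants of domain X.
readings = {
    "X_parent": 100.0,
    "X_mut1": 95.0,
    "X_mut2": 410.0,
    "X_mut3": 180.0,
    "X_mut4": 102.0,
}

print(select_brighter_variants(readings, parent="X_parent", top_n=2))
# prints ['X_mut2', 'X_mut3']
```

In an iterative scheme, the clones returned here would serve as the input for the next round of mutagenesis. Fluorescence is only a proxy in this sketch; the solubility of any selected variant would still need to be confirmed directly.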