Structural genomics has gained increasing interest in recent years. The elucidation of protein structures is important to enhance the understanding of protein function and thereby facilitate pharmaceutical drug development.
Protein expression and purification are key processes in such studies, and are often limited by the ability to produce properly folded recombinant protein. The preparation of proteins for structural and functional analysis using the Escherichia coli (E. coli) expression system is often hampered by the formation of insoluble intracellular protein aggregates (inclusion bodies), degradation by proteases or lack of expression.
E. coli is a common expression host that often produces misfolded protein when obliged to overproduce non-native gene products. This severely limits the usefulness of the protein in areas such as structural analysis by crystallography and NMR, and limits the overall success rate of current structural genomics projects. Conventional approaches to the problem of insoluble expressed proteins include low-temperature expression, the use of promoters with different strengths, a variety of solubility-enhancing fusion tags (Kapust R B & Waugh D S. ‘Escherichia coli maltose-binding protein is uncommonly effective at promoting the solubility of polypeptides to which it is fused’. Protein Sci. 1999 August; 8(8):1668-74) and modified growth media (Makrides S C. ‘Strategies for achieving high-level expression of genes in Escherichia coli’. Microbiol Rev. 1996 September; 60(3):512-38. Review).
Another approach for overcoming this difficulty is structure prediction from the amino acid sequence of the protein of interest. Information such as homology alignments and secondary structure prediction is used to predict the position of stable, soluble domains. A truncation or mutation of the target protein is first constructed and then expressed and tested for solubility. Despite continuous progress, the purely ‘rational’ design of proteins with desired properties, such as stability or soluble expression, is, at least to date, not generally feasible. Even in the presence of extensive structural and mechanistic information, it is difficult to predict the sequence truncation required. There is still little information as to how amino acid sequence affects every aspect of protein structure, from its ability to be expressed in a heterologous host to its ability to fold in non-native environments. Experiments have demonstrated that changes in protein properties are brought about by the cumulative effects of many small adjustments, many of which are distributed or propagated over significant distances within the protein molecule. Bioinformatic programs are currently unable to predict accurately which truncations or mutations will increase protein solubility.
In normal structural projects, several tens of clones may be constructed and tested for soluble protein expression. With such projects, the possible diversity is greatly undersampled and often solutions are not found. Additionally, with many proteins predicted from genome sequences there are no known homologues and this limits the effectiveness of bioinformatics approaches. High throughput screening strategies can prove effective for discovering soluble constructs when standard approaches fail. These require the accurate analysis of large numbers of expression clones to identify suitable constructs for structure determination. If the whole protein does not express or crystallise, the next step is to generate truncations or random mutations and retest.
Although (i) current methodologies permit the creation of very large expression libraries; and (ii) the chance that a library contains a soluble protein increases with the size of the library, the practical limits imposed by current approaches for screening expression libraries restrict this practice. The ultimate aim of experimenters who wish to express a soluble or crystallisable form of a protein of interest is to synthesise all possible variants of a target protein and screen them for soluble expression. Clones expressing soluble protein can be used directly, or can be used to seed the next round of library construction and selection. Such experiments would yield a massive number of clones, which would then have to be screened for the expression of soluble target protein.
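The scale of the sampling problem can be illustrated with a rough count. The following is a minimal sketch, not part of the original disclosure, that counts only contiguous truncations (N-terminal, C-terminal or both) of a target protein and ignores point mutations, reading frame and insert orientation. The function names, the 500-residue protein length, the 50-residue minimum fragment length and the figure of 50 clones are hypothetical values chosen for illustration.

```python
def count_truncations(n_residues: int, min_length: int = 1) -> int:
    """Count contiguous fragments of a protein with n_residues residues
    that are at least min_length residues long.

    A fragment is defined by its start and end residue, so the count is
    k * (k + 1) / 2 with k = n_residues - min_length + 1.
    """
    if n_residues < 0 or min_length < 1:
        raise ValueError("n_residues must be >= 0 and min_length >= 1")
    if min_length > n_residues:
        return 0
    k = n_residues - min_length + 1
    return k * (k + 1) // 2


def sampled_fraction(n_clones: int, n_residues: int, min_length: int = 1) -> float:
    """Upper bound on the fraction of possible truncations covered when
    n_clones distinct constructs are tested."""
    total = count_truncations(n_residues, min_length)
    if total == 0:
        return 0.0
    return min(1.0, n_clones / total)


if __name__ == "__main__":
    # Hypothetical example: 500-residue protein, fragments of >= 50 residues,
    # 50 hand-made constructs ("several tens of clones").
    total = count_truncations(500, 50)
    fraction = sampled_fraction(50, 500, 50)
    print(f"possible truncations: {total}")
    print(f"fraction sampled by 50 clones: {fraction:.5f}")
```

Under these assumed values there are about one hundred thousand possible truncations, so a few tens of constructs cover well under 0.1% of them. Including point mutations would enlarge the space far further, which is why a screening method must scale to very large libraries.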
Several systems have been described that aim to identify soluble variants of a candidate protein of interest (generated by random mutagenesis or truncation). In fusion reporter methods, a candidate protein and a reporter protein with an easily detectable feature or biological activity are expressed as a genetic fusion. Information about the folding state of the protein can be derived from a screenable or selectable activity of the fused reporter domain.
Fusion reporter methods usually involve fusion of a C-terminal partner “solubility reporter” (e.g. green fluorescent protein (GFP), chloramphenicol acetyltransferase (CAT) or beta-galactosidase). In the GFP fusion reporter method, the fluorescent yield of GFP provides information about the folding state of its fusion partner. Cells expressing GFP fused to a poorly folded insoluble protein fluoresce less brightly than those expressing GFP fused to a well-folded soluble protein. GFP monitors the folding yield of the test protein, which is subsequently expressed without the GFP tag (Waldo G S. ‘Genetic screens and directed evolution for protein solubility’. Curr Opin Chem Biol. 2003 February; 7(1):33-8. Review).
The inventor has previously developed a fusion reporter system based on the use of biotin carboxyl carrier protein (BCCP) as a protein-folding marker. In this system, the biotinylation domain of BCCP from E. coli is fused to a test protein. The correctly folded secondary and tertiary structure of this domain is recognised by endogenous host cell biotin protein ligase, which biotinylates the domain. Host cells expressing correctly folded test protein and BCCP domain will test positive for the presence of the biotin group (WO03/064656 ‘Protein tag comprising a biotinylation domain and method for increasing solubility and determining folding state’).
However, there are problems associated with these systems, which limit their applicability.
The use of autonomously folding reporter proteins (e.g. GFP, CAT, beta-gal or BCCP domain) can generate overwhelming false positive rates because of their large and soluble nature: the reporter can tolerate fusion to otherwise insoluble fragments or full-length forms of the test protein without itself becoming insoluble. This may not be a problem when the tag can be left in place, e.g. when immobilising proteins via the tag or performing biochemical analyses on the purified protein. However, many applications, e.g. protein crystallography, require removal of the tag by protease cleavage or genetic deletion, and much time and expense is lost by processing clones that subsequently aggregate or degrade upon tag removal and are therefore unusable. It is also possible for the fusion protein to be degraded by proteolysis during expression in vivo, which leaves a soluble fluorescent reporter molecule that generates false positive results.
These effects are very commonly observed with fusion proteins such as those containing maltose binding protein, glutathione-S-transferase, GFP or thioredoxin, and are presumably general. Thus, the presence of a highly soluble fusion partner acting as a solubility reporter strongly perturbs the solubility of the protein to which it is fused.
Furthermore, most of the fusion proteins disclosed in the prior art are large proteins. For example, fusion of GFP increases the size of the protein by approximately 27 kDa. Expression of large fusion proteins in E. coli is problematic, with a practical limitation of about 100 kDa.
Simulation studies, when combined with experiments and sequence/structure database analyses, can help delineate major evolutionary factors responsible for shaping proteins. However, the potential of such studies has not as yet been fully explored.
Accordingly, there exists a great need in the art for a method for rapid, high throughput and reliable screening of expressed proteins as early as possible in the overall process from cloning to structure determination, allowing the selection of soluble expressed proteins. Suitable methods should allow the high throughput screening of a large number of molecules containing different variant sequences, with the selection process allowing the easy identification of molecules with improved solubility. A method amenable to the high throughput analysis of an expression library of variants of individual proteins, especially when used in combination with a mutation or truncation strategy, would enable the identification and isolation of soluble variants of insoluble proteins and would make the optimisation of high level expression of a problematic protein more affordable and less laborious. Additionally, the method should seek to i) minimise the perturbing effects of any fusion partner and ii) minimise the downstream steps required for structural analysis, such as removal of the fused tag; proteins are routinely crystallised with small peptide tags but rarely as bidomain fusions.