There are many large soluble, transmembrane and integral membrane multi-domain proteins of intense biomedical interest. These substances are by definition potential drug targets. Structural and functional analyses of these proteins will provide the basis for design of new strategies for therapeutic intervention in disease. High resolution structural study of proteins provides a basis for understanding biological and disease processes at molecular and atomic levels that is often necessary to support rational design or optimisation of new candidate drugs.
Biochemical and functional assays are used in drug discovery programs to identify compounds that interact with proteins in a manner that interferes with the biological function of the protein. These assays require large quantities of soluble protein to allow screening of thousands of compounds from chemical libraries. However, the production of sufficient quantities of these large proteins for detailed functional and structural studies is rarely feasible using existing methods. In the rare cases where sufficient quantities of large multi-domain proteins can be produced, it is seldom possible to obtain the protein crystals that are a prerequisite to structural study by X-ray crystallography, or samples suitable for other techniques used in the art such as NMR. Production of soluble fragments of these proteins, by contrast, may allow identification of the regions of a protein that are responsible for its biological functions (or malfunctions), and facilitate detailed structural and functional analysis. Production of soluble protein fragments is therefore necessary to allow in vitro biochemical and structural analyses of multi-domain proteins that cannot be obtained in sufficient quantities in intact form. Unfortunately, little is known about the domain structure and organisation of many of these large proteins, and bioinformatics approaches often do not provide a sufficient basis for rational identification of candidate domains. As a result, identification and expression of domains from many of these large proteins have proved refractory to the established, rational, recombinant protein engineering/expression strategies.
There are currently three main empirical approaches to identification of soluble protein domains: 1) bioinformatics and sequence analysis to estimate the location of domain boundaries of proteins based on sequence similarities with known proteins; 2) proteolytic fragmentation of the intact protein and identification of soluble fragments (REF); and 3) generation of “random” gene fragments, cloning to produce a gene fragment library and expression screening of the library to identify clones expressing soluble, folded protein fragments. Collectively, these methods suffer from a number of weaknesses, such as a requirement for quantities of the intact multi-domain protein for fragmentation that often cannot be obtained, and failure to isolate gene fragments capable of producing soluble protein domains.
The most commonly used method for identification of minimal protein domains (domain-mapping) involves limited proteolysis of a target protein and identification of proteolytically resistant fragments by mass spectrometry (e.g. Cohen, S. L. (1996)). This approach is based on the assumption that stable, folded domains are likely to be more resistant to proteolysis than the unstructured regions of peptide sequence that are often found between domains. The approach usually requires a reasonable quantity of highly purified, intact, soluble target protein derived from the native biological source, and therefore cannot be applied to the large portion of human proteins of biomedical interest that cannot be obtained in sufficient quantities. Where protein is available, samples are enzymatically fragmented using various proteases. The molecular masses of the protein fragments generated are then measured by mass spectrometry, and the identity of the fragments may be confirmed by further fragmentation (i.e. protein sequencing by MS). It is then assumed that protein fragments of around sixty or more amino acid residues in length represent stably folded domains, since these portions of the protein appear to have greater resistance to degradation by proteases. This information is then used to design expression vectors for recombinant expression of the soluble domain candidates so identified.
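The mass-matching step described above can be illustrated with a minimal Python sketch. It is offered only as an illustration under stated assumptions, not as the procedure used in the cited work. The names AVERAGE_RESIDUE_MASS, fragment_mass and matching_spans, the toy sequence and the tolerance are all hypothetical, and the residue masses are approximate average values.

```python
# Approximate average residue masses in daltons (assumed reference values).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.0153  # one water per free peptide (N-terminal H plus C-terminal OH)


def fragment_mass(sequence):
    """Average mass of an unmodified peptide with the given sequence."""
    return sum(AVERAGE_RESIDUE_MASS[r] for r in sequence) + WATER


def matching_spans(sequence, observed_mass, tolerance=1.0):
    """Return all (start, end) half-open spans whose mass matches.

    Several spans can match one mass (for example, L and I have the same
    mass), which is one reason a further sequencing step is useful.
    """
    # Prefix sums make each span mass a constant-time lookup.
    prefix = [0.0]
    for residue in sequence:
        prefix.append(prefix[-1] + AVERAGE_RESIDUE_MASS[residue])
    hits = []
    for start in range(len(sequence)):
        for end in range(start + 1, len(sequence) + 1):
            mass = prefix[end] - prefix[start] + WATER
            if abs(mass - observed_mass) <= tolerance:
                hits.append((start, end))
    return hits


# Hypothetical toy sequence; a real target would be far longer.
toy_protein = "MKTAYIAKQEGSGSGSLEDKVAARPW"
observed = fragment_mass(toy_protein[2:8])  # pretend this mass was measured
print(matching_spans(toy_protein, observed, tolerance=0.01))
```

In this sketch a span is reported whenever its calculated mass falls within the tolerance of the measured mass. Applying the sixty-residue threshold mentioned above would amount to keeping only spans for which end minus start is at least sixty.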
In practice, there are several caveats with this approach that may result in failure to detect individual protein domains. The cleavage specificity of proteases is limited to the peptide bond between certain amino acid residue types (e.g. trypsin cleaves the peptide bond to the C-terminal side of basic residues). The position of protease cleavage sites is therefore not a function solely of structural context, but also of amino acid sequence context. Thus, if the appropriate amino acid types are not found in a particular inter-domain peptide sequence, the adjacent domains may not be separated and the individual domains would therefore not be identified. In addition, steric hindrance may prevent protease-mediated cleavage of inter-domain peptide sequences that are short in length. Another major caveat of these approaches is that many domains comprise flexible loop regions that may be proteolytically sensitive, resulting in cleavage within a domain (i.e. failure to detect the correct boundaries of a domain). Finally, a peptide sequence that corresponds to a soluble, folded proteolytic fragment may not necessarily be capable of autonomous folding, and therefore recombinant over-expression of this particular peptide sequence may fail to produce soluble protein of tertiary structural integrity.
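The dependence of cleavage position on sequence context can be sketched in Python as follows. This is a simplified illustration only: it assumes trypsin cleaves C-terminal to lysine (K) or arginine (R) unless the next residue is proline, and it ignores structural accessibility entirely. The function names and the toy domain and linker sequences are hypothetical.

```python
def tryptic_cleavage_sites(sequence):
    """Return 1-based positions of residues after which cleavage is predicted.

    Simplified rule (an assumption): cleave after K or R unless the next
    residue is P. No structural information is used.
    """
    sites = []
    for i, residue in enumerate(sequence[:-1]):
        if residue in "KR" and sequence[i + 1] != "P":
            sites.append(i + 1)
    return sites


def digest(sequence):
    """Split the sequence at every predicted cleavage site."""
    bounds = [0] + tryptic_cleavage_sites(sequence) + [len(sequence)]
    return [sequence[a:b] for a, b in zip(bounds, bounds[1:])]


# Hypothetical two-domain construct joined by a linker with no K or R.
domain_a = "MKTAYIAKQE"
linker = "GSGSGS"
domain_b = "LEDKVAARPW"
construct = domain_a + linker + domain_b

sites = tryptic_cleavage_sites(construct)
# A site s lies in or flanks the linker if the cleaved bond follows any
# residue from the last residue of domain A to the last linker residue.
linker_sites = [s for s in sites
                if len(domain_a) <= s <= len(domain_a) + len(linker)]

print("cleavage sites:", sites)
print("sites separating the two domains:", linker_sites)
print("fragments:", digest(construct))
```

Because the toy linker contains no basic residue, no predicted site separates the two domains, and one fragment spans the linker together with parts of both domains. All predicted sites instead fall inside the domains, which corresponds to the second caveat described above.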
A DNA fragmentation based domain-mapping/identification method requires a protocol for generation of DNA fragments from an intact coding sequence in a manner that allows essentially random sampling of all possible fragments of an appropriate size range (i.e. of a size capable of coding for a domain, ~200-1500 nucleotides). In addition, the fragmentation protocol should ideally be generically reproducible, and must therefore be independent of differences in the properties of particular DNA targets, and should produce fragments that are compatible with conventional methods for cloning of DNA into vectors for protein expression. However, none of the existing DNA fragmentation methods fully meets the requirements of random sampling, generic reproducibility and cloning compatibility: they often display biased sampling, and/or require optimisation of the method for particular target DNA properties such as DNA chain-length, and/or produce fragments that are incompatible with subsequent cloning applications. This is not surprising, as many methods for fragmenting large DNA molecules have been developed for a wide variety of purposes other than protein domain identification.
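The requirement of essentially random sampling of all possible fragments within a size range can be made concrete with a short Python sketch. It is a hypothetical model, not a description of any laboratory protocol: a fragment is represented as a pair of cut points on a coding sequence of a given length, and the function names and the example gene length are assumptions made for illustration.

```python
import random


def count_fragments(gene_length, min_len=200, max_len=1500):
    """Number of distinct (start, length) fragments inside the size window."""
    total = 0
    for length in range(min_len, min(max_len, gene_length) + 1):
        total += gene_length - length + 1  # possible start positions
    return total


def sample_fragment(gene_length, min_len=200, max_len=1500, rng=random):
    """Draw one fragment uniformly from all fragments in the size window.

    Two distinct cut points are chosen uniformly and the pair is rejected
    until the resulting length falls inside the window, so every allowed
    fragment is equally likely (the idealised unbiased case).
    """
    if count_fragments(gene_length, min_len, max_len) == 0:
        raise ValueError("no fragment of the requested size fits this gene")
    while True:
        start, end = sorted(rng.sample(range(gene_length + 1), 2))
        if min_len <= end - start <= max_len:
            return start, end


# Hypothetical 3000-nucleotide coding sequence.
gene_length = 3000
print("possible fragments:", count_fragments(gene_length))
rng = random.Random(0)  # fixed seed so the illustration is repeatable
print("five sampled fragments:",
      [sample_fragment(gene_length, rng=rng) for _ in range(5)])
```

A biased fragmentation method would correspond, in this model, to drawing the cut points from a non-uniform distribution, so that some regions of the coding sequence are under-represented in the resulting library.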
A DNA fragmentation based domain-mapping/identification method also requires a method for cloning of the DNA fragment mixture to produce a library of the gene fragments. A screening assay must then be used to identify clones that produce soluble, folded protein fragments. A number of approaches have been developed for generation of libraries of different clones for a range of purposes including: large-scale DNA sequencing projects (e.g. shotgun cloning); selection of mutant proteins with particular enhanced functional properties (e.g. using gene-shuffling or random mutagenesis); and identification of epitopes for monoclonal antibodies by selection from a phage-display peptide library. Established library-based approaches to selection of protein variants or mutants have recently been adapted to identification of domains in large proteins, including for example: a) cloning of DNA fragments into a bacteriophage surface-expression vector for expression as fusions with bacteriophage structural proteins (phage-display), using affinity selection as readout; b) cloning of DNA fragments into expression vectors to produce fusions with a reporter gene such as GFP or an antibiotic resistance gene, using fluorescence and antibiotic resistance respectively as readout of recombinant protein solubility in vivo.
Phage display approaches involve enzymatic fragmentation of coding DNA and cloning of these fragments into a bacteriophage surface-expression vector to produce a phage display library of clones expressing different gene fragments on their surface. A method has been described involving shotgun cloning coupled with phage display mapping of functional domains of two streptococcal cell-surface proteins (Jacobson, et al., 1997). A phage-display library may be screened using a number of different approaches such as: target protein specific affinity selection and DNA sequencing of clones to identify the minimal fragment that retains binding affinity (e.g. Moriki et al., 1999); surface immobilisation of phage clones followed by limited proteolysis and washing to identify recombinant bacteriophage clones that are most resistant to proteolysis and are likely to display a fragment that has tertiary structure (Finucane et al., 1999). A limitation of affinity selection methods for screening of fragment libraries is a requirement for knowledge of the binding affinity(s) of the target protein, since this excludes the large number of proteins for which no specific binding or enzymatic activity has yet been established. Screening by limited proteolysis of phage particles adhered to a surface also suffers from the same caveats as other limited proteolysis approaches described above.
“Random PCR” has been used to generate fragments of target coding sequence for screening for soluble domains as fusions with green fluorescent protein (Kawasaki and Inagaki 2001). Caveats with this approach include the following: “random PCR” is not truly random and will therefore not produce a complete library of all possible gene fragments of the appropriate size range; and attachment of GFP to the expressed gene fragment may affect the folding and solubility of particular candidate domains, resulting in both false negative and false positive results. An in vivo method for improvement of the solubility of proteins and protein domain constructs has been described involving mutagenesis of target proteins, production of fusions of target proteins with the antibiotic resistance protein chloramphenicol acetyltransferase, and selection of clones with enhanced resistance to chloramphenicol (Maxwell et al., 1999). This method has not been used for domain identification. A caveat with this method is that there is only limited discrimination between soluble and insoluble proteins, and the method does not select between folded and misfolded soluble fusions. An in vivo structural complementation based assay has been described involving fusions of the alpha fragment of beta-galactosidase with the C-terminus of target proteins, so that if the fusion protein proves to be insoluble, interaction with the omega fragment is prevented, resulting in loss of beta-galactosidase activity (Wigley et al., 2001).
In summary, phage-display and fusion protein based methods have the common caveat that attachment of a reporter protein to a test protein is likely to influence the folding and solubility of the test protein in an unpredictable and target protein specific manner. In practice, existing DNA fragmentation approaches are not ideal for protein domain identification methods, as none of them fully meets the requirements of random sampling, generic reproducibility and compatibility with subsequent cloning applications. In addition, all existing methods for domain identification, including limited proteolysis and gene fragmentation based methods such as phage display and fusion protein based screening, have serious limitations. These limitations undoubtedly lead to failure to detect some protein domains, and to failure to identify the domains or regions of a protein that are responsible for biological activities and that could become new targets for therapeutic intervention and drug development.