Expression of recombinant proteins in prokaryotic systems such as Escherichia coli is one of the most valuable tools in biotechnology, from both the scientific and the industrial points of view. Yet, despite extensive research, not every gene can be expressed successfully or efficiently in these organisms (Makrides, 1996; Morris and Geballe, 2000; Paulus et al, 2004; Zhang et al, 2006).
Following gene transcription, the expression process is largely controlled by the efficiency of translation initiation, a phenomenon characterized by recruitment of mRNA and fMet-tRNA to the small subunit (30S) of the ribosome (Kozak, 1999; Laursen et al, 2005). On the mRNA molecule, two closely spaced nucleotide stretches have long been known for their critical role in specific interaction with the ribosome: the initiation codon and the Shine-Dalgarno (SD) sequence. The initiation codon is most often the tri-nucleotide AUG and conventionally stands at positions +1 to +3 on the mRNA. The SD sequence is often a 5- to 13-nucleotide purine-rich motif with the core sequence GGAGG that is optimally located 7±2 nucleotides upstream of the initiation codon, i.e. the SD sequence ends at position −8±2. Translation normally starts at position +1 (Ma et al, 2002; Sørensen and Mortensen, 2005). The initiation codon and the SD sequence constitute the core of the ribosome binding site (RBS) at the mRNA 5′-end. Nevertheless, the ribosome is known to embrace a slightly larger area of the mRNA during the translation process. Huttenhofer and Noller (1994) found that, upon association with a ribosome, a region spanning approximately positions −35 to +20 of the mRNA is protected from chemical modification. More recently, Gulnara and coworkers (2001) employed X-ray crystallography to directly observe the path of mRNA inside the ribosome. Using a short synthetic mRNA with an SD sequence of AAGGAGG separated by 5 nt from the initiation ATG, they found that only a region from nucleotides −15 to +16 of the mRNA is covered by the ribosome.
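The position convention described above can be illustrated with a minimal Python sketch. The function names and the example sequence are hypothetical and chosen only for illustration; the sequence imitates the synthetic mRNA described above (SD sequence AAGGAGG followed by a 5-nt spacer), but the spacer and flanking bases are invented. The sketch assumes that the index of the initiation codon is already known and that the SD core is an exact GGAGG match.

```python
def sd_spacing(mrna, start_index, core="GGAGG"):
    """Number of nucleotides between the end of the nearest upstream
    SD core and the first base of the initiation codon.

    start_index is the 0-based string index of the first base of the
    initiation codon (position +1). Returns None if no core is found.
    """
    upstream = mrna[:start_index].upper().replace("T", "U")
    pos = upstream.rfind(core)          # nearest core upstream of the codon
    if pos == -1:
        return None
    return start_index - (pos + len(core))


def sd_end_position(spacing):
    """mRNA position of the last SD base; there is no position 0,
    so a spacer of n nucleotides puts the SD end at -(n + 1)."""
    return -(spacing + 1)


# Hypothetical example: 2 nt leader, SD sequence, 5 nt spacer, short ORF.
leader, sd, spacer, orf = "GC", "AAGGAGG", "UUUAA", "AUGGCU"
mrna = leader + sd + spacer + orf
start_index = len(leader + sd + spacer)

spacing = sd_spacing(mrna, start_index)
print(spacing, sd_end_position(spacing), 5 <= spacing <= 9)
```

A spacer of 7 nucleotides places the last SD base at position −8, which is how the 7±2 spacing and the −8±2 end position stated above correspond to each other.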
Amongst the mRNA properties that control protein expression, mRNA stability, codon usage, the composition of the SD sequence and its distance from the initiation codon have received considerable attention from scientists investigating the subject (for reviews see Makrides, 1996; Swartz, 2001; Sørensen and Mortensen, 2005). Nevertheless, many authors have also pointed to intra-molecular Watson-Crick style bonds involving the SD sequence and the initiation codon as an additional determinant of the interaction between the mRNA and the 30S subunit, and hence of the protein expression level (Devlin et al, 1998; Makrides, 1996 and references therein; Helke et al, 1993 and references therein; Paulus et al, 2004). The composition of the downstream box (DB), that is, the sequences downstream of the initiation codon, has also been implicated in protein expression levels, most likely by controlling the mRNA secondary structure or folding (Sprengart et al, 1996; O'Connor et al, 1999; Stenström et al, 2001; Paulus et al, 2004). Since the non-structural factors have been rather extensively studied, they are often optimized in dedicated plasmid vectors used by experts and available through commercial suppliers. The less understood structural factors, which are most relevant to this text, are discussed in the following paragraphs.
Helke and coworkers (1993) were amongst the first authors to quantify the stability of the mRNA structure in the ribosome binding region, and reported that strong base-pairing in this section tends to decrease the expression of proteins in E. coli. To this end they isolated stretches of varying length from the beginning of a highly expressible bacteriophage T7 gene and placed them upstream of a cloned mouse dihydrofolate reductase gene. The amounts of protein expressed by the constructs were then recorded. Using a minimum free energy algorithm, the authors predicted one folding for each selected stretch of the deduced mRNA molecules and calculated its averaged free energy (ΔG/nucleotide). By comparing the averaged free energies of different stretches of mRNA, they found that the region delimited by nucleotides −30 and +20 showed the best correlation with the expression of their model protein. These authors also reported that their method can predict the expression of many T7 genes but fails to predict the expression of nearly all non-T7 genes, and suggested that other factors may control the expression of the latter genes. At around the same time two other researchers, de Smit and van Duin (1990), who were working on recombinant expression in E. coli of the coat gene of bacteriophage MS2, reported a clear correlation between its translational efficiency and the stability of the secondary structure of the mRNA initiation region. Exploiting a natural hairpin structure involving 12 nucleotides on either side of the initiation codon of their model gene, and by careful site-directed mutagenesis, they showed that loosening the hairpin structure by as little as 1.4 kcal/mol could increase the gene's translational efficiency by an order of magnitude. These authors, too, used a minimum energy algorithm to predict the structure of the isolated stretch of mRNA and its free energy, although they used the total free energy of the stretch rather than the averaged (ΔG/nucleotide) value.
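As an illustration of the two quantities compared above, the following Python sketch extracts a window such as −30 to +20 from an mRNA string and converts a total free energy into an averaged ΔG/nucleotide value. The helper names, the example sequence and the −10.0 kcal/mol figure are hypothetical; in practice the total free energy would come from a folding program, which is not called here.

```python
def to_index(pos, start_index):
    """Convert an mRNA position (+1 = first base of the initiation
    codon, -1 = base immediately upstream, no position 0) into a
    0-based string index."""
    if pos == 0:
        raise ValueError("position 0 does not exist in this convention")
    return start_index + pos - 1 if pos > 0 else start_index + pos


def window(mrna, start_index, first, last):
    """Return the stretch covering positions first..last, inclusive."""
    lo, hi = to_index(first, start_index), to_index(last, start_index)
    if lo < 0 or hi >= len(mrna) or lo > hi:
        raise ValueError("window falls outside the sequence")
    return mrna[lo:hi + 1]


def averaged_free_energy(total_dg, length):
    """Averaged free energy in kcal/mol per nucleotide."""
    return total_dg / length


# Hypothetical sequence: 30 nt leader, AUG, then 17 further nucleotides.
mrna = "A" * 30 + "AUG" + "C" * 17
start_index = 30

region = window(mrna, start_index, -30, 20)
total_dg = -10.0   # assumed value, standing in for a folding prediction
print(len(region), averaged_free_energy(total_dg, len(region)))
```

Because position 0 is skipped, the −30 to +20 window contains 50 nucleotides, not 51. Averaging by length makes stretches of different sizes comparable, whereas the total free energy used by de Smit and van Duin is only meaningful for a fixed stretch.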
De Smit and van Duin argued that the concentration of the 30S subunit and its affinity for the mRNA's ribosome binding site, on the one hand, and the strength of the local mRNA internal structure, on the other, determine how many ribosomes can successfully interact with the mRNA. These authors suggested that ribosomes bind only to single-stranded RNA (which is in equilibrium with the folded form) and that loosening the mRNA secondary structure in the RBS shifts the equilibrium towards the unfolded RBS, and hence towards higher ribosome association and subsequent expression (de Smit and van Duin, 1990).
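The equilibrium argument can be made concrete with a simplified two-state calculation. The sketch below is not the full model of de Smit and van Duin, which also accounts for the 30S concentration and its affinity for the RBS; it only assumes that the RBS is either folded or unfolded, that the two forms are related by a Boltzmann factor, and that the temperature is 37 °C. The function name and the two free energy values are hypothetical.

```python
import math

R_KCAL = 1.987e-3      # gas constant in kcal/(mol*K)
T_KELVIN = 310.15      # 37 degrees Celsius, assumed growth temperature


def fraction_unfolded(dg_fold, temperature=T_KELVIN):
    """Fraction of mRNA molecules with an unfolded RBS in a two-state
    model, where dg_fold is the (negative) folding free energy in
    kcal/mol and K_fold = exp(-dg_fold / RT) = [folded] / [unfolded]."""
    k_fold = math.exp(-dg_fold / (R_KCAL * temperature))
    return 1.0 / (1.0 + k_fold)


# Hypothetical hairpin, then the same hairpin loosened by 1.4 kcal/mol.
stable = fraction_unfolded(-6.0)
loosened = fraction_unfolded(-4.6)
print(f"fold increase in unfolded RBS: {loosened / stable:.1f}")
```

Under these assumptions a 1.4 kcal/mol destabilization raises the unfolded fraction close to tenfold, which is of the same order as the increase in translational efficiency reported by these authors.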
More recently, Voges and coworkers (2004) used a comprehensive statistical approach to investigate the effect of mRNA sequences downstream of the initiation codon in a cell-free protein synthesis system (RTS 100 E. coli HY Kit, Roche Applied Science) based on the T7 promoter/terminator. These authors inserted a diverse array of 39-nucleotide stretches at position +4 of a GFP expression cassette and assessed the GFP expression levels of the new constructs. The expression levels were then correlated with up to 356 calculated sequence attributes, including G+C content and mRNA secondary structure in the first 300 nucleotides. However, unlike in the previous studies, emphasis was placed on the probability of individual nucleotides participating in base pair formation and on the positions of local stem loops (as well as their energy contents). Voges and coworkers reported that the factor most significantly correlated with expression levels in their experiment was the inverse G+C content of the mRNA, in particular in the third bases of codons 2 to 7. Nevertheless, the authors pointed out that this finding was in contrast with what is observed in the native, highly expressed genes of E. coli. These authors also reported that higher base pair probabilities downstream of the initiation point, in particular in bases +3 to +25 (approximately corresponding to codons 1 to 9), were correlated with lower expression levels. The authors concluded that the accessibility of unpaired nucleotide bases in this region promoted translational efficiency. Attempts to predict protein expression based on the above data met with only moderate success, as the authors reported an adjusted coefficient of determination (R-square) of only 0.42 (Voges et al, 2004). A web-based application, ProteoExpert, developed on the basis of this analysis and dedicated to optimized protein expression in cell-free systems, is available at biomax.com. A patent application related to this method was also found on the USPTO website (20060024679).
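One of the sequence attributes mentioned above, the G+C content of the third bases of codons 2 to 7, is simple to compute. The following Python sketch is only an illustration of that single attribute and is not the ProteoExpert algorithm; the function name and the example coding sequence are hypothetical.

```python
def third_base_gc(cds, first_codon=2, last_codon=7):
    """Fraction of G or C among the third bases of the given codons.

    cds is a coding sequence that starts with the initiation codon
    (codon 1); codons are numbered from 1.
    """
    cds = cds.upper().replace("T", "U")
    if len(cds) < 3 * last_codon:
        raise ValueError("coding sequence is too short")
    # The third base of codon k sits at 0-based index 3 * (k - 1) + 2.
    thirds = [cds[3 * (k - 1) + 2] for k in range(first_codon, last_codon + 1)]
    return sum(base in "GC" for base in thirds) / len(thirds)


# Hypothetical coding sequence: AUG followed by six codons and a stop.
cds = "AUG" + "GCU" + "GCC" + "AAA" + "AAG" + "UUU" + "UUC" + "UAA"
print(third_base_gc(cds))
```

Since the third codon position is frequently synonymous, this attribute can often be lowered by silent mutations without altering the encoded protein.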
Controversy over the exact position and size of the region that controls the expression of recombinant proteins in prokaryotic cells has been a common theme in other reports too. Wang and coworkers (1994) analyzed a stretch of mRNA extending from 5 nucleotides upstream of the SD sequence to 40 nt downstream of the AUG and found that potential secondary structures in this region markedly hamper the expression of their model protein, prochymosin. The minimum expression of prochymosin was obtained with a free energy of −11 kcal/mol in this region, whereas less negative ΔG values of about −4 or −4.43 kcal/mol increased the expression up to an impressive 39% of the total cell protein. Two other authors, Cèbe and Geiser (2006), used an experimental system based on the genes for sphingosine kinase 1 and the sclerostin protein to show that the 5′ region of the mRNA spanning from the first A of the SD sequence to nucleotide +72 may be used to predict protein expression levels. They suggested that if the total ΔG in this region is above −4 to −4.78 kcal/mol, the mRNA will be effectively translated. On the other hand, stronger structure in this region inhibits translation, although this may be reversed by silent mutations in the region that disrupt the existing base pairs. More recently, Care et al (2007) estimated the free energy of the −70 to +96 region of the mRNA in order to optimize the expression of proteins. They reported that by mutating nucleotides in the −17 to +9 region they reduced the free energy content of the crucial −70/+96 region and enhanced the expression of 8 out of the 9 proteins used in their experiment. A web-based application, ExEnSo, developed on the basis of this concept is available at exenso.afmb.univ-mrs.fr.
Formation of intra-molecular bonds in the mRNA secondary structure may be predicted using a variety of software, exemplified by Rdfolder (rna.cbi.pku.edu.com), the Vienna RNA secondary structure server (tbi.univie.ac.at), Sfold (sfold.wadsworth.org), CONTRAfold (contra.stanford.edu) and mfold. Amongst these, the Vienna RNA secondary structure server appears to have been used in more articles (Voges et al, 2004; Cèbe and Geiser, 2006; Zhang et al, 2006), although the mfold algorithm has also been employed successfully (Paulus et al, 2004).
Since mfold readily generates more than one structure with free energies close to the minimum (known as optimal and sub-optimal structures), it is perhaps more appropriate for predicting secondary structures in dynamic mRNA molecules. This may be even more applicable considering that translation and transcription are concurrent in prokaryotes. The number of near-minimum-free-energy structures generated by mfold may be adjusted through the sub-optimality value, which is 5% by default (Zuker, 2003). The latest version of mfold (version 3.2), which uses improved thermodynamic values, is used in the research presented throughout this application.
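The effect of a percent sub-optimality value can be sketched as a simple filter on a list of predicted free energies. This Python fragment only illustrates the percentage criterion and does not reproduce mfold itself, which applies further parameters when selecting sub-optimal foldings; the function name and the energy values are hypothetical.

```python
def within_suboptimality(energies, percent=5.0):
    """Keep the free energies (kcal/mol) that lie within `percent`
    of the minimum free energy, sorted from most to least stable."""
    mfe = min(energies)
    # The cutoff is less negative than the minimum by percent of |mfe|.
    cutoff = mfe + abs(mfe) * percent / 100.0
    return sorted(e for e in energies if e <= cutoff)


# Hypothetical free energies of four alternative foldings.
energies = [-8.0, -9.7, -10.0, -9.4]
print(within_suboptimality(energies))           # 5% window
print(within_suboptimality(energies, 10.0))     # wider 10% window
```

With a minimum of −10.0 kcal/mol, the 5% window retains structures down to −9.5 kcal/mol, so raising the percentage admits progressively less stable alternatives.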
MBP8298, also mentioned in the following paragraphs, is a 17-amino acid peptide that has been shown to constitute a novel treatment in the management of multiple sclerosis. The peptide corresponds to amino acids 82 to 98 of the human myelin basic protein (MBP) and is presently produced by chemical synthesis only (Warren et al, 2006). (Paulus et al, 2004)