It is frequently desirable to express proteins encoded by nucleic acids, for example to produce a protein for use in a therapeutic or biocatalytic application, or so that the protein can perform a function within the cell in which it is expressed. Due to the degeneracy of the genetic code, numerous different nucleotide sequences can all encode the same protein. Redesigning a naturally occurring gene sequence by choosing different codons, without necessarily altering the encoded amino acid sequence, often dramatically increases protein expression levels (Gustafsson et al., 2004, “Codon bias and heterologous protein expression,” Trends Biotechnol 22, 346-53).
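The degeneracy described above can be illustrated with a short sketch. It assumes the standard genetic code; the helper names (`translate`, `count_synonymous_sequences`) and the example peptide are hypothetical and chosen only for illustration.

```python
from collections import Counter
from math import prod

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))

# Number of codons that encode each amino acid ("*" marks stop codons).
DEGENERACY = Counter(CODON_TABLE.values())


def translate(dna: str) -> str:
    """Translate a DNA coding sequence, ignoring any trailing partial codon."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def count_synonymous_sequences(protein: str) -> int:
    """Count the distinct nucleotide sequences that encode the given protein."""
    return prod(DEGENERACY[aa] for aa in protein)


# Two different nucleotide sequences encoding the same tripeptide Met-Lys-Leu.
print(translate("ATGAAACTG"), translate("ATGAAGTTA"))  # MKL MKL
print(count_synonymous_sequences("MKL"))  # 12 (1 x 2 x 6)
```

Because the count is a product over residues, it grows exponentially with protein length, which is why the choice among synonymous sequences cannot be made by exhaustive testing.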
The inspiration for most codon optimization algorithms comes from assessing coding sequence characteristics present in naturally derived genomic sequences as a proxy for synthetic genes. The assumption guiding this method is that a synthetic gene will express well if its sequence mimics the nucleotide sequence characteristics of the host genome. Variables such as codon adaptation index (CAI), mRNA secondary structure, cis-regulatory sequences, GC content and many similar variables have been shown to correlate to some extent with protein expression levels (Villalobos et al., 2006, “Gene Designer: a synthetic biology tool for constructing artificial DNA segments,” BMC Bioinformatics 7, 285). A problem with these correlations is that protein expression is generally believed to be controlled at the level of initiation of transcription and translation, not translational velocity. Initiation of transcription and translation is governed by promoter strength and the strength of the ribosome binding site, which differ for every natural protein, and which are not taken into account in such blunt analyses as identifying the most common codon for a particular amino acid across every protein encoded in an organism's genome. The sequence characteristics of the coding sequences may instead reflect other factors, such as evolutionary constraints involved in facilitating DNA replication, mutational bias, intrinsic metabolic regulation, transposon resistance and ancestral origin, rather than serving as a useful guide to design principles with which to obtain high levels of expression of recombinant protein (Moura et al., 2005, “Comparative context analysis of codon pairs on an ORFeome scale,” Genome Biol. 6, R28).
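Two of the genome-derived variables named above, CAI and GC content, can be sketched as follows. This is a minimal illustration, assuming the commonly used definition of CAI as the geometric mean of each codon's relative adaptiveness (its frequency divided by that of the most frequent synonymous codon in a reference set). The reference sequences are invented toy data, not real host codon usage, and all function names are hypothetical.

```python
import math
from collections import Counter

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = [a + b + c for a in BASES for b in BASES for c in BASES]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def split_codons(seq):
    """Split a coding sequence into codons, dropping any trailing partial codon."""
    return [seq[i:i + 3] for i in range(0, len(seq) - len(seq) % 3, 3)]


def gc_content(seq):
    """Fraction of bases in the sequence that are G or C."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def relative_adaptiveness(reference_seqs):
    """Weight each codon by its count relative to the most used synonymous codon."""
    counts = Counter(c for s in reference_seqs for c in split_codons(s))
    weights = {}
    for codon, aa in CODON_TABLE.items():
        family_max = max(counts[c] for c, a in CODON_TABLE.items() if a == aa)
        if family_max > 0:  # skip amino acids absent from the reference set
            weights[codon] = counts[codon] / family_max
    return weights


def cai(seq, weights):
    """Geometric mean of codon weights, skipping Met, Trp and stop codons."""
    logs = []
    for codon in split_codons(seq):
        if CODON_TABLE[codon] in "MW*" or codon not in weights:
            continue
        if weights[codon] == 0:
            return 0.0  # a zero factor makes the geometric mean zero
        logs.append(math.log(weights[codon]))
    if not logs:
        raise ValueError("no scorable codons in sequence")
    return math.exp(sum(logs) / len(logs))


# Toy reference set (assumption): CTG preferred for Leu, AAA preferred for Lys.
toy_weights = relative_adaptiveness(["CTGCTGCTGTTA", "AAAAAAAAG"])
print(cai("CTGAAA", toy_weights))  # uses only the preferred codons
print(cai("TTAAAG", toy_weights))  # uses the rarer synonymous codons
print(gc_content("CTGAAA"), gc_content("TTAAAG"))
```

Note that both scores are computed from the coding sequence alone. Nothing in the calculation reflects promoter or ribosome binding site strength, which is the limitation discussed above.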
To date, there has been no systematic study of the effect of codon choices on protein expression in which other expression control elements, such as promoters and ribosome binding sequences, are held constant. Thus there is currently no reliable strategy for selecting the codons in a synthetic gene to obtain high protein expression levels, nor is there a reliable algorithm with which to assess the level of protein likely to be expressed from a synthetic gene. There is thus a need in the art for both.