Synthetic DNA sequences are a vital tool in molecular biology. They are used in gene therapy, vaccines, DNA libraries, environmental engineering, diagnostics, tissue engineering and research into genetic variants. Long artificially-made nucleic acid sequences are commonly referred to as synthetic genes; however, the artificial elements produced do not have to encode genes and can instead be, for example, regulatory or structural elements. Regardless of functional usage, long artificially-assembled nucleic acids can be referred to herein as synthetic genes and the process of manufacturing these species can be referred to as gene synthesis. Gene synthesis provides an advantageous alternative to obtaining genetic elements through traditional means, such as isolation from a genomic DNA library, isolation from a cDNA library, or PCR cloning. Traditional cloning requires availability of a suitable library constructed from isolated natural nucleic acids wherein the abundance of the gene element of interest is at a level that assures a successful isolation and recovery.
Artificial gene synthesis can also provide a DNA sequence that is codon optimized. Given codon redundancy, many different DNA sequences can encode the same amino acid sequence. Codon preferences differ between organisms and a gene sequence that is expressed well in one organism might be expressed poorly or not at all when introduced into a different organism. The efficiency of expression can be adjusted by changing the nucleotide sequence so that the element is well expressed in whatever organism is desired, e.g., it is adjusted for the codon bias of that organism. Widespread changes of this kind are easily made using gene synthesis methods but are not feasible using site-directed mutagenesis or other methods which introduce alterations into naturally isolated nucleic acids.
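The recoding idea described above can be illustrated with a minimal Python sketch. It assumes the standard genetic code and an uppercase, in-frame coding sequence. The names `PREFERRED_CODON`, `translate` and `recode` are hypothetical, and the three preferred codons shown are illustrative placeholders rather than measured codon usage data for any organism.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Hypothetical preference table: amino acid -> codon favored by the host.
# Placeholder values for illustration only, not real usage frequencies.
PREFERRED_CODON = {"L": "CTG", "A": "GCG", "K": "AAA"}


def split_codons(dna: str) -> list[str]:
    """Split an in-frame, uppercase coding sequence into codons."""
    if len(dna) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    return [dna[i:i + 3] for i in range(0, len(dna), 3)]


def translate(dna: str) -> str:
    """Translate a coding sequence; '*' marks a stop codon."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(dna))


def recode(dna: str, preferred: dict[str, str] = PREFERRED_CODON) -> str:
    """Swap each codon for the preferred synonymous codon, if one is listed."""
    return "".join(
        preferred.get(CODON_TABLE[codon], codon) for codon in split_codons(dna)
    )


native = "ATGTTAGCTAAGTAA"  # hypothetical input encoding M-L-A-K-stop
optimized = recode(native)
print(native, "->", optimized)
print("protein unchanged:", translate(native) == translate(optimized))
```

Because every substitution is synonymous, the encoded amino acid sequence is unchanged while the nucleotide sequence may differ at many positions, which is the kind of widespread change that is impractical to introduce by site-directed mutagenesis.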
As another example, a synthetic gene can have restriction sites removed and new sites added. As yet another example, a synthetic gene can have novel regulatory elements or processing signals included which are not present in the native gene. Many other examples of the utility of gene synthesis are well known to those with skill in the art.
Furthermore, a sequence isolated from genomic DNA or cDNA libraries only provides an isolate having that nucleic acid sequence as it exists in nature. It is often desirable to introduce alterations into that sequence. For example, a randomized mutant library can be created wherein random bases are inserted into desired positions and then expressed to find desirable properties relative to the wild type sequence. This approach does not allow for specific placement of degenerate bases. In another example, a gene enriched with repeat sequences could be used for genomic mapping or marking.
Although the cost of synthesizing a large library of genes can be substantial, the ability to optimize or change the characteristics of the encoded enzyme or antibody can result in a powerful biological tool or therapeutic. Recombinant antibodies such as Humira® (Abbott Laboratories, Inc.) are widely used as therapeutics, and many others are used as research tools. Those in the art also appreciate that many commercial proteins, such as enzymes, originated from mutant libraries.
Gene synthesis employs synthetic oligonucleotides as the primary building block. Oligonucleotides are made using chemical synthesis, most commonly using β-cyanoethyl phosphoramidite methods, which are well-known to those with skill in the art (M. H. Caruthers, Methods in Enzymology 154, 287-313 (1987)). Using a four-step process, phosphoramidite monomers are added in a 3′ to 5′ direction to form an oligonucleotide chain. During each cycle of monomer addition, a small fraction of the oligonucleotide chains will fail to couple (n−1 product). Therefore, with each subsequent monomer addition the cumulative population of failures grows. Also, as the oligonucleotide grows longer, the base addition chemistry becomes less efficient, presumably due to steric issues with chain folding. Typically, oligonucleotide synthesis proceeds with a base coupling efficiency of around 99.0 to 99.2%. A 20 base long oligonucleotide requires 19 base coupling steps. Thus, assuming a 99% coupling efficiency, a 20 base oligonucleotide should have 0.99^19 purity, meaning approximately 82% of the final end product will be full length and 18% will be truncated failure products. A 40 base oligonucleotide should have 0.99^39 purity, meaning approximately 68% of the final end product will be full length and 32% will be truncated failure products. A 100 base oligonucleotide should have 0.99^99 purity, meaning approximately 37% of the final product will be full length and 63% will be truncated failure products. In contrast, if the efficiency of base coupling is increased to 99.5%, then a 100 base oligonucleotide should have 0.995^99 purity, meaning approximately 61% of the final product will be full length and 39% will be truncated failure products.
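The full-length fractions quoted above follow from raising the per-step coupling efficiency to the number of coupling steps (length minus one). The short Python sketch below reproduces that arithmetic. It assumes a constant coupling efficiency at every step, which is a simplification, since the text notes that efficiency tends to drop as the chain grows. The function name `full_length_fraction` is hypothetical.

```python
def full_length_fraction(length: int, coupling_efficiency: float) -> float:
    """Expected fraction of full-length product for an oligonucleotide.

    Assumes the same coupling efficiency at every step; an oligonucleotide
    of `length` bases requires `length - 1` coupling steps.
    """
    if length < 1:
        raise ValueError("length must be at least 1")
    if not 0.0 <= coupling_efficiency <= 1.0:
        raise ValueError("coupling_efficiency must be between 0 and 1")
    return coupling_efficiency ** (length - 1)


# The four cases discussed in the text.
for length, efficiency in [(20, 0.99), (40, 0.99), (100, 0.99), (100, 0.995)]:
    full = full_length_fraction(length, efficiency)
    print(
        f"{length:>3} bases at {efficiency:.1%} coupling: "
        f"{full:.1%} full length, {1 - full:.1%} truncated"
    )
```

Under this simplified model, a half-percent gain in coupling efficiency raises the full-length yield of a 100 base oligonucleotide from roughly 37% to roughly 61%, which is why coupling efficiency dominates the practical length limit of building-block oligonucleotides.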
Using gene synthesis methods, a series of synthetic oligonucleotides is assembled into a longer synthetic nucleic acid, e.g. a synthetic gene. The use of synthetic oligonucleotide building blocks with a high percentage of failure products present will decrease the quality of the final product, requiring implementation of costly and time-consuming error correction methods. For this reason, relatively short synthetic oligonucleotides in the 40-60 base length range have typically been employed in gene synthesis methods, even though longer oligonucleotides could have significant benefits in assembly. It is well appreciated by those with skill in the art that use of high quality synthetic oligonucleotides, e.g. oligonucleotides with few errors or missing bases, will result in higher quality assembly of synthetic genes than the use of lower quality synthetic oligonucleotides.
Some common forms of gene assembly are ligation-based assembly, PCR-driven assembly (see Tian et al., Mol. BioSyst., 5, 714-722 (2009)) and thermodynamically balanced inside-out based PCR (TBIO) (see Gao X. et al., Nucleic Acids Res. 31, e143). All three methods combine multiple shorter oligonucleotides into a single longer end-product.
Therefore, to make genes that are typically 500 to many thousands of bases long, a large number of smaller oligonucleotides are synthesized and combined through ligation, overlapping, etc., after synthesis. Typically, gene synthesis methods only function well when combining a limited number of synthetic oligonucleotide building blocks and very large genes must be constructed from smaller subunits using iterative methods. For example, 10-20 of 40-60 base overlapping oligonucleotides are assembled into a single 500 base subunit due to the need for overlapping ends, and twelve or more 500 base overlapping subunits are assembled into a single 5000 base synthetic gene. Each subunit of this process is typically cloned (i.e., ligated into a plasmid vector, transformed into a bacterium, expanded, and purified) and its DNA sequence is verified before proceeding to the next step. If the above gene synthesis process has low fidelity, either due to errors introduced by low quality of the initial oligonucleotide building blocks or during the enzymatic steps of subunit assembly, then increasing numbers of cloned isolates must be sequence verified to find a perfect clone to move forward in the process or an error-containing clone must have the error corrected using site directed mutagenesis.
Traditional methods for assembly have suffered from shortcomings of being unable to clone low complexity sequence motifs such as repeats, homopolymeric nucleotide runs, and high/low GC sequences. In addition, the ability to generate libraries of high sequence variation at defined sequences is even more problematic. Methods for overcoming these limitations have been developed that are based on the synthesis and incorporation of highly pure long single stranded oligonucleotides, such as Ultramers™ oligonucleotides (Integrated DNA Technologies, Inc.) into double stranded clonal/non-clonal PCR products (see gBlocks® gene block fragments from Integrated DNA Technologies, Inc.). Once fully assembled, the double stranded material can be subjected to error correction methodologies to improve the fidelity of the end product.
Libraries containing high sequence variation at defined sequences (see gBlocks® Gene Fragments Libraries from Integrated DNA Technologies, Inc.) consist of a specific sequence of DNA synthesized in the form of linear double stranded DNA. Libraries are designed to include variation ranging from a single base to a large region of sequence, but to limit the amount of variation present in each molecule of DNA. A common example is a hifi library consisting of the coding sequence of a variable chain of an antibody. The library may be constructed so that each codon within the chain is varied with an NNK sequence, but no single molecule contains more than one variation. This allows the researcher to explore variation over a large area of the sequence while limiting the number of sequences to be screened to N × 32 variations, where N is the number of codons in the sequence and 32 is the number of codons encoded by NNK. This type of library itself is not novel and is often described in the literature under the more general term of saturation mutagenesis; however, the construction of this type of library is usually costly and very time consuming. It is also almost impossible to eliminate the background of wild type sequence from the final construct, increasing the amount of screening required to assess all possible variants.
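The size of such a single-site NNK library can be checked with a short Python sketch. It assumes the standard IUPAC meanings (N is any of A, C, G or T; K is G or T) and an uppercase, in-frame parent sequence. The function name `single_site_nnk_variants` and the nine-base parent sequence are hypothetical and serve only as an illustration.

```python
from itertools import product
from typing import Iterator

# NNK: N = A/C/G/T at positions 1 and 2, K = G/T at position 3 (IUPAC codes).
NNK_CODONS = ["".join(codon) for codon in product("ACGT", "ACGT", "GT")]


def single_site_nnk_variants(coding_seq: str) -> Iterator[str]:
    """Yield every sequence in which exactly one codon is replaced by an NNK codon.

    Each yielded molecule carries at most one varied codon, so a parent of
    N codons yields N * 32 sequences. A yielded sequence equals the parent
    whenever the parent codon at that position already fits the NNK pattern.
    """
    if len(coding_seq) % 3 != 0:
        raise ValueError("sequence length must be a multiple of 3")
    for pos in range(0, len(coding_seq), 3):
        for codon in NNK_CODONS:
            yield coding_seq[:pos] + codon + coding_seq[pos + 3:]


parent = "ATGGCTAAA"  # hypothetical three-codon parent sequence
variants = list(single_site_nnk_variants(parent))
print("NNK codons per position:", len(NNK_CODONS))
print("designed sequences:", len(variants))
print("distinct sequences:", len(set(variants)))
print("parent among designed sequences:", parent in variants)
```

The enumeration makes the scaling explicit: the number of designed sequences grows linearly with the number of codons rather than exponentially, as it would if several positions were varied within the same molecule. It also shows that the design itself can contain the parent sequence whenever a parent codon ends in G or T, separate from any wild type background arising during construction.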
The methods of the invention described herein provide high quality synthetic genes containing regions of high variability. Although derived from clonally purified wild type parent sequences, the recombination rate is such that the percent of the wild type sequence present in the final mixture is greatly diminished. These and other advantages of the invention, as well as additional inventive features, will be apparent from the description of the invention provided herein.