1. Field of the Invention
The present invention relates generally to the fields of genomics, synthetic biology and genetic engineering. More particularly, the present invention concerns methods that enable parallel multiplex ligation and amplification on a surface for making assemblies of nucleic acids for various biological applications and for analysis of biological samples such as DNA, RNA, and proteins.
2. Description of Related Art
The invention relates to the field of nucleic acid technologies, specifically to the preparation, application, and use of nucleic acids of predetermined sequences. Increasingly, genomic-scale research and applications based on fundamental molecular sciences drive the major advances in biosciences and technologies. As the scope of the problems to be investigated is quickly expanding, there must be tools available for faster, cheaper and better experiments. Dramatic progress has been made in the transition from traditional molecular biological techniques to miniaturization, parallelization, and automation in investigating problems at genomic and proteomic scales. Traditional single experiments are now performed on 96- or 384-well plates in parallel using liquid handling robotics. These experiments use micromoles (μmol) of materials and milliliters (ml) of solutions. However, the present level of advancement is limited in large-scale applications, such as those at the genomic scale or involving large sample sets. This is because such a large-scale experiment would require thousands to millions of tests. The consequence is extremely costly material preparation taking a very long period of time (months to years). One such example is genome-wide single nucleotide polymorphism (SNP) analysis based on a large population. Such an experiment would provide invaluable information for genetic prediction and prevention of hereditary diseases and for molecular diagnostics of life-threatening diseases, such as cancers. If a single experiment per SNP per person uses ten milliliters (10 ml) of solution, the overall experiment for, say, 100,000 SNPs and 1,000 people would consume, for solvents alone, 1,000,000 liters (l), equivalent to the annual capacity of a small chemical plant.
Another example that demonstrates the inadequacy of the current methods is the preparation of synthetic genes from synthetic oligodeoxyribonucleotides (oligos) for genome assemblies. A small genome usually contains approximately five million base pairs (bps), which would include several thousand genes. For the solvent alone, the conventional methods of oligo preparation would consume 50,000 l or more, assuming 5 ml solvent consumption per synthesis cycle. Clearly, at this level of material consumption, it is not practical to conduct research and development at a genome scale. For these large-scale experiments, a massive number of instruments and ample space would be required for handling and storing these reagents. The overall process would be laborious, time consuming, and error-prone. To overcome these problems, it is desirable to develop technologies that reduce the consumption of reagents from μmol (solid) or ml (liquid) quantities by a factor of 1,000 or more. The advantages of such technologies are evident: they would enable genome-scale experiments, accelerate the understanding of the complex biology of cellular systems, permit discovery of novel regulatory mechanisms, and save natural resources. The savings in material consumption and time also translate into an environmentally friendly and economically sensible process.
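The two solvent estimates above can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function names are hypothetical, and the factor of two strands is an assumption (one reading that reproduces the stated 50,000 l figure), not something the text states.

```python
ML_PER_L = 1000  # milliliters per liter


def snp_solvent_liters(n_snps, n_people, ml_per_test):
    """Solvent for one test per SNP per person, in liters."""
    return n_snps * n_people * ml_per_test / ML_PER_L


def oligo_solvent_liters(genome_bp, ml_per_cycle, strands=2):
    """Solvent for one synthesis cycle per nucleotide, in liters.

    strands=2 is an assumption: both strands of the duplex are synthesized.
    """
    return genome_bp * strands * ml_per_cycle / ML_PER_L


snp_total = snp_solvent_liters(100_000, 1_000, 10)
oligo_total = oligo_solvent_liters(5_000_000, 5)
print(snp_total)    # 1000000.0
print(oligo_total)  # 50000.0
```

Reducing the per-reaction volume by a factor of 1,000, as proposed above, scales both totals down by the same factor.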
Synthesis of large DNA fragments, which may be partial or complete gene sequences, any part of chromosomal DNA or DNA of biological sources, or any arbitrary sequences, is one goal of the present invention. DNA sequence information and powerful computational methods now make it possible to engineer DNA sequences. These sequences can simulate or alter the functions and roles of a large number of transcribed RNAs and translated proteins. This emerging field, called synthetic biology, encompasses the creation of DNA libraries for transcription of RNA sequences and for expression of proteins/antibodies and peptides, which can provide biomedical, agricultural, and environmental benefits. Synthetic biology also involves the construction of entire genomes for making RNA and proteins, which can then be assembled to form biomolecular complexes, biological pathway systems, organisms, and cells. The current methods of DNA synthesis are either too expensive and too slow for assembling long nucleic acid molecules from oligos, or are restricted to natural DNA sequences through assemblage of shuffled digested DNA fragments (Stemmer, 1994).
DNA synthesis using oligos has been used by molecular biologists for making natural genes, mutated genes (truncation, fusion, insertion/deletion), hybrid genes, transgenic genes, etc. (Dillon et al., 1990; Stemmer et al., 1995; Au et al., 1998). Synthetic genes, which often have lengths of one thousand base pairs (1 kbp) or greater, have traditionally been assembled one at a time by joining oligos of 30-80 bps in solution as a pool of mixed sequences. The oligos are specially designed according to the sequence of the DNA to be assembled and chemically synthesized on a solid support, such as controlled pore glass (CPG), and the oligos are assembled to form long DNA, usually without purification. The gene assembling process accomplishes two tasks: (1) annealing or hybridization of oligos to form a duplex and (2) ligation to join these oligos to form a long chain of covalently linked nucleotides. Alternatively, oligo duplexes containing overlapping regions can be extended into long chain products by the polymerase chain reaction (PCR). Present methods of gene synthesis may differ in the order of the hybridization, ligation and/or PCR steps, but all have the same limitations with respect to scalability. One current synthesis method is to design a set of oligos according to the DNA of interest and combine ligation and PCR in the same process (e.g., ligation chain reaction (LCR)), which uses a pool of oligos and both DNA ligation and extension enzymes. This process generates DNA fragments of intermediate lengths, and these fragments are subsequently joined into the full-length DNA using overlapping PCR. This method has been used to produce a synthetic 5.4 kbp phage (Smith et al., 2003) and a 7.5 kbp polioviral genome (Cello et al., 2002). Another method (U.S. Pat. No. 6,521,427 and U.S. Pat. No. 6,670,127) of synthesizing a double-stranded polynucleotide includes annealing a terminus duplex to another oligo, and this annealing step is serially repeated to produce a long double-stranded DNA. The nicks between the annealed oligos in the duplex are ligated. Overall, the current methods of DNA synthesis produce a single sequence per assembling reaction and thus are slow and expensive.
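The oligo design step described above, in which a target duplex is divided into oligos that anneal and are then ligated, might be sketched as follows. This is a simplified illustration and not the design procedure of any cited method: the function names are hypothetical, the fixed 40-nucleotide oligo length and the half-oligo stagger between strands are assumptions, and no melting temperature or secondary-structure checks are made.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def tile_duplex(target, oligo_len=40):
    """Split a target (top strand, 5'->3') into top- and bottom-strand oligos.

    Bottom-strand oligos are offset by half an oligo (an assumed design rule)
    so that each one bridges a nick between two adjacent top-strand oligos.
    """
    half = oligo_len // 2
    top = [target[i:i + oligo_len] for i in range(0, len(target), oligo_len)]
    # Cut points on the bottom strand, expressed in top-strand coordinates.
    cuts = [0] + list(range(half, len(target), oligo_len)) + [len(target)]
    bottom = [reverse_complement(target[a:b]) for a, b in zip(cuts, cuts[1:])]
    return top, bottom


target = "ACGT" * 30  # 120-nt toy sequence, for illustration only
top, bottom = tile_duplex(target)
print(len(top), len(bottom))
```

Because the nicks on the two strands never coincide, every junction on one strand is held together by an intact oligo on the other, which is what allows a ligase to seal it.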
Oligonucleotide synthesis, historically, was not developed for large-scale parallel applications but rather for applications requiring individual sequences. Today it is still accomplished essentially on a one-by-one basis. Current methods of high-throughput synthesis are limited to about 40,000 bp/day and cost about $0.10/bp. Thus, preparing the oligos for assembling a small genome of 5,000 genes of about 1 kbp per gene (5M bps) would cost $0.5 M and take 125 days (counting 24 h/day of operating time). For the gene synthesis, the total number of 40-mer oligos required is 250,000. The oligos would be individually collected, brought to a required concentration, and then pooled according to the gene synthesized. A laboratory would need to depend on liquid handling robotic instruments and large temperature-controlled storage spaces, making the overall process even more time consuming and costly. Since pooling of oligos is prone to operator errors and the pooled oligos may have different concentrations, the deficiency of one oligo in an assembly could cause synthesis of the entire gene to fail. Under these sub-optimal conditions, a single synthetic gene would cost about $2.00 per base pair and could require four weeks for the overall synthesis.
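The cost, time, and oligo-count figures above follow from the stated 5M bp total, as the short calculation below shows. The variable names are illustrative, and counting oligos for both strands is an assumption made here because it reproduces the stated total of 250,000 40-mers.

```python
total_bp = 5_000_000        # 5M bp genome, as stated above
throughput_bp_day = 40_000  # stated synthesis throughput
cost_cents_per_bp = 10      # $0.10/bp, held in cents to keep the arithmetic exact
oligo_len = 40              # 40-mer oligos
strands = 2                 # assumption: both strands are covered by oligos

cost_dollars = total_bp * cost_cents_per_bp / 100
days = total_bp / throughput_bp_day
n_oligos = total_bp * strands // oligo_len

print(cost_dollars)  # 500000.0
print(days)          # 125.0
print(n_oligos)      # 250000
```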
Recent advances in DNA oligo synthesis on microchips have greatly increased the throughput of oligo synthesis (Zhou et al. 2004). In this method, thousands of oligos were synthesized in parallel in a microfluidic device containing thousands of individual tiny reaction chambers. Each of the reaction chambers has a picoliter (pL) volume, and the oligos synthesized in these chambers are collected as a mixture after cleavage from the surface. The microchip-based synthesis of thousands of oligos consumes about the same amount of materials as is normally used for the synthesis of one oligo. The resultant oligo mixture is handled in a microtube, which significantly simplifies downstream uses of the oligo mixture, such as gene synthesis. This microchip oligo mixture approach was used to construct a full-length green fluorescent protein (GFP) gene 714 bp in length by ligation (Zhou et al. 2004). Alternatively, a method using separate PCR reactions of the oligo mixture, followed by removal of the primer sequences through restriction enzyme cleavage and overlapping PCR of the amplified oligo duplexes, produced 21 genes encoding E. coli 30S ribosomal proteins, totaling 14.6 kbps in length (Tian et al. 2004). Purification by hybridization of the amplified oligos resulted in a nine-fold enhancement of fidelity (Tian et al. 2004).
The method of gene synthesis described above overcomes some of the problems associated with slow and expensive oligo synthesis, but it is still not suitable for simultaneous assembling of a large number of genes or DNA fragments. The correct assembling of a full-length gene or a DNA fragment requires the correct annealing of its component oligos. These oligos are usually 30-50 residues long, and thus for a 1 kbp duplex, more than 40 oligos are required. This is a high-order reaction of n components (n = number of oligos), and the chance of failure in full-length gene assembly depends on the size of "n". When n = 40 or greater, the chance of failure is high. In high-throughput gene synthesis, multiple genes, and thus hundreds or more oligos, are to be assembled simultaneously. Since the oligos are highly cross-reactive in inter-strand base pairing and in the formation of intra-strand structures, the chances of gene synthesis failure due to the high-order reaction and oligo cross interactions increase dramatically as n increases. There are no examples of simultaneous assembling of more than ten genes or long DNA fragments.
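The dependence of assembly failure on n can be illustrated with a deliberately simple model. The sketch below assumes, purely for illustration, that each oligo is incorporated correctly with the same independent probability p, so that a full-length product forms with probability p^n; neither the independence assumption nor the value p = 0.95 comes from the text, and cross-reactivity between oligos would make the real situation worse.

```python
def full_length_probability(n_oligos, per_oligo_success):
    """Probability that all n oligos assemble correctly.

    Assumes independent, identical per-oligo success (an illustrative model).
    """
    return per_oligo_success ** n_oligos


p = 0.95  # hypothetical per-oligo success probability
for n in (10, 40, 400):
    failure = 1 - full_length_probability(n, p)
    print(f"n={n}: failure probability {failure:.3f}")
```

Even with a high per-oligo success rate, the failure probability under this model approaches one as n grows into the hundreds, which is consistent with the scaling problem described above.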
Several enzymatic reactions are useful for making long nucleic acids, including ligation, gap filling (where gap filling may be part of the ligation reaction), chain extension, and PCR. The ligation reaction involves ligases, such as the DNA ligases Taq ligase, T4 ligase, and T7 ligase and the RNA ligase T4 RNA ligase, which join the 5′-phosphate and the 3′-OH of oligos by forming a phosphodiester internucleotide bond. In one form of ligation, single-stranded nucleic acid sequences or blunt-end duplexes are ligated. In another form of ligation, the joining oligos (ligation oligos) are hybridized to a complementary strand (template strand) to form a duplex containing nicking sites, so that the 5′-phosphate and the 3′-OH groups are positioned in close proximity and ligated. In yet another form of ligation, two or more duplexes, each formed from a pair of partially overlapping oligos, hybridize to form a duplex containing consecutive overlapping oligo pairs with adhesive ends. The duplex contains two or more nicking sites, so that the 5′-phosphate and the 3′-OH groups are positioned in close proximity and ligated. In the ligation of duplex forms, the efficiency of the ligation reactions is determined by the complementary base pairing (A pairing with T and C pairing with G) of both ligation oligos to the template strand, since ligation is favored by a stable duplex structure at the enzyme reaction site. This base pairing requirement has been explored in the detection of specific genomic or RNA sequences and in DNA sequence variation analysis, where a change in sequence, such as an A to G mutation, can be detected by creating a ligation site at the mutation site: the ligation product forms in the presence of a template strand containing a C, but not a T, at the site of mutation. These ligation-based methods have been widely used for SNP detection and haplotyping of the human genome and for identification of specific genes in gene expression profiling (Landegren et al. 1988; Nickerson et al. 1990; Bibikova et al. 2004; Fan et al. 2004). These applications share a general theme, which is to perform ligation in solution in the presence of a template strand, followed by detection of the ligation products through their hybridization to probes on a surface and reading of fluorescence, chemical luminescence, or other types of detectable signals. An alternative to solution ligation is ligation on the surface of optical thin film biosensor arrays, to which sequence-specific probes corresponding to a genotype are attached (Zhong et al. 2003). This experiment demonstrated the positive ligation of the correct genotype using perfectly matched oligos. The advantage of ligation-based genetic analysis is enhanced sequence specificity compared to hybridization-based genetic analysis. These methods require pre-synthesized oligos, and thus large-scale experiments suffer from the same limitation as discussed for oligo-based large DNA synthesis.
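The template-directed discrimination described above can be illustrated with an idealized, all-or-nothing model in which two adjacent oligos are ligated only if both are perfectly complementary to the template. Real ligases are not perfectly selective, so this is a sketch under a simplifying assumption; the sequences and function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence written 5'->3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def would_ligate(template, upstream, downstream):
    """Idealized rule: ligate only if the joined oligos pair perfectly with the template.

    upstream contributes the 3'-OH and downstream the 5'-phosphate; both are 5'->3'.
    """
    return reverse_complement(upstream + downstream) in template


# Hypothetical probes: the 3'-terminal base of the upstream oligo sits at the variant site.
upstream_G = "ACGTTGCAG"   # matches the mutant (G) allele
upstream_A = "ACGTTGCAA"   # matches the original (A) allele
downstream = "TCCATGCA"

template_C = reverse_complement(upstream_G + downstream)  # carries C opposite the G
template_T = reverse_complement(upstream_A + downstream)  # carries T opposite the A

print(would_ligate(template_C, upstream_G, downstream))  # True
print(would_ligate(template_T, upstream_G, downstream))  # False
```

Placing the variant base at the ligation junction is what makes the presence or absence of a ligation product report on the template sequence.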
Oligo-based applications, such as gene synthesis and ligation for detection and quantitation in genetic analysis, are affected by the quality of the oligos used. Impure oligos are those that contain incorrect sequences and/or incorrect lengths. These impure oligos cause low-fidelity gene synthesis, limit the lengths of DNA that can be synthesized, distort quantitative analysis, and even produce false positive or false negative results. Although conventional oligo synthesis gives a high stepwise yield, which is in general greater than 98.5%, the misincorporation of nucleotides (substitution) as well as deletion and insertion are frequently observed at a rate as high as 1/160 bp (Tian et al. 2004). At such an error rate, long DNA (longer than 1 kbp) cannot be assembled at a sufficiently high efficiency. It would then demand large-scale sequencing in order to fish out the correct full-length sequences among the many error-containing sequences. Although most of the prior methods of gene assembly have used oligos without purification, several methods have been shown to improve the quality of ligation-based applications: (a) Computer-aided design of oligo sequences, used to minimize incorrect hybridization and to optimize the lengths, the sequence composition, the balanced duplex stability, which is measured by melting temperature (Tm), and other physicochemical parameters of the oligos (Rouillard et al. 2003). (b) Affinity purification of oligos by hybridization to complementary strands (Zhou et al. 2004; Tian et al. 2004), where ligation oligos hybridize to complementary strands immobilized on a surface and the error-containing sequences are washed off since they form less stable duplexes. (c) Enzymatic recognition and/or digestion of error-containing DNA sequences, such as endonuclease cleavage of mismatch, bulge, and loop sequences. Examples of the enzymes that can recognize non-complementary nucleic acid duplexes include T7 endonuclease I, T4 endonuclease VII, mutS/mutY/mutL mismatch binding and repair proteins, and single strand binding proteins. (d) Chemical degradation of error-containing DNA sequences. Many organic and inorganic molecules bind to nucleic acids and are capable of inducing cleavages in them (Gao and Han, 2001). (e) Use of a purification tag incorporated in the synthesis of oligos to separate correct from error-containing oligos. Examples of the purification tags include biotin (binding to avidin or streptavidin), thiol (formation of a disulfide bond and binding to gold), and other types of molecular moieties that allow separation based on binding affinity, charge, or size between the correct and error-containing oligos. (f) Chromatographic separation of the correct and error-containing sequences, such as DHPLC (Mulligan and Tabone (2003) U.S. Pat. No. 6,664,112).
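The effect of the error rate and stepwise yield quoted above can be estimated with a simple independence model. The sketch below assumes that errors occur independently at each position and that a 40-mer requires 39 coupling steps; both are simplifying assumptions made for illustration, and the function names are hypothetical.

```python
def error_free_fraction(length_nt, error_rate_per_nt):
    """Fraction of molecules with no error, assuming independent per-position errors."""
    return (1 - error_rate_per_nt) ** length_nt


def full_length_fraction(oligo_len, stepwise_yield):
    """Fraction of chains reaching full length after (oligo_len - 1) couplings."""
    return stepwise_yield ** (oligo_len - 1)


# 1 kbp product at the 1/160 error rate quoted above.
print(f"{error_free_fraction(1000, 1 / 160):.4f}")
# 40-mer at the 98.5% stepwise yield quoted above.
print(f"{full_length_fraction(40, 0.985):.2f}")
```

Under these assumptions only a fraction of a percent of 1 kbp molecules would be error-free, which is why the purification and error-removal approaches listed in (a) through (f) matter.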
There are many applications involving the use of synthetic DNAs, such as making RNA of defined sequences by in vitro transcription, or making protein or peptide libraries. Again, historically, the processes of generating these RNA transcripts or protein products by design from DNA of defined sequences have been carried out one at a time, and thus the making of these biologically important molecules is slow and expensive. It is not a common practice to take advantage of ready-to-use synthetic RNA or protein molecules.
The methods of the present invention overcome the limitations of the prior art methods of gene assembly and provide fast, efficient and cost effective methods for producing one or more oligos or polynucleotides of desired lengths and sequences that can be used in a variety of applications.