In general, gene synthesis refers to technology for synthesizing long nucleic acid fragments of 200 base pairs (bp) or more that carry genetic information, starting from oligonucleotides, which are short nucleic acid fragments. This requires software for designing oligonucleotides for gene synthesis, oligonucleotide synthesis technology, and technology for assembling genes from oligonucleotides. Common oligonucleotide synthesis methods include solid-phase oligo synthesis and oligo synthesis using a DNA microarray. Methods of assembling oligonucleotides may be broadly classified into three types, namely, assembly PCR, fusion PCR, and ligase chain reaction (LCR) followed by fusion PCR. Synthesized genes must be sequence-verified to find errors introduced during synthesis and assembly of the oligonucleotides and to select only nucleic acid fragments having the exact genetic information.
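The oligonucleotide design step mentioned above can be illustrated with a minimal sketch. It assumes a simple tiling of the gene into fixed-length oligonucleotides with a fixed overlap and alternating strands, a layout commonly used for assembly PCR. The function names, the 60-nt oligo length, and the 20-nt overlap are hypothetical choices for illustration; practical design software also balances melting temperatures and avoids secondary structure, which is omitted here.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def design_oligos(gene, oligo_len=60, overlap=20):
    """Tile a gene into overlapping oligos on alternating strands.

    Hypothetical helper: returns a list of (start, end, strand, sequence)
    tuples, where start/end are 0-based, end-exclusive coordinates on the
    sense strand of the gene.
    """
    if not gene:
        raise ValueError("gene must not be empty")
    if not 0 < overlap < oligo_len:
        raise ValueError("overlap must be between 0 and oligo_len")

    step = oligo_len - overlap  # distance between consecutive oligo starts
    oligos = []
    start = 0
    index = 0
    while True:
        end = min(start + oligo_len, len(gene))
        segment = gene[start:end]
        # Even-numbered oligos are sense strand, odd-numbered are antisense,
        # so that neighbouring oligos can anneal through their overlap.
        strand = "+" if index % 2 == 0 else "-"
        seq = segment if strand == "+" else reverse_complement(segment)
        oligos.append((start, end, strand, seq))
        if end == len(gene):
            break
        start += step
        index += 1
    return oligos
```

In this layout each oligonucleotide shares its overlap region with its neighbour on the opposite strand, and the last oligonucleotide may be shorter than the others.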
Conventional gene synthesis has been performed by dividing the exact nucleic acid base sequence of a gene into a variety of short oligonucleotides, synthesizing and assembling those oligonucleotides, and then selectively retrieving genes having the exact base sequence by evaluating them through Sanger sequencing (Mol Biosyst. 2009 July; 5(7):714-22. doi: 10.1039/b822268c. Epub 2009 Apr. 6). However, despite the development of various assembly technologies, such a method is limited by the absence of a suitable sequencing technology. Recently, thanks to the development of a variety of next-generation sequencing technologies (for example, Illumina technology, Ion Torrent technology, and 454 technology), the amount of processed sequence information has been increasing exponentially and analysis costs have been gradually falling (Carr, P. A. and Church, G. M. (2009) Genome engineering. Nat. Biotechnol., 27, 1151-1162). Although the development of next-generation sequencing (NGS) methods made high-throughput verification of short oligonucleotides possible, NGS could not be used effectively in the final evaluation step after synthesis is complete because of its inherently short read length. Because the length of nucleic acid base sequence that can be analyzed in one batch is short, a synthesized gene must go through a random fragmentation or random shearing process, in which it is divided into short fragments again, before it is analyzed on a next-generation sequencer. The sequences obtained from the sequencer are then analyzed, and computer software uses the result to assemble the DNA fragments into whole gene sequences.
Such a process has the limitation that it is difficult to determine which fragments the errors occurring during gene synthesis, and the sequenced nucleic acids, are derived from. In addition, when the synthesized gene is not long and the analyzed gene library contains few kinds of genes, analyzing the synthesized gene by next-generation sequencing is not economical. As such, utilization of next-generation sequencing in gene synthesis is extremely limited.
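The loss of origin information described above can be illustrated with a minimal sketch. It assumes idealized, error-free reads of fixed length taken at uniformly random positions from a pool of synthesized molecules. The function names, the read length, and the toy sequences are hypothetical and are not taken from any sequencing toolkit.

```python
import random


def shear_pool(molecules, read_len=8, reads_per_molecule=20, seed=0):
    """Simulate random shearing of a pool of molecules into short reads.

    Hypothetical helper: the returned reads are shuffled and carry no
    record of the molecule they came from, as after pooled shearing.
    """
    rng = random.Random(seed)
    reads = []
    for mol in molecules:
        if len(mol) < read_len:
            raise ValueError("molecule shorter than read length")
        for _ in range(reads_per_molecule):
            start = rng.randrange(len(mol) - read_len + 1)
            reads.append(mol[start:start + read_len])
    rng.shuffle(reads)
    return reads


def compatible_sources(read, molecules):
    """Return indices of all molecules that contain the read exactly."""
    return [i for i, mol in enumerate(molecules) if read in mol]


# Toy pool: one correct gene and one variant with a single synthesis error.
correct = "ATGGCTAGCAAGGAGGTTCTGACC"
variant = correct[:20] + "T" + correct[21:]  # G -> T at position 20
pool = [correct, variant]

reads = shear_pool(pool)
# Reads that do not span the error position match both molecules, so the
# molecule they originated from cannot be recovered from the read alone.
ambiguous = [r for r in reads if len(compatible_sources(r, pool)) > 1]
```

Only reads that happen to span the error position can be attributed to a single molecule in this toy pool; every other read is compatible with both the correct gene and the erroneous variant.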
Broadly understanding the correlation between the phenotypes and genotypes of proteins is a very important research subject in protein engineering and biosynthetic pathway engineering. In practice, research to determine the correlation between phenotypes and genotypes in an engineered sequence has been performed continuously after engineering a promoter (Patwardhan R P, Lee C, Litvin O, Young D L, Pe'er D, Shendure J. Nature Biotechnology, 27, 1173-1175 (2009)), a short peptide (Whitehead T A, Chevalier A, Song Y, Dreyfus C, Fleishman S J, De Mattos C, Myers C A, Kamisetty H, Blair P, Wilson I A, Baker D. Nature Biotechnology, 30, 543-548 (2012)), or a complementarity determining region of a single chain antibody (DeKosky B J, Ippolito G C, Deschner R P, Lavinder J J, Wine Y, Rawlings B M, Varadarajan N, Giesecke C, Dorner T, Andrews S F, Wilson P C, Hunicke-Smith S P, Willson C G, Ellington A D, Georgiou G. Nature Biotechnology, 31, 166-169 (2013); Larman H B, Xu G J, Pavlova N N, Elledge S J. PNAS, 109, 18523-18528 (2012)). However, because of the short read length of next-generation sequencing, such research does not generally target the entire region of a protein, and only a domain region shorter than the read length can be engineered. To engineer the entire region of a protein, the library must be sequenced through Sanger sequencing, or the next-generation sequencing information (short reads) must be reassembled. The former is very inefficient because it is time-consuming, laborious, and costly. The latter is not feasible with currently known methods.