Throughout this application, various publications are referenced in parentheses by author and year. Full citations for these references may be found at the end of the specification immediately preceding the claims. The disclosures of these publications in their entireties are hereby incorporated by reference into this application to more fully describe the state of the art to which this invention pertains.
The ability to sequence deoxyribonucleic acid (DNA) accurately and rapidly is revolutionizing biology and medicine. The massive Human Genome Project is driving exponential growth in the development of high throughput genetic analysis technologies. This rapid technological development involving chemistry, engineering, biology, and computer science makes it possible to move from studying single genes at a time to analyzing and comparing entire genomes.
With the completion of the first entire human genome sequence map, many areas in the genome that are highly polymorphic in both exons and introns will be known. The pharmacogenomics challenge is to comprehensively identify the genes and functional polymorphisms associated with the variability in drug response (Roses, 2000). Resequencing of polymorphic areas in the genome that are linked to disease development will contribute greatly to the understanding of diseases, such as cancer, and therapeutic development. Thus, high-throughput accurate methods for resequencing the highly variable intron/exon regions of the genome are needed in order to explore the full potential of the complete human genome sequence map. The current state-of-the-art technology for high throughput DNA sequencing, such as that used for the Human Genome Project (Pennisi 2000), is capillary array DNA sequencers using laser induced fluorescence detection (Smith et al., 1986; Ju et al. 1995, 1996; Kheterpal et al. 1996; Salas-Solano et al. 1998). Improvements in the polymerase that lead to uniform termination efficiency and the introduction of thermostable polymerases have also significantly improved the quality of sequencing data (Tabor and Richardson, 1987, 1995). Although capillary array DNA sequencing technology to some extent addresses the throughput and read length requirements of large scale DNA sequencing projects, the throughput and accuracy required for mutation studies need to be improved for a wide variety of applications ranging from disease gene discovery to forensic identification. For example, electrophoresis based DNA sequencing methods have difficulty detecting heterozygotes unambiguously and are not 100% accurate in regions rich in nucleotides comprising guanine or cytosine due to compressions (Bowling et al. 1991; Yamakawa et al. 1997).
In addition, the first few bases after the priming site are often masked by the high fluorescence signal from excess dye-labeled primers or dye-labeled terminators, and are therefore difficult to identify. Therefore, the requirement of electrophoresis for DNA sequencing is still the bottleneck for high-throughput DNA sequencing and mutation detection projects.
The concept of sequencing DNA by synthesis without using electrophoresis was first revealed in 1988 (Hyman, 1988) and involves detecting the identity of each nucleotide as it is incorporated into the growing strand of DNA in a polymerase reaction. Such a scheme coupled with the chip format and laser-induced fluorescent detection has the potential to markedly increase the throughput of DNA sequencing projects. Consequently, several groups have investigated such a system with an aim to construct an ultra high-throughput DNA sequencing procedure (Cheeseman 1994, Metzker et al. 1994). Thus far, no complete success of using such a system to unambiguously sequence DNA has been reported. The pyrosequencing approach that employs four natural nucleotides (comprising a base of adenine (A), cytosine (C), guanine (G), or thymine (T)) and several other enzymes for sequencing DNA by synthesis is now widely used for mutation detection (Ronaghi 1998). In this approach, the detection is based on the pyrophosphate (PPi) released during the DNA polymerase reaction, the quantitative conversion of pyrophosphate to adenosine triphosphate (ATP) by sulfurylase, and the subsequent production of visible light by firefly luciferase. This procedure can only sequence up to 30 base pairs (bps) of nucleotide sequences, and each of the 4 nucleotides needs to be added separately and detected separately. Long stretches of the same bases cannot be identified unambiguously with the pyrosequencing method.
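The homopolymer limitation noted above can be illustrated with a minimal sketch. It assumes an idealized pyrosequencing reaction in which each dispensed nucleotide is incorporated at every consecutive complementary template position and the light signal is exactly proportional to the number of incorporations. The function name, the dispensation order, and the example template are hypothetical and are not taken from the cited work.

```python
# Idealized model only: real light signals are noisy, which is why a long run
# of identical bases (for example 7 versus 8) is hard to call unambiguously.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def pyrosequencing_flowgram(template, dispensation_order="ACGT", n_flows=8):
    """Return (nucleotide, incorporation_count) for each nucleotide flow.

    `template` is written in the order the polymerase reads it (3' to 5').
    The count stands in for the idealized light intensity of that flow.
    """
    position = 0
    flowgram = []
    for flow in range(n_flows):
        nucleotide = dispensation_order[flow % len(dispensation_order)]
        count = 0
        # A natural nucleotide has a free 3'-OH, so the polymerase keeps
        # extending through the whole run of matching template bases.
        while (position < len(template)
               and COMPLEMENT[template[position]] == nucleotide):
            count += 1
            position += 1
        flowgram.append((nucleotide, count))
    return flowgram


if __name__ == "__main__":
    for nucleotide, count in pyrosequencing_flowgram("TTTTG", n_flows=4):
        print(nucleotide, count)
```

In this model the four consecutive T bases in the template produce a single flow with a count of 4, so the length of the run must be inferred from signal intensity alone rather than from four separate detection events.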
More recent work in the literature exploring DNA sequencing by a synthesis method is mostly focused on designing and synthesizing a photocleavable chemical moiety that is linked to a fluorescent dye to cap the 3′-OH group of deoxynucleoside triphosphates (dNTPs) (Welch et al. 1999). Limited success for the incorporation of the 3′-modified nucleotide by DNA polymerase has been reported. The reason is that the 3′-position on the deoxyribose is very close to the amino acid residues in the active site of the polymerase, and the polymerase is therefore sensitive to modification in this area of the deoxyribose ring. On the other hand, it is known that modified DNA polymerases (Thermo Sequenase and Taq FS polymerase) are able to recognize nucleotides with extensive modifications with bulky groups such as energy transfer dyes at the 5-position of the pyrimidines (T and C) and at the 7-position of purines (G and A) (Rosenblum et al. 1997; Zhu et al. 1994). The structure of the ternary complex of rat DNA polymerase, a DNA template-primer, and dideoxycytidine triphosphate (ddCTP) has been determined (Pelletier et al. 1994), which supports this fact. As shown in FIG. 1, the 3-D structure indicates that the surrounding area of the 3′-position of the deoxyribose ring in ddCTP is very crowded, while there is ample space for modification at the 5-position of the cytidine base.
The approach disclosed in the present application is to make nucleotide analogues by linking a unique label such as a fluorescent dye or a mass tag through a cleavable linker to the nucleotide base or an analogue of the nucleotide base, such as to the 5-position of the pyrimidines (T and C) and to the 7-position of the purines (G and A), to use a small cleavable chemical moiety to cap the 3′-OH group of the deoxyribose to make it nonreactive and to incorporate the nucleotide analogues into the growing DNA strand as terminators. Detection of the unique label will yield the sequence identity of the nucleotide. Upon removing the label and the 3′-OH capping group, the polymerase reaction will proceed to incorporate the next nucleotide analogue and detect the next base.
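The cycle described above can be sketched as a short simulation. It assumes an idealized polymerase that always incorporates the single correct analogue and cleavage steps that always go to completion. The label names LABEL1 through LABEL4 follow the naming used for the nucleotide set in this application; the function and variable names are hypothetical.

```python
# Idealized model: one capped, labeled analogue is incorporated per cycle.
LABEL_FOR_BASE = {"A": "LABEL1", "C": "LABEL2", "G": "LABEL3", "T": "LABEL4"}
BASE_FOR_LABEL = {label: base for base, label in LABEL_FOR_BASE.items()}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def sequence_by_synthesis(template, cycles=None):
    """Simulate incorporate / detect / cleave cycles on one template.

    `template` is written in the order the polymerase reads it (3' to 5').
    Returns the labels detected per cycle and the inferred extension strand.
    """
    n_cycles = len(template) if cycles is None else min(cycles, len(template))
    detected_labels = []
    for position in range(n_cycles):
        # Step 1: the polymerase adds the complementary analogue. Its capped
        # 3'-OH makes it a terminator, so exactly one base is added.
        incorporated_base = COMPLEMENT[template[position]]
        # Step 2: the unique label on that analogue is detected.
        detected_labels.append(LABEL_FOR_BASE[incorporated_base])
        # Step 3: the label and the 3'-OH cap are removed, which frees the
        # strand for the next cycle (modelled here by moving to the next
        # template position).
    extension = "".join(BASE_FOR_LABEL[label] for label in detected_labels)
    return detected_labels, extension


if __name__ == "__main__":
    labels, extension = sequence_by_synthesis("TTAGC")
    print(labels)
    print(extension)
```

Because each analogue acts as a terminator in this model, a run of identical template bases is read as one detection event per base rather than as a single combined signal.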
It is also desirable to use a photocleavable group to cap the 3′-OH group. However, a photocleavable group is generally bulky, and thus the DNA polymerase will have difficulty incorporating nucleotide analogues containing a photocleavable moiety capping the 3′-OH group. If small chemical moieties that can be easily cleaved chemically with high yield can be used to cap the 3′-OH group, such nucleotide analogues should also be recognized as substrates for DNA polymerase. It has been reported that 3′-O-methoxy-deoxynucleotides are good substrates for several polymerases (Axelrod et al. 1978). 3′-O-allyl-dATP was also shown to be incorporated by Ventr(exo-) DNA polymerase in the growing strand of DNA (Metzker et al. 1994). However, the procedure to chemically cleave the methoxy group is stringent and requires anhydrous conditions. Thus, it is not practical to use a methoxy group to cap the 3′-OH group for sequencing DNA by synthesis. An ester group was also explored to cap the 3′-OH group of the nucleotide, but it was shown to be cleaved by the nucleophiles in the active site of DNA polymerase (Canard et al. 1995). Chemical groups with electrophiles such as ketone groups are not suitable for protecting the 3′-OH of the nucleotide in enzymatic reactions due to the existence of strong nucleophiles in the polymerase. It is known that MOM (—CH2OCH3) and allyl (—CH2CH═CH2) groups can be used to cap an —OH group, and can be cleaved chemically with high yield (Ireland et al. 1986; Kamal et al. 1999). The approach disclosed in the present application is to incorporate nucleotide analogues, which are labeled with cleavable, unique labels such as fluorescent dyes or mass tags and in which the 3′-OH is capped with a cleavable chemical moiety such as either a MOM group (—CH2OCH3) or an allyl group (—CH2CH═CH2), into the growing strand of DNA as terminators.
The optimized nucleotide set (3′-RO-A-LABEL1, 3′-RO-C-LABEL2, 3′-RO-G-LABEL3, 3′-RO-T-LABEL4, where R denotes the chemical group used to cap the 3′-OH) can then be used for DNA sequencing by the synthesis approach.
There are many advantages of using mass spectrometry (MS) to detect small and stable molecules. For example, the mass resolution can be as good as one dalton. Gel electrophoresis sequencing systems and the laser induced fluorescence detection approach have overlapping fluorescence emission spectra, which makes heterozygote detection difficult. By comparison, the MS approach disclosed in this application produces very high resolution sequencing data by detecting the cleaved small mass tags instead of the long DNA fragment. This method also produces extremely fast separation on the time scale of microseconds. The high resolution allows accurate digital mutation and heterozygote detection. Another advantage of sequencing with mass spectrometry by detecting the small mass tags is that the compressions associated with gel based systems are completely eliminated.
In order to maintain a continuous hybridized primer extension product with the template DNA, a primer that contains a stable loop to form an entity capable of self-priming in a polymerase reaction can be ligated to the 3′ end of each single stranded DNA template that is immobilized on a solid surface such as a chip. This approach will solve the problem of washing off the growing extension products in each cycle.
Saxon and Bertozzi (2000) developed an elegant and highly specific coupling chemistry linking a specific group that contains a phosphine moiety to an azido group on the surface of a biological cell. In the present application, this coupling chemistry is adopted to create a solid surface which is coated with a covalently linked phosphine moiety, and to generate polymerase chain reaction (PCR) products that contain an azido group at the 5′ end for specific coupling of the DNA template with the solid surface. One example of a solid surface is glass channels which have an inner wall with an uneven or porous surface to increase the surface area. Another example is a chip.
The present application discloses a novel and advantageous system for DNA sequencing by the synthesis approach which employs a stable DNA template, which is able to self prime for the polymerase reaction, covalently linked to a solid surface such as a chip, and 4 unique nucleotide analogues (3′-RO-A-LABEL1, 3′-RO-C-LABEL2, 3′-RO-G-LABEL3, 3′-RO-T-LABEL4). The success of this novel system will allow the development of an ultra high-throughput and high fidelity DNA sequencing system for polymorphism and pharmacogenetics applications and for whole genome sequencing. This fast and accurate DNA resequencing system is needed in such fields as detection of single nucleotide polymorphisms (SNPs) (Chee et al. 1996), serial analysis of gene expression (SAGE) (Velculescu et al. 1995), identification in forensics, and genetic disease association studies.