The present invention relates generally to organic chemistry, analytical chemistry, biochemistry, molecular biology, genetics, diagnostics and medicine. More specifically, it relates to a method for detecting polymorphisms, in particular single nucleotide polymorphisms, in polynucleotides.
The following is offered as background information only and is not intended nor admitted to be prior art to the present invention.
DNA is the carrier of the genetic information of all living cells. An organism's genetic and physical characteristics, its genotype and phenotype, respectively, are controlled by precise nucleic acid sequences in the organism's DNA. The sum total of all of the sequence information present in an organism's DNA is termed the organism's “genome.” The nucleic acid sequence of a DNA molecule consists of a linear polymer of four “nucleotides.” The four nucleotides are tripartite molecules, each consisting of (1) one of the four heterocyclic bases, adenine (abbreviated “A”), cytosine (“C”), guanine (“G”) and thymine (“T”); (2) the pentose sugar derivative 2-deoxyribose which is bonded by its 1-carbon atom to a ring nitrogen atom of the heterocyclic bases; and (3) a monophosphate monoester formed between a phosphoric acid molecule and the 5′-hydroxy group of the sugar moiety. The nucleotides polymerize by the formation of diesters between the 5′-phosphate of one nucleotide and the 3′-hydroxy group of another nucleotide to give a single strand of DNA. In nature, two of these single strands interact by hydrogen bonding between complementary nucleotides, A being complementary with T and C being complementary with G, to form “base-pairs,” which results in the formation of the well-known DNA “double helix” of Watson and Crick. RNA is similar to DNA except that the base thymine is replaced with uracil (“U”) and the pentose sugar is ribose itself rather than deoxyribose. In addition, RNA exists in nature predominantly as a single strand; i.e., two strands do not normally combine to form a double helix.
When referring to sequences of nucleotides in a polynucleotide, it is customary to use the abbreviation for the base; i.e., A, C, G, and T (or U) to represent the entire nucleotide containing that base. For example, a polynucleotide sequence denoted as “ACG” means that an adenine nucleotide is bonded through a phosphate ester linkage to a cytosine nucleotide that is bonded through another phosphate ester linkage to a guanine nucleotide. If the polynucleotide being described is DNA, then it is understood that “A” refers to an adenine nucleotide that contains a deoxyribose sugar. If there is any possibility of ambiguity, the “A” of a DNA molecule can be designated “deoxyA” or simply “dA.” The same is true for C and G. Since T occurs only in DNA and not RNA, there can be no ambiguity so there is no need to refer to deoxyT or dT.
As a rough approximation, it can be said that the number of genes an organism has is proportional to the organism's phenotypic complexity; i.e., the number of genome products necessary to replicate the organism and allow it to function. The human genome, presently considered one of the most complex, consists of approximately 60,000-100,000 genes and about three billion three hundred million base pairs. Each of these genes codes for an RNA, most of which in turn encodes a particular protein which performs a specific biochemical or structural function. A variance, also known as a polymorphism or mutation, in the genetic code of any one of these genes may result in the production of a gene product, usually a protein or an RNA, with altered biochemical activity or with no activity at all. This can result from as little change as an addition, deletion or substitution (transition or transversion) of a single nucleotide in the DNA comprising a particular gene, a change that is sometimes referred to as a “single nucleotide polymorphism” or “SNP.” The consequence of such a mutation in the genetic code ranges from harmless to debilitating to fatal. There are presently over 6700 human disorders believed to have a genetic component. For example, hemophilia, Alzheimer's disease, Huntington's disease, Duchenne muscular dystrophy and cystic fibrosis are known to be related to variances in the nucleotide sequence of the DNA comprising certain genes. In addition, evidence is being amassed suggesting that changes in certain DNA sequences may predispose an individual to a variety of abnormal conditions such as obesity, diabetes, cardiovascular disease, central nervous system disorders, auto-immune disorders and cancer. Variations in DNA sequence of specific genes have also been implicated in the differences observed among patients in their responses to, for example, drugs, radiation therapy, nutritional status and other medical interventions.
Thus, the ability to detect DNA sequence variances in an organism's genome is an important aspect of the inquiry into relationships between such variances and medical disorders and responses to medical interventions. Once an association has been established, the ability to detect the variance(s) in the genome of a patient can be an extremely useful diagnostic tool. It may even be possible, using early variance detection, to diagnose and potentially treat, or even prevent, a disorder before the disorder has physically manifested itself. Furthermore, variance detection can be a valuable research tool in that it may lead to the discovery of genetic bases for disorders the causes of which were hitherto unknown or thought to be other than genetic. Variance detection may also be useful for guiding the selection of an optimal therapy where there is a difference in response among patients to one or more proposed therapies.
While the benefits of being able to detect variances in the genetic code are clear, the practical aspects of doing so are daunting: it is estimated that sequence variations in human DNA occur with a frequency of about 1 in 100 nucleotides when 50 to 100 individuals are compared. Nickerson, D. A., Nature Genetics, 1998, 223-240. This translates to as many as thirty million variances in the human genome. Not all, in fact very few, of these variances have any measurable effect on the physical well-being of humans. Detecting these 30 million variances and then determining which of them are relevant to human health is clearly a formidable task.
In addition to variance detection, knowledge of the complete nucleotide sequence of an organism's genome would contribute immeasurably to the understanding of the organism's overall biology, i.e., it would lead to the identification of every gene product, its organization and arrangement in the organism's genome, the sequences required for controlling gene expression (i.e., production of each gene product) and replication. In fact, the quest for such knowledge and understanding is the raison d'être for the Human Genome Project, an international effort aimed at sequencing the entire human genome. Once the sequence of a single genome is available, whatever the organism, it then becomes useful to obtain the partial or complete sequence of other organisms of that species, particularly those organisms within the species that exhibit different characteristics, in order to identify DNA sequence differences that correlate with the different characteristics. Such different characteristics may include, for microbial organisms, pathogenicity on the negative side or the ability to produce a particular polymer or to remediate pollution on the positive side. A difference in growth rate, nutrient content or pest resistance is a potential difference that might be observed among plants. Even among human beings, a difference in disease susceptibility or response to a particular therapy might relate to a genetic, i.e., DNA sequence, variation. As a result of the enormous potential utility to be realized from DNA sequence information, in particular, identification of DNA sequence variances between individuals of the same species, the demand for rapid, inexpensive, automated DNA sequencing and variance detection procedures can be expected to increase dramatically in the future.
Once the DNA sequence of a DNA segment; e.g., a gene, a cDNA or, on a larger scale, a chromosome or an entire genome, has been determined, the existence of sequence variances in that DNA segment among members of the same species can be explored. Complete DNA sequencing is the definitive procedure for accomplishing this task. Thus, it is possible to determine the complete sequence of a copy of a DNA segment obtained from a different member of the species and simply compare that complete sequence to the one previously obtained. However, current DNA sequencing technology is costly, time consuming and, in order to achieve high levels of accuracy, must be highly redundant. Most major sequencing projects require a 5- to 10-fold coverage of each nucleotide to reach an acceptable error rate of 1 in 2,000 to 1 in 10,000 bases. In addition, DNA sequencing is an inefficient way to detect variances. For example, a variance between any two copies of a gene, for example when two chromosomes are being compared, may occur as infrequently as once in 1,000 or more bases. Thus, only a small portion of the sequence is of interest, i.e., that portion in which the variance exists. However, if full sequencing is employed, a tremendous number of nucleotides have to be sequenced to arrive at the desired information involving the aforesaid small portion. For example, consider a comparison of ten versions of a 3,000 nucleotide DNA sequence for the purpose of detecting, say, four variances among them. Even if only a 2-fold redundancy is employed (each strand of the double-stranded 3,000 nucleotide DNA segment from each individual is sequenced once), 60,000 nucleotides would have to be sequenced (10 × 3,000 × 2). In addition, it is more than likely that problem areas will be encountered in the sequencing requiring additional runs with new primers; thus, the project could engender the sequencing of as many as 100,000 nucleotides to determine four variances.
A variety of procedures have been developed over the past 15 years to identify sequence differences and to provide some information about the location of the variant sites. Using such a procedure, it might only be necessary to sequence four relatively short portions of the 3,000 nt (nucleotide) sequence. Furthermore, only a few samples would have to be sequenced in each region because each variance produces a characteristic change so, if, for example, 22 of 50 samples exhibit such a characteristic change with a variation detection procedure, then sequencing as few as four samples of the 22 would provide information on the other 18. The length of the segments that require sequencing could, depending on the variance detection procedure employed, be as short as 50-100 nt. Thus, the scale of the sequencing project could be reduced to: 4 (sites) × 50 (nt per site) × 2 (strands from each individual) × 2 (individuals per site) or only about 800 nucleotides. This amounts to about 1% of the sequencing required in the absence of a preceding variance detection step.
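The sequencing workloads compared above can be reproduced with a few lines of arithmetic. The following is only an illustrative sketch; the function names are hypothetical, and the inputs are the figures quoted in the text (ten individuals, a 3,000 nt segment, 2-fold redundancy, versus four 50 nt sites sequenced on both strands in two individuals each).

```python
def full_sequencing_load(individuals, segment_nt, redundancy):
    """Nucleotides sequenced when every copy of the segment is fully sequenced."""
    return individuals * segment_nt * redundancy


def targeted_sequencing_load(sites, nt_per_site, strands, individuals_per_site):
    """Nucleotides sequenced when only short regions around known variant sites are read."""
    return sites * nt_per_site * strands * individuals_per_site


full = full_sequencing_load(individuals=10, segment_nt=3000, redundancy=2)
targeted = targeted_sequencing_load(sites=4, nt_per_site=50, strands=2,
                                    individuals_per_site=2)

print(full)      # nucleotides for full sequencing at 2-fold redundancy
print(targeted)  # nucleotides after a variance detection step
print(f"{targeted / full:.1%} of the 2-fold full-sequencing load")
print(f"{targeted / 100_000:.1%} of the 100,000 nt worst case")
```

The two ratios bracket the "about 1%" figure: the targeted workload is a little over 1% of the 60,000 nt baseline and a little under 1% of the 100,000 nt worst case.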
As presently practiced, the technique for determining the full nucleotide sequence of a polynucleotide and that for detecting previously unknown variances or mutations in related polynucleotides end up being the same; that is, even when the issue is the presence or absence of a single nucleotide variance between related polynucleotides, the complete sequences of at least a segment of the related polynucleotides are determined and then compared.
The two classical methods for carrying out complete nucleotide sequencing are the Maxam and Gilbert chemical procedure (Proc. Nat. Acad. Sci. USA, 74, 560-564 (1977)) and the Sanger, et al., chain-terminating procedure (Proc. Nat. Acad. Sci. USA, 74, 5463-5467 (1977)).
The Maxam-Gilbert method of complete nucleotide sequencing involves end-labeling a DNA molecule with, for example, 32P, followed by one of two discrete reaction sequences involving two reactions each; i.e., four reactions overall. One of these reaction sequences involves the selective methylation of the purine nucleotides guanine (G) and adenine (A) in the polynucleotide being investigated which, in most instances, is an isolated naturally-occurring polynucleotide such as DNA. The N7 position of guanine methylates approximately five times as rapidly as the N3 position of adenine. When heated in the presence of aqueous base, the methylated bases are lost and a break in the polynucleotide chain occurs. The reaction is more effective with methylated guanine than with methylated adenine so, when the reaction product is subjected to electrophoresis on polyacrylamide gel plates, G cleavage residues are dark and A cleavage residues are light. This pattern can be reversed by using acid instead of heat to release the methylated bases. That is, using acid, the A cleavage residues show up dark on electrophoresis and the G cleavage residues show up light.
The second set of reaction sequences in the Maxam-Gilbert approach identifies cytosine and thymine cleavage residues. That is, the pyrimidine bases of which DNA is comprised, cytosine (C) and thymine (T), are, under the Maxam-Gilbert approach, differentiated by treatment of the isolated naturally-occurring polynucleotide with hydrazine which reacts equally effectively with either base except in the presence of a high salt concentration where it reacts only with cytosine. Thus, depending on the conditions used, two series of bands can be generated on electrophoresis; in low salt, both C and T will be cleaved so the bands represent C+T; in high salt only C is cleaved so the bands will show C only.
Thus, four chemical reactions followed by electrophoretic analysis of the resulting end-labeled ladder of cleavage products will reveal the exact nucleotide sequence of a DNA molecule. It is key to the Maxam-Gilbert sequencing method that only partial cleavage, on the order of 1-2% at each susceptible position, occurs. This is because electrophoresis separates fragments by size. To be meaningful, the fragments produced should, on the average, represent a single modification and cleavage per molecule. Then, when the fragments of all four reactions are aligned according to size, the exact sequence of the target DNA can be determined.
The Sanger method for determining complete nucleotide sequences consists of preparing four series of base-specifically chain-terminated, labeled DNA fragments by enzymatic polymerization. As in the Maxam-Gilbert procedure, four separate reactions can be performed or, if labeled dideoxynucleotide terminators are used, the reactions can all be carried out in the same test tube. In the Sanger method each of the four reaction mixtures contains the same oligonucleotide template (either a single- or a double-stranded DNA), the four nucleotides, A, G, C and T (one of which may be labeled), a polymerase and a primer, the polymerase and primer being present to effect the polymerization of the nucleotides into a complement of the template oligonucleotide. To one of the four reaction mixtures is added an empirically determined amount of the dideoxy derivative of one of the nucleotides. A small amount of the dideoxy derivative of one of the remaining three nucleotides is added to a second reaction mixture, and so on, resulting in four reaction mixtures each containing a different dideoxy nucleotide. The dideoxy derivatives, by virtue of their missing 3′-hydroxyl groups, terminate the enzymatic polymerization reaction upon incorporation into the nascent oligonucleotide chain. Thus, in one reaction mixture, containing, say, dideoxyadenosine triphosphate (ddATP), a series of oligonucleotide fragments is produced, all ending in ddA, which when resolved by electrophoresis produce a series of bands corresponding to the size of the fragment created up to the point that the chain-terminating ddA became incorporated into the polymerization reaction. Corresponding ladders of fragments can be obtained from each of the other reaction mixtures in which the oligonucleotide fragments end in C, G and T. The four sets of fragments create a “sequence ladder,” each rung of which represents the next nucleotide in the sequence of bases comprising the subject DNA.
Thus, the exact nucleotide sequence of the DNA can simply be read off the electrophoresis gel plate after autoradiography or, in the case of an automated DNA sequencing instrument, after computer analysis of chromatograms. As mentioned above, dye-labeled chain-terminating dideoxynucleotides and modified polymerases that efficiently incorporate modified nucleotides represent improved methods for chain-terminating sequencing.
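The logic of reading a chain-termination ladder can be sketched in idealized form. The example below is an illustration only, not a model of the chemistry: it assumes every possible termination product is present and perfectly resolved, measures fragment length from the first nucleotide added after the primer, and uses hypothetical function names.

```python
def sanger_ladders(synthesized):
    """For each terminator base, list the lengths of fragments ending in that base.

    `synthesized` is the strand built by the polymerase (the complement of the
    template), written 5' to 3'. Lengths are counted from the first added
    nucleotide; the primer length is ignored (an assumption of this sketch).
    """
    ladders = {base: [] for base in "ACGT"}
    for length, base in enumerate(synthesized, start=1):
        ladders[base].append(length)
    return ladders


def read_ladder(ladders):
    """Read the sequence by ordering all rungs from shortest to longest fragment."""
    rungs = sorted(
        (length, base) for base, lengths in ladders.items() for length in lengths
    )
    return "".join(base for _, base in rungs)


ladders = sanger_ladders("GATTACA")
print(ladders)               # fragment lengths seen in each of the four reactions
print(read_ladder(ladders))  # sequence recovered from the combined ladder
```

Each of the four lists corresponds to one reaction mixture; merging them by fragment size is the computational equivalent of reading the gel from bottom to top.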
Both the Maxam-Gilbert and Sanger procedures have their shortcomings. They are both time-consuming, labor-intensive (particularly with regard to the Maxam-Gilbert procedure which has not been automated like the Sanger procedure), expensive (e.g., the most optimized versions of the Sanger procedure require very expensive reagents) and require a fair degree of technical expertise to assure proper operation and reliable results. Furthermore, the Maxam-Gilbert procedure suffers from a lack of specificity of the modification chemistry that can result in artifactual fragments resulting in false ladder readings from the gel plate. The Sanger method, on the other hand, is susceptible to template secondary structure formation that can cause interference in the polymerization reaction. This causes terminations of the polymerization at sites of secondary structure (called “stops”) which can result in erroneous fragments appearing in the sequence ladder rendering parts of the sequence unreadable, although this problem is ameliorated by the use of dye-labeled dideoxy terminators. Furthermore, both sequencing methods are susceptible to “compressions,” another result of DNA secondary structure which can affect fragment mobility during electrophoresis thereby rendering the sequence ladder unreadable or subject to erroneous interpretation in the vicinity of the secondary structure. In addition, both methods are plagued by uneven intensity of the ladder and by non-specific background interference. These concerns are magnified when the issue is variance detection. In order to discern a single nucleotide variance, the procedure employed must be extremely accurate; a “mistake” in reading one nucleotide can result in a false positive; i.e., an indication of a variance where none exists. Neither the Maxam-Gilbert nor the Sanger procedure is capable of such accuracy in a single run.
In fact, the frequency of errors in a “one pass” sequencing experiment is equal to or greater than 1%, which is on the order of ten times the frequency of actual DNA variances when any two versions of a sequence are compared. The situation can be ameliorated somewhat by performing multiple runs (usually in the context of a “shotgun” sequencing procedure) for each polynucleotide being compared, but this simply increases cost in terms of equipment, reagents, manpower and time. The high cost of sequencing becomes even less acceptable when one considers that it is often not necessary when looking for nucleotide sequence variances among related polynucleotides to determine the complete sequence of the subject polynucleotides or even the exact nature of the variance (although, as will be seen, in some instances even this is discernible using the method of this invention); detection of the variance alone may be sufficient.
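To make the scale of the false-positive problem concrete, the sketch below compares the expected number of single-pass read errors with the expected number of true variances in one segment. It is a back-of-the-envelope illustration with a hypothetical function name; it takes the 1% error rate stated above and the variance frequency of roughly one per 1,000 bases mentioned earlier in this background, applied to a 3,000 nt segment.

```python
def expected_events(segment_nt, one_in):
    """Expected number of events in a segment when one event occurs per `one_in` bases."""
    return segment_nt / one_in


SEGMENT_NT = 3000          # example segment length used earlier in the text
ERROR_ONE_IN = 100         # single-pass error rate of about 1%
VARIANCE_ONE_IN = 1000     # true variances roughly once per 1,000 bases

read_errors = expected_events(SEGMENT_NT, ERROR_ONE_IN)
true_variances = expected_events(SEGMENT_NT, VARIANCE_ONE_IN)

print(read_errors)                   # expected miscalled bases in one pass
print(true_variances)                # expected genuine variances
print(read_errors / true_variances)  # apparent differences per real variance
```

Under these assumptions a single-pass comparison would flag about ten spurious differences for every genuine one, which is why redundant sequencing is needed before a variance call can be trusted.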
While not avoiding all of the problems associated with the Maxam-Gilbert and Sanger procedures, several techniques have been devised to at least make one or the other of the procedures more efficient. One such approach has been to develop ways to circumvent slab gel electrophoresis, one of the most time-consuming steps in the procedures. For instance, in U.S. Pat. Nos. 5,003,059 and 5,174,962, the Sanger method is employed; however, the dideoxy derivative of each of the nucleotides used to terminate the polymerization reaction is uniquely tagged with an isotope of sulfur, 32S, 33S, 34S or 36S. Once the polymerization reactions are complete, the chain terminated sequences are separated by capillary zone electrophoresis, which, compared to slab gel electrophoresis, increases resolution, reduces run time and allows analysis of very small samples. The separated chain terminated sequences are then combusted to convert the incorporated isotopic sulfur to isotopic sulfur dioxides (32SO2, 33SO2, 34SO2 and 36SO2). The isotopic sulfur dioxides are then subjected to mass spectrometry. Since each isotope of sulfur is uniquely related to one of the four sets of base-specifically chain terminated fragments, the nucleotide sequence of the subject DNA can be determined from the mass spectrogram.
Another method, disclosed in U.S. Pat. No. 5,580,733, also incorporates the Sanger technique but eliminates gel electrophoresis altogether. The method involves taking each of the four populations of base-specific chain-terminated oligonucleotides from the Sanger reactions and forming a mixture with a visible laser light absorbing matrix such as 3-hydroxypicolinic acid. The mixtures are then illuminated with visible laser light and vaporized, which occurs without further fragmentation of the chain-terminated nucleic acid fragments. The vaporized molecules that are charged are then accelerated in an electric field and the mass to charge (m/z) ratio of the ionized molecules is determined by time-of-flight mass spectrometry (TOF-MS). The molecular weights are then aligned to determine the exact sequence of the subject DNA. By measuring the mass difference between successive fragments in each of the mixtures, the lengths of fragments terminating in A, G, C or T can then be inferred. A significant limitation of current MS instruments is that polynucleotide fragments greater than 100 nucleotides in length (with many instruments, 50 nucleotides) cannot be efficiently detected in routine use, especially if the fragments are part of a complex mixture. This severe limitation on the size of fragments that can be analyzed has limited the development of polynucleotide analysis by MS. Thus, there is a need for a procedure that adapts large polynucleotides, such as DNA, to the capabilities of current MS instruments. The present invention provides such a procedure.
A further approach to nucleotide sequencing is disclosed in U.S. Pat. No. 5,547,835. Again, the starting point is the Sanger sequencing strategy. The four base specific chain-terminated series of fragments are “conditioned” by, for example, purification, cation exchange and/or mass modification. The molecular weights of the conditioned fragments are then determined by mass spectrometry and the sequence of the starting nucleic acid is determined by aligning the base specific terminated fragments according to molecular weight.
Each of the above methods involves complete Sanger sequencing of a polynucleotide prior to analysis by mass spectrometry. To detect genetic mutations; i.e., variances, the complete sequence can be compared to a known nucleotide sequence. Where the sequence is not known, comparison with the nucleotide sequence of the same DNA isolated from another of the same organisms which does not exhibit the abnormalities seen in the subject organism will likewise reveal mutations. This approach, of course, requires running the Sanger procedure twice; i.e., eight separate reactions. In addition, if a potential variance is detected, the entire procedure would in most instances be run again, sequencing the opposite strand using a different primer to make sure that a false positive had not been obtained. When the specific nucleotide variance or mutation related to a particular disorder is known, there are a wide variety of known methods for detecting a variance without complete sequencing. For instance, U.S. Pat. No. 5,605,798 describes such a method. The method involves obtaining a nucleic acid molecule containing the target sequence of interest from a biological sample, optionally amplifying the target sequence, and then hybridizing the target sequence to a detector oligonucleotide which is specifically designed to be complementary to the target sequence. Either the detector oligonucleotide or the target sequence is “conditioned” by mass modification prior to hybridization. Unhybridized detector oligonucleotide is removed and the remaining reaction product is volatilized and ionized. Detection of the detector oligonucleotide by mass spectrometry indicates the presence of the target nucleic acid sequence in the biological sample and thus confirms the diagnosis of the variance-related disorder.
Variance detection procedures can be divided into two general categories although there is a considerable degree of overlap. One category, the variance discovery procedures, is useful for examining DNA segments for the existence, location and characteristics of new variances. To accomplish this, variance discovery procedures may be combined with DNA sequencing.
The second group of procedures, variance typing (sometimes referred to as genotyping) procedures, are useful for repetitive determination of one or more nucleotides at a particular site in a DNA segment when the location of a variance or variances has previously been identified and characterized. In this type of analysis, it is often possible to design a very sensitive test of the status of a particular nucleotide or nucleotides. This technique, of course, is not well suited to the discovery of new variances.
Some of the methods that have been developed specifically for genotyping include: (1) primer extension methods in which dideoxynucleotide termination of the primer extension reaction occurs at the variant site generating extension products of different length or with different terminal nucleotides, which can then be determined by electrophoresis, mass spectrometry or fluorescence in a plate reader; (2) hybridization methods in which oligonucleotides corresponding to the two possible sequences at a variant site are attached to a solid surface and hybridized with probes from the unknown sample; (3) restriction fragment length polymorphism analysis, wherein a restriction endonuclease recognition site includes the polymorphic nucleotide in such a manner that the site is cleavable with one variant nucleotide but not another; (4) methods such as “TaqMan” involving differential hybridization and consequent differential 5′ endonuclease digestion of labeled oligonucleotide probes in which there is fluorescence resonance energy transfer (FRET) between two fluors on the probe that is abrogated by nuclease digestion of the probe; (5) other FRET-based methods involving labeled oligonucleotide probes called molecular beacons which exploit allele-specific hybridization; (6) ligation-dependent methods that require enzymatic ligation of two oligonucleotides across a polymorphic site that is perfectly matched to only one of them; and (7) allele-specific oligonucleotide priming in a polymerase chain reaction (PCR). U. Landegren, et al., 1998, Reading Bits of Genetic Information: Methods for Single-nucleotide Polymorphism Analysis, Genome Research 8(8):769-76.
When complete sequencing of large templates such as the entire genome of a virus, a bacterium or a eukaryote (i.e., higher organisms including man) or the repeated sequencing of a large DNA region or regions from different strains or individuals of a given species for purposes of comparison is desired, it becomes necessary to implement strategies for making libraries of templates for DNA sequencing. This is because conventional chain terminating sequencing (i.e., the Sanger procedure) is limited by the resolving power of the analytical procedure used to create the nucleotide ladder of the subject polynucleotide. For gels, this resolving power is approximately 500-800 nt at a time. For mass spectrometry, the limitation is the length of a polynucleotide that can be efficiently vaporized prior to detection in the instrument. Although larger fragments have been analyzed by highly specialized procedures and instrumentation, presently this limit is approximately 50-60 nt. However, in large scale sequencing projects such as the Human Genome Project, “markers” (DNA segments of known chromosomal location whose presence can be relatively easily ascertained by the polymerase chain reaction (PCR) technique and which, therefore, can be used as a point of reference for mapping new areas of the genome) are currently about 100 kilobases (kb) apart. The markers at 100 kb intervals must be connected by efficient sequencing strategies. If the analytical method used is gel electrophoresis, then to sequence a 100 kb stretch of DNA would require hundreds of sequencing reactions. A fundamental question which must be addressed is how to divide up the 100 kb segment (or whatever size is being dealt with) to optimize the process; i.e., to minimize the number of sequencing reactions and sequence assembly work necessary to generate a complete sequence with the desired level of accuracy.
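The "hundreds of sequencing reactions" estimate follows directly from the read lengths quoted above. The sketch below is a rough lower-bound calculation with a hypothetical function name; it assumes perfectly tiled, non-overlapping reads, and the `redundancy` parameter is an illustrative knob for repeated coverage rather than a figure taken from this paragraph.

```python
def reads_needed(region_nt, read_nt, redundancy=1):
    """Minimum number of reads to cover a region `redundancy` times over.

    Assumes ideal, gap-free tiling; real projects need more reads because
    fragments overlap and coverage is uneven.
    """
    total_nt = region_nt * redundancy
    return -(-total_nt // read_nt)  # integer ceiling division


REGION_NT = 100_000  # spacing between markers, 100 kb

print(reads_needed(REGION_NT, 800))  # best case with 800 nt gel reads
print(reads_needed(REGION_NT, 500))  # 500 nt gel reads
print(reads_needed(REGION_NT, 50))   # reads limited to about 50 nt, as for MS
```

Even in this idealized single-coverage case, a 100 kb interval needs well over a hundred gel-based reads, and an order of magnitude more if each read is limited to about 50 nt.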
A key issue in this regard is how to initially fragment the DNA in such a manner that the fragments, once sequenced, can be correctly reassembled to recreate the full-length target DNA. Presently, two general approaches provide both sequence-ready fragments and the information necessary to recombine the sequences into the full-length target DNA: “shotgun sequencing” (see, e.g., Venter, J. C., et al., Science, 1998, 280:1540-1542; Weber, J. L. and Myers, E. W., Genome Research, 1997, 7:401-409; Andersson, B. et al., DNA Sequence, 1997, 7:63-70) and “directed DNA sequencing” (see, e.g., Voss, H., et al., Biotechniques, 1993, 15:714-721; Kaczorowski, T., et al., Anal. Biochem., 1994, 221:127-135; Lodhi, M. A., et al., Genome Research, 1996, 6:10-18).
Shotgun sequencing involves the creation of a large library of random fragments or “clones” in a sequence-ready vector such as a plasmid or phagemid. To arrive at a library in which all portions of the original sequence are relatively equally represented, DNA that is to be shotgun sequenced is often fragmented by physical procedures such as sonication which has been shown to produce nearly random fragmentation. Clones are then selected at random from the shotgun library for sequencing. The complete sequence of the DNA is then assembled by identifying overlapping sequences in the short (approx. 500 nt) shotgun sequences. In order to assure that the entire target region of the DNA is represented among the randomly selected clones and to reduce the frequency of errors (incorrectly assigned overlaps), a high degree of sequencing redundancy is necessary; for example, 7 to 10-fold. Even with such high redundancy, additional sequencing is often required to fill gaps in the coverage. Even then, the presence of repeat sequences such as Alu (a 300 base-pair sequence which occurs in 500,000-1,000,000 copies per haploid genome) and LINES (“Long INterspersed DNA sequence Elements” which can be 7,000 bases long and may be present in as many as 100,000 copies per haploid genome), either of which may occur in different locations of multiple clones, can render DNA sequence re-assembly problematic. For instance, different members of these sequence families can be over 90% identical which can sometimes make it very difficult to determine sequence relationships on opposite sides of such repeats. Figure X illustrates the difficulties of the shotgun sequencing approach in a hypothetical 10 kb sequence modeled after the sequence reported in Martin-Gallardo, et al., Nature Genetics, (1992), 1:34-39.
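The overlap-based assembly step described above can be illustrated with a toy greedy assembler. This is a simplified sketch and not the algorithm used by any particular sequencing project: it assumes error-free reads from a single strand, requires exact suffix-to-prefix matches, does not handle reads wholly contained within other reads, and uses hypothetical function names and a made-up 14 nt example sequence.

```python
def overlap(a, b, min_len):
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of contigs with the largest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads
    while len(contigs) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    size = overlap(a, b, min_len)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # no remaining overlaps: a gap in coverage
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Three overlapping reads drawn from the toy sequence ATGCGTACGTTAGC.
reads = ["GTACGTTA", "ATGCGTAC", "GTTAGC"]
print(greedy_assemble(reads))
```

With unique overlaps the reads collapse into a single contig. If the same stretch of sequence occurred at two places, as with Alu or LINE repeats, several merges would score equally well and the greedy choice could join the wrong flanking regions, which is the re-assembly problem the text describes.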
Directed DNA sequencing, the second general approach, also entails making a library of clones, often with large inserts (e.g., cosmid, P1, PAC or BAC libraries). In this procedure, the location of the clones in the region to be sequenced is then mapped to obtain a set of clones that constitutes a minimum-overlap tiling path spanning the region to be sequenced. Clones from this minimal set are then sequenced by procedures such as “primer walking” (see, e.g., Voss, supra). In this procedure, the end of one sequence is used to select a new sequencing primer with which to begin the next sequencing reaction, the end of the second sequence is used to select the next primer and so on. The assembly of a complete DNA is easier by directed sequencing and less sequencing redundancy is required since both the order of clones and the completeness of coverage are known from the clone map. On the other hand, assembling the map itself requires significant effort. Furthermore, the speed with which new sequencing primers can be synthesized and the cost of doing so are often limiting factors with regard to primer walking. While a variety of methods for simplifying new primer construction have aided in this process (see, e.g., Kaczorowski, et al. and Lodhi, et al., supra), directed DNA sequencing remains a valuable but often expensive and slow procedure.
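The iterative character of primer walking can be sketched as a simple loop. The following is an illustration only and does not simulate any sequencing chemistry. It assumes a 20 nt primer, a 500 nt usable read beginning immediately downstream of the primer, and a first primer annealing at position 0 (for example, in known vector sequence). All names and default values are hypothetical.

```python
def primer_walk(template_len, primer_len=20, read_len=500):
    """Return a list of (primer_start, read_start, read_end) walking steps.

    Each new primer is taken from the last `primer_len` bases of the
    sequence just read, so every step depends on the previous one.
    """
    steps = []
    primer_start = 0
    while True:
        read_start = primer_start + primer_len
        if read_start >= template_len:
            break
        read_end = min(read_start + read_len, template_len)
        steps.append((primer_start, read_start, read_end))
        if read_end >= template_len:
            break
        # Choose the next primer from the end of the newly read sequence.
        primer_start = read_end - primer_len
    return steps

if __name__ == "__main__":
    for step in primer_walk(2000):
        print(step)
```

Because each step requires a newly synthesized primer, the number of steps returned gives a rough sense of why primer synthesis can limit the speed of the procedure.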
Most large-scale sequencing projects employ aspects of both shotgun sequencing and directed sequencing. For example, a detailed map might be made of a large insert library (e.g., BACs) to identify a minimal set of clones which gives complete coverage of the target region but then sequencing of each of the large inserts is carried out by a shotgun approach; e.g., fragmenting the large insert and re-cloning the fragments in a more optimal sequencing vector (see, e.g., Chen, C. N., Nucleic Acids Research, 1996, 24:4034-4041). The shotgun and directed procedures are also used in a complementary manner in which specific regions not covered by an initial shotgun experiment are subsequently determined by directed sequencing.
Thus, there are significant limitations to both the shotgun and directed sequencing approaches to complete sequencing of large molecules such as that required in genomic DNA sequencing projects. However, both procedures would benefit if the usable read length of contiguous DNA was expanded from the current 500-800 nt which can be effectively sequenced by the Sanger method. For example, directed sequencing could be significantly improved by reducing the need for high resolution maps, which could be achieved by longer read lengths, which in turn would permit greater distances between landmarks.
A major limitation of current sequencing procedures is the high error rate (Kristensen, T., et al., DNA Sequencing, 2:243-346, 1992; Kurshid, F. and Beck, S., Analytical Biochemistry, 208:138-143, 1993; Fichant, G. A. and Quentin, Y., Nucleic Acid Research, 23:2900-2908, 1995). It is well known that many of the errors associated with the Maxam-Gilbert and Sanger procedures are systematic; i.e., the errors are not random; rather, they occur repeatedly. To avoid this, two mechanistically different sequencing methods may be used so that the systematic errors in one may be detected and thus corrected by the second and vice versa. Since a significant fraction of the cost of current sequencing methods is associated with the need for high redundancy to reduce sequencing errors, the use of two procedures can reduce the overall cost of obtaining highly accurate DNA sequence.
The production and/or chemical cleavage of polynucleotides composed of ribonucleotides and deoxyribonucleotides has been previously described. In particular, mutant polymerases that incorporate both ribonucleotides and deoxyribonucleotides into a polynucleotide have been described. Production of mixed ribo- and deoxyribo-containing polynucleotides by polymerization has been described; and generation of sequence ladders from such mixed polynucleotides, exploiting the well known lability of the ribo sugar to chemical base, has been described.
The use of such procedures, however, has been limited to cases in which: (i) one ribonucleotide and three deoxyribonucleotides are incorporated into the polynucleotide; (ii) cleavage at ribonucleotides is effected using chemical base; (iii) only partial cleavage of the ribonucleotide-containing polynucleotides is pursued; and (iv) the utility of the procedure is confined to production of sequence ladders, which are resolved electrophoretically.
In addition, the chemical synthesis of polynucleotide primers containing a single ribonucleotide, which at a subsequent step is substantially completely cleaved by chemical base, has been reported. The size of a primer extension product is then determined by mass spectrometry or other methods.
An extension of nucleic acid sequence determination is the rapid identification of polymorphisms or sequence variations within polynucleotide regions. Assays for single nucleotide polymorphisms, SNPs, attempt to discriminate between two DNA sequences that differ at a single base position. Hybridization based methods for accomplishing this take advantage of the fact that a probe sequence that is exactly complementary to a test sequence will hybridize stringently in a “perfectly matched” duplex, whereas a probe/test sequence duplex containing one or more mis-matched base-pairs will either not hybridize at all or will hybridize less stringently. Thus, if a probe sequence were complementary to the sequence of one allele of a SNP, the probe would be expected to hybridize more stringently to that allele than to an alternate allele, which carries a single-base mis-match relative to the probe.
A number of different nucleic acid hybridization assays have been described which utilize solid supports. One such group of assays involves oligonucleotide probes that are attached to a solid matrix, such as a microchip, capillary tube, glass-slide or microbead. This method is the subject of U.S. Pat. Nos. 5,858,659; 5,981,176; 6,045,996; 5,578,458 and 5,759,779. SNPs are detected by the difference in stringency of hybridization of the probe to sequences that include the SNP compared to sequences that do not.
An alternative approach to the above is to immobilize the test sequence and then bring it into contact with “free” labeled probe in which case the probe will only hybridize (or will more stringently hybridize) with those test sequences that form a “perfectly matched duplex” with it.
Another method for SNP detection uses immobilized oligonucleotide primers, allele specific hybridization, and polymerase extension in the presence of one or more dideoxynucleotide terminating nucleotides. In this method, the dideoxynucleotide terminator is labeled to detect the sequence polymorphic differences in test nucleotide sequence samples and is the subject of U.S. Pat. Nos. 5,610,287 and 6,030,782.
A still further approach to SNP detection using a solid support involves allele-specific amplification, in which primers are designed to specifically amplify sequence variations within the test sequence samples. In this method, detection is achieved by immobilizing either the amplified nucleotide fragments or an allele-specific oligonucleotide probe, either of which is labeled for ease of detection.
The above methods suffer from difficulties in allele-specific amplification by PCR. These include (i) the inherent limitations of PCR with regard to length of amplification product and background; (ii) primer extension from a mismatched primer-template complex, as a result of which the non-matching allele is amplified along with the primer-matching allele; and, (iii) because different DNA samples will be heterozygous at different combinations of nucleotides, the need to establish different primers and assay conditions for each pair of polymorphic sites that are to be identified.
Typically, for a SNP assay, the test sequence containing the polymorphic site (which can exist as either of two alleles “A1” and “A2”), is amplified from genomic DNA by PCR, producing a product that is a mixture of fragments amplified from each member of the relevant chromosome pair. The PCR products may be labeled, e.g. by the incorporation of radioactive or fluorescent tags during PCR. The SNP is identified by denaturing the labeled PCR products and hybridizing the mixture to two oligonucleotides, A1p and A2p, which are oligonucleotide probes specific to the respective alleles.
Samples that are homozygous for allele “A1” should hybridize more strongly to oligonucleotide A1p, producing a strong hybridization signal. Ideally, there should be little, or no, hybridization to oligonucleotide A2p (because of the single base mis-match), presumably allowing the genotype at the SNP to be identified unambiguously. Similarly, samples from individuals homozygous for allele “A2” should hybridize more strongly to oligonucleotide A2p and not to oligonucleotide A1p. Samples from individuals heterozygous at this SNP should hybridize equally to both oligonucleotides, since the PCR products should contain equal amounts of DNA amplified from the two chromosomes carrying the “A1” and “A2” alleles.
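The idealized genotype assignment described above can be expressed as a simple decision rule on the two hybridization signals. The sketch below is illustrative only. The function name, the signal units and both thresholds (`min_total` and `homo_fraction`) are hypothetical and would have to be established empirically for any real assay.

```python
def call_genotype(signal_a1p, signal_a2p, min_total=1.0, homo_fraction=0.8):
    """Assign a genotype from the signals on probes A1p and A2p.

    min_total     : hypothetical minimum combined signal for a call
    homo_fraction : hypothetical fraction of the total signal that one
                    probe must carry for a homozygous call
    """
    total = signal_a1p + signal_a2p
    if total < min_total:
        return "no call"
    fraction_a1 = signal_a1p / total
    if fraction_a1 >= homo_fraction:
        return "A1/A1"
    if fraction_a1 <= 1.0 - homo_fraction:
        return "A2/A2"
    # Roughly comparable signal on both probes.
    return "A1/A2"

if __name__ == "__main__":
    print(call_genotype(100.0, 5.0))
    print(call_genotype(50.0, 50.0))
    print(call_genotype(3.0, 97.0))
```

When the perfectly matched and mis-matched duplexes give similar signals, the fraction drifts toward the middle range and a rule of this kind can no longer separate homozygotes from heterozygotes reliably.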
In practice, it is often difficult to design oligonucleotide probes that can reproducibly and robustly discriminate between different SNP alleles in PCR products amplified from genomic DNA and other DNA samples because the hybridization signal from the perfectly-matched duplex may not differ sufficiently from that produced by a duplex carrying a single mis-match. What is needed, then, is a simple, low cost, rapid and robust, yet sensitive and accurate, method for detecting polymorphisms, in particular single nucleotide polymorphisms, in a polynucleotide (DNA or RNA). The present invention provides such a method.
Thus, in one aspect, this invention relates to a method for detecting polymorphism in a polynucleotide, comprising providing a polynucleotide suspected of containing a polymorphism; amplifying a segment of the polynucleotide encompassing the suspected polymorphism wherein amplification comprises replacing one or more natural nucleotide(s), one of which is a nucleotide involved in the suspected polymorphism, at substantially each point of occurrence in the segment with a modified nucleotide or, if more than one natural nucleotide is replaced, with different modified nucleotides to form an amplified modified segment; cleaving the amplified modified segment into fragments by contacting it with a reagent or reagents that cleave(s) the segment at substantially each point of occurrence of the modified nucleotide(s); hybridizing the fragments to an oligonucleotide; and, analyzing the hybridized fragments for an incorporated detectable label identifying the suspected polymorphism.
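The effect of cleaving at substantially each occurrence of a modified nucleotide can be illustrated in silico. The sketch below is a simplification under stated assumptions: sequences are treated as plain strings, the replaced nucleotide is simply removed at each cleavage site (the actual fragment ends depend on the cleavage chemistry), and both allele sequences are hypothetical.

```python
def cleave_at(segment, base):
    """Split a segment at every occurrence of `base`, dropping empty pieces.

    Assumption: the modified nucleotide itself is not retained on either
    fragment; real fragment termini depend on the reagent used.
    """
    return [fragment for fragment in segment.split(base) if fragment]

# Hypothetical alleles differing at one position (A versus G).
ALLELE_1 = "CCGTACGGTTCAGG"
ALLELE_2 = "CCGTGCGGTTCAGG"

if __name__ == "__main__":
    # Replace and cleave at A, the nucleotide involved in the polymorphism.
    print(cleave_at(ALLELE_1, "A"))
    print(cleave_at(ALLELE_2, "A"))
```

In this example the allele carrying A at the polymorphic site yields two short fragments where the other allele yields one longer fragment, so the two alleles give fragment sets that can be distinguished after hybridization.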
In another aspect this invention relates to a method for detecting polymorphism in a polynucleotide, comprising amplifying a segment of the polynucleotide encompassing the suspected polymorphism wherein amplification comprises replacing a natural nucleotide that is involved in the suspected polymorphism at substantially each point of occurrence in the segment with a modified nucleotide to form an amplified modified segment; cleaving the amplified modified segment into fragments by contacting it with a reagent or reagents that cleave(s) the segment at substantially each point of occurrence of the modified nucleotide(s); hybridizing the fragments to an oligonucleotide which forms duplexes with the fragments that have different melting temperatures; subjecting the duplexes to a temperature that is above the melting temperature of at least one duplex; and, analyzing the remaining duplexes for an incorporated label identifying the suspected polymorphism.
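The temperature-based discrimination step can be sketched as a filter on estimated melting temperatures. The example below is an illustration only. It uses the Wallace rule of thumb (about 2 °C per A or T and 4 °C per G or C in the paired region), which is not taken from this document and is only a crude estimate for short duplexes. The paired-region sequences and the wash temperature are hypothetical.

```python
def wallace_tm(paired_region):
    """Rough melting temperature (deg C) of a short duplex by the Wallace rule."""
    at = paired_region.count("A") + paired_region.count("T")
    gc = paired_region.count("G") + paired_region.count("C")
    return 2 * at + 4 * gc

def surviving_duplexes(duplexes, temperature_c):
    """Keep only duplexes whose estimated Tm exceeds the applied temperature."""
    return {name: seq for name, seq in duplexes.items()
            if wallace_tm(seq) > temperature_c}

if __name__ == "__main__":
    # Hypothetical fragments from two alleles paired with the same oligonucleotide.
    duplexes = {"long_fragment": "CCGTGCGGTTC", "short_fragment": "CGGTTC"}
    print({name: wallace_tm(seq) for name, seq in duplexes.items()})
    print(surviving_duplexes(duplexes, 30))
```

Choosing a temperature between the two estimated melting temperatures leaves only the more stable duplex to be analyzed for the label.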
In another aspect of this invention, the detectable label is incorporated during amplification.
In a further aspect of this invention, incorporating the detectable label during amplification comprises using a detectably labeled primer.
In an aspect of this invention, the detectably labeled primer comprises a radioactive primer or a primer containing a fluorophore.
In a still further aspect of this invention, incorporating the detectable label during amplification comprises using a detectably labeled, modified nucleotide.
In another aspect of this invention, the detectably labeled, modified nucleotide comprises a radioactive modified nucleotide or a modified nucleotide containing a fluorophore.
In an aspect of this invention, the detectably labeled, modified nucleotide is a detectably labeled, modified ribonucleotide.
In an aspect of this invention, the detectably labeled, modified ribonucleotide comprises a radioactive modified ribonucleotide or a modified ribonucleotide containing a fluorophore.
In still another aspect of this invention, incorporating the detectable label during amplification comprises replacing a natural nucleotide, that is different than the natural nucleotide(s) being replaced with a modified nucleotide(s), at one or more point(s) of occurrence in the segment with a detectably labeled nucleotide.
In yet another aspect of this invention, the detectably labeled nucleotide comprises a radioactive nucleotide or a nucleotide containing a fluorophore.
In a further aspect of this invention, the detectably labeled nucleotide comprises a detectably labeled ribonucleotide.
In another aspect of this invention, the detectably labeled ribonucleotide comprises a radioactive ribonucleotide or a ribonucleotide containing a fluorophore.
In an aspect of this invention, the detectable label is incorporated during cleavage.
In a further aspect of this invention, incorporating the detectable label during cleavage comprises using detectably labeled tris(carboxyethyl)phosphine (TCEP).
In a still further aspect of this invention, using detectably labeled TCEP comprises using radioactive TCEP or TCEP containing a fluorophore.
In another aspect of this invention, incorporating the detectable label during cleavage comprises using a detectably labeled secondary amine.
In yet another aspect of this invention, using a detectably labeled secondary amine comprises using a radioactive secondary amine or a secondary amine containing a fluorophore.
In an aspect of this invention, the detectable label is incorporated during hybridization.
In an aspect of this invention, incorporating the detectable label during hybridization comprises hybridizing a second, detectably labeled oligonucleotide to the fragments hybridized to the oligonucleotide.
In a further aspect of this invention, the second, detectably labeled oligonucleotide comprises a radioactive oligonucleotide or an oligonucleotide containing a fluorophore.
In another aspect of this invention, the detectable label is incorporated after cleavage or after hybridization, the method comprising cleaving using a reagent comprising TCEP or a secondary amine; and, substituting the TCEP or secondary amine with a radioactive molecule or a fluorophore after cleavage or after hybridization.
In a further aspect of this invention the polymorphism is selected from the group consisting of a single nucleotide polymorphism (SNP), a deletion or an insertion.
In a still further aspect of this invention, amplifying the segment comprises a polymerase chain reaction (PCR).
In an aspect of this invention, amplifying the segment comprises replacing one natural nucleotide that is involved in the suspected polymorphism at each point of occurrence in the segment with a modified nucleotide to form a modified segment.
In a further aspect of this invention, the above-modified nucleotide comprises a labeled, modified nucleotide.
In another aspect of this invention, the above-labeled modified nucleotide comprises a radioactive modified nucleotide or a modified nucleotide containing a fluorophore.
In an aspect of this invention, the above-modified nucleotide comprises a modified ribonucleotide.
In still another aspect of this invention, the modified nucleotide comprises a labeled, modified ribonucleotide.
In an aspect of this invention, the labeled, modified ribonucleotide comprises a radioactive ribonucleotide or a ribonucleotide containing a fluorophore.
In another aspect of this invention, hybridizing the fragments to an oligonucleotide comprises using an oligonucleotide that is immobilized on a solid support.
In an aspect of this invention, the incorporated detectable label comprises fluorescence resonance energy transfer (FRET).
An aspect of this invention is a compound having the chemical structure: 
wherein R1 is selected from the group consisting of: 
A compound having the chemical structure: 
wherein said “Base” is selected from the group consisting of cytosine, guanine, inosine and uracil is another aspect of this invention.
Another aspect of this invention is a compound having the chemical structure:
wherein said “Base” is selected from the group consisting of adenine, cytosine, guanine, inosine and uracil.
A still further aspect of this invention is a compound having the chemical structure:
wherein said “Base” is selected from the group consisting of adenine, cytosine, guanine, inosine, thymine and uracil.
A polynucleotide comprising a dinucleotide sequence selected from the group consisting of: 
wherein each “Base” is independently selected from the group consisting of adenine, cytosine, guanine and thymine; W is an electron withdrawing group; and, X is a leaving group is also an aspect of this invention. The electron withdrawing group is selected from the group consisting of F, Cl, Br, I, NO2, C≡N, -C(O)OH and OH in another aspect of this invention and, in a still further aspect, the leaving group is selected from the group consisting of Cl, Br, I and OTs.
A method for synthesizing a polynucleotide comprising mixing a compound having the chemical structure:
wherein R1 is selected from the group consisting of: 
with adenosine triphosphate, guanosine triphosphate, and thymidine or uridine phosphate in the presence of one or more polymerases is, too, an aspect of this invention.
A method for synthesizing a polynucleotide comprising mixing a compound having the chemical structure: 
wherein R1 is selected from the group consisting of: 
with adenosine triphosphate, cytidine triphosphate and guanosine triphosphate in the presence of one or more polymerases is also an aspect of this invention.
A method for synthesizing a polynucleotide, comprising mixing a compound having the chemical structure: 
wherein R1 is selected from the group consisting of: 
with cytidine triphosphate, guanosine triphosphate, and thymidine triphosphate in the presence of one or more polymerases is a further aspect of this invention.
An aspect of this invention is a method for synthesizing a polynucleotide, comprising mixing a compound having the chemical structure:
wherein R1 is selected from the group consisting of: 
with adenosine triphosphate, cytidine triphosphate and thymidine triphosphate in the presence of one or more polymerases.
Another aspect of this invention is a method for synthesizing a polynucleotide, comprising mixing a compound selected from the group consisting of:
a compound having the chemical structure: 
wherein said “Base” is selected from the group consisting of cytosine, guanine, inosine and uracil;
a compound having the chemical structure:
wherein said “Base” is selected from the group consisting of adenine, cytosine, guanine, inosine and uracil; and
a compound having the chemical structure:
wherein the “Base” is selected from the group consisting of adenine, cytosine, guanine or inosine, and thymine or uracil, with whichever three of the four nucleoside triphosphates, adenosine triphosphate, cytidine triphosphate, guanosine triphosphate and thymidine triphosphate, do not contain said base (or its substitute), in the presence of one or more polymerases.
Another aspect of this invention is a method for synthesizing a polynucleotide, comprising mixing one of the following pairs of compounds: 
wherein:
Base1 is selected from the group consisting of adenine, cytosine, guanine or inosine, and thymine or uracil;
Base2 is selected from the group consisting of the remaining three bases which are not Base1;
R3 is ⁻O-P(=O)(O⁻)-O-P(=O)(O⁻)-O-P(=O)(O⁻)-O-;
W is an electron-withdrawing group;
X is a leaving group;
a second W or X shown in parentheses on the same carbon atom means that a single W or X group can be in either position on the sugar or both W or both X groups can be present at the same time;
R is an alkyl group;
with whichever two of the four nucleoside triphosphates, adenosine triphosphate, cytidine triphosphate, guanosine triphosphate and thymidine triphosphate, do not contain Base1 or Base2 (or their substitutes), in the presence of one or more polymerases.
An aspect of this invention is a polymerase that is capable of catalyzing the incorporation of a modified nucleotide into a polynucleotide wherein said modified nucleotide does not contain ribose as its only modifying characteristic. The above polymerase, obtained by a process comprising DNA shuffling, is another aspect of this invention.
The process that includes DNA shuffling can comprise the following steps:
a. selecting one or more known polymerase(s);
b. performing DNA shuffling;
c. transforming shuffled DNA into a host cell;
d. growing host cell colonies;
e. forming a lysate from said host cell colony;
f. adding a DNA template containing a detectable reporter sequence, the modified nucleotide or nucleotides whose incorporation into a polynucleotide is desired and the natural nucleotides not being replaced by said modified nucleotide(s); and,
g. examining the lysate for the presence of the detectable reporter.
The process that includes DNA shuffling can also comprise:
a. selecting a known polymerase or two or more known polymerases having different structures or different catalyzing capabilities or both;
b. performing DNA shuffling;
c. transforming said shuffled DNA into a host to form a library of transformants in host cell colonies;
d. preparing first separate pools of said transformants by plating said host cell colonies;
e. forming a lysate from each said first separate pool host cell colonies;
f. removing all natural nucleotides from each said lysate;
g. combining each said lysate with:
a single-stranded DNA template comprising a sequence corresponding to an RNA polymerase promoter followed by a reporter sequence;
a single-stranded DNA primer complementary to one end of said template;
the modified nucleotide or nucleotides whose incorporation into said polynucleotide is desired;
each natural nucleotide not being replaced by said modified nucleotide or nucleotides;
h. adding RNA polymerase to each said combined lysate;
i. examining each said combined lysate for the presence of said reporter sequence;
j. creating second separate pools of transformants in host cell colonies from each said first separate pool of host cell colonies in which the presence of said reporter is detected;
k. forming a lysate from each said second separate pool of host cell colonies;
l. repeating steps g, h, i, j, k and l to form separate pools of transformants in host cell colonies until only one host cell colony remains which contains said polymerase; and,
m. recloning said polymerase from said one host cell colony into a protein expression vector.
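The repeated sub-pooling in steps j through l resembles a sib-selection search, which can be sketched as follows. This is a simplified illustration, not the claimed process. The colony identifiers, the number of sub-pools per round and the `is_positive_pool` test (standing in for detection of the reporter in a pooled lysate) are hypothetical, and only the first positive sub-pool is followed in each round.

```python
def sib_select(colonies, is_positive_pool, n_subpools=4):
    """Narrow a positive pool down to a single colony by repeated sub-pooling.

    Returns (colony, rounds), or (None, rounds) if no sub-pool tests positive.
    """
    pool = list(colonies)
    rounds = 0
    while len(pool) > 1:
        size = -(-len(pool) // n_subpools)  # ceiling division
        subpools = [pool[i:i + size] for i in range(0, len(pool), size)]
        positives = [sp for sp in subpools if is_positive_pool(sp)]
        if not positives:
            return None, rounds
        pool = positives[0]  # assumption: follow only the first positive sub-pool
        rounds += 1
    return pool[0], rounds

if __name__ == "__main__":
    # Hypothetical library of 64 colonies in which colony 37 carries the activity.
    print(sib_select(range(64), lambda subpool: 37 in subpool))
```

With four sub-pools per round, a pool of 64 colonies is resolved in three rounds of lysate preparation and reporter detection instead of 64 individual assays.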
A polymerase which is capable of catalyzing the incorporation of a modified nucleotide into a polynucleotide, obtained by a process comprising cell senescence selection is another aspect of this invention.
The cell senescence selection process can comprise the following steps:
a. mutagenizing a known polymerase to form a library of mutant polymerases;
b. cloning said library into a vector;
c. transforming said vector into host cells selected so as to be susceptible to being killed by a selected chemical only when said cell is actively growing;
d. adding a modified nucleotide;
e. growing said host cells;
f. treating said host cells with said selected chemical;
g. separating living cells from dead cells; and,
h. isolating said polymerase or polymerases from said living cells.
The cell senescence selection process can also comprise steps including:
a. mutagenizing a known polymerase to form a library of mutant polymerases;
b. cloning said library of mutant polymerases into a plasmid vector;
c. transforming with said plasmid vector bacterial cells that, when growing, are susceptible to an antibiotic;
d. selecting transfectants using said antibiotic;
e. introducing a modified nucleotide, as the corresponding nucleoside triphosphate, into the bacterial cells;
f. growing the cells;
g. adding an antibiotic, which will kill bacterial cells that are actively growing;
h. isolating said bacterial cells;
i. growing said bacterial cells in fresh medium containing no antibiotic;
j. selecting live cells from growing colonies;
k. isolating said plasmid vector from said live cells;
l. isolating said polymerase; and,
m. assaying said polymerase.
Repeating steps c to k of the above process one or more additional times before proceeding to step l is another aspect of this invention.
That the polymerase obtained in the above methods be a heat stable polymerase is another aspect of this invention.
A final aspect of this invention is a kit, comprising:
one or more modified nucleotides;
one or more polymerases capable of incorporating said one or more modified nucleotides in a polynucleotide to form a modified polynucleotide; and,
a reagent or reagents capable of cleaving said modified polynucleotide at each point of occurrence of said one or more modified nucleotides in said polynucleotide.
Table 1 shows the molecular weights of the four DNA nucleotide monophosphates and the mass difference between each pair of nucleotides.
Table 2A shows the masses of all possible 2mers, 3mers, 4mers and 5mers of the DNA nucleotides in Table 1.
Table 2B shows the eight possible sets of isobaric oligonucleotides.
Table 3 shows the masses of all possible 2mers, 3mers, 4mers, 5mers, 6mers and 7mers that would be produced by cleavage at one of the four nucleotides and the mass differences between neighboring oligonucleotides.
Table 4 shows the mass changes that will occur for all possible point mutations (replacement of one nucleotide by another) and the theoretical maximum size of a polynucleotide in which a point mutation should be detectable by mass spectrometry using mass spectrometers of varying resolving powers.
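The kind of estimate summarized in Table 4 can be approximated with a short calculation. The sketch below does not reproduce the values of Table 4. It assumes commonly cited average masses for the four deoxynucleotide monophosphate residues as they occur within a DNA chain, and it uses the simple definition of resolving power R = m/Δm, so that a mass change Δm is taken to be detectable up to a total mass of about R × Δm. Isotope distributions, adducts and instrument accuracy are ignored.

```python
from itertools import combinations

# Assumed average residue masses (Da) of the nucleotides within a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}
MEAN_RESIDUE_MASS = sum(RESIDUE_MASS.values()) / len(RESIDUE_MASS)

def substitution_mass_changes():
    """Absolute mass change (Da) for each of the six base-pair substitutions."""
    return {(x, y): round(abs(RESIDUE_MASS[x] - RESIDUE_MASS[y]), 2)
            for x, y in combinations("ACGT", 2)}

def max_detectable_length(delta_mass, resolving_power):
    """Approximate longest polynucleotide (nt) in which delta_mass is resolved."""
    max_mass = resolving_power * delta_mass
    return int(max_mass // MEAN_RESIDUE_MASS)

if __name__ == "__main__":
    for pair, delta in substitution_mass_changes().items():
        print(pair, delta, max_detectable_length(delta, 1000))
```

Under these assumptions the A/T substitution gives the smallest mass change and therefore sets the tightest limit on fragment size for a given resolving power.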
Table 5 shows the actual molecular weight differences observed in an oligonucleotide using the method of this invention; the difference reveals a hitherto unknown variance in the oligonucleotide.
Table 6 shows all of the masses obtained by cleavage of an exemplary 20mer in four separate reactions, each reaction being specific for cleavage at one of the DNA nucleotides; i.e., at A, C, G and T.
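A fragment-mass listing of the general kind described for Table 6 can be generated as follows. This sketch is illustrative and does not reproduce Table 6. The 20mer is hypothetical, the residue masses are assumed average values, each fragment mass is a plain sum of residue masses without terminal-group corrections, and the nucleotide at each cleavage site is assumed to be lost from the fragments.

```python
# Assumed average residue masses (Da) of the nucleotides within a DNA chain.
RESIDUE_MASS = {"A": 313.21, "C": 289.18, "G": 329.21, "T": 304.20}

EXAMPLE_20MER = "ACGTTGCAAGCTTAGGCTCA"  # hypothetical sequence

def fragment_mass(fragment):
    """Sum of residue masses; end-group corrections are deliberately omitted."""
    return round(sum(RESIDUE_MASS[base] for base in fragment), 2)

def cleavage_table(sequence):
    """Fragments and masses for four reactions, each cleaving at one base."""
    table = {}
    for base in "ACGT":
        fragments = [f for f in sequence.split(base) if f]
        table[base] = [(f, fragment_mass(f)) for f in fragments]
    return table

if __name__ == "__main__":
    for base, rows in cleavage_table(EXAMPLE_20MER).items():
        print(base, rows)
```

Comparing the four fragment sets for a reference sequence and a variant sequence shows which reaction or reactions change when a single nucleotide is substituted.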