Nucleic acid sequencing, and in particular DNA sequencing, is essential to the practice of biotechnology, genetic engineering and many other disciplines that need to determine the genetic information contained in DNA. The sequencing of DNA (herein termed “DNAS”) is the process of determining the sequence of nucleotides that make up a strand of DNA; it can also be used to identify the type of nucleotide at one or more specific positions. A nucleotide usually consists of a pentose sugar, a phosphate and one of four possible nitrogenous bases, denoted A for adenine, G for guanine, C for cytosine, and T for thymine. The sequence of these bases uniquely describes each piece of DNA. DNAS is a crucial step in genetic engineering and biotechnology, since it provides the precise code of genetic information contained in a sample of DNA.
DNA is typically double stranded; hence the term “base pairs” is often used, since each base on one strand is opposed by its complementary base on the other strand. An enormous number of bases must be sequenced in order to read a piece of DNA. Even a simple piece of DNA from a bacterial cell would likely comprise several thousand bases.
DNA sequencing has traditionally been a very labor-intensive process. Much has been written about DNA sequencing and genetic engineering, and the reader is referred to the many references on this subject, which will provide additional background information.
Two methods of DNA sequencing have been developed. The first is by Maxam and Gilbert (1977) and is described in Proc. Natl. Acad. Sci. USA, Vol. 74, page 560. The second is by Sanger et al. (1977) and is described in Proc. Natl. Acad. Sci. USA, Vol. 74, page 5463. The Sanger method involves the generation of DNA fragments by the enzymatic extension of a small piece of DNA called a primer. The primer is extended by an enzyme called polymerase, which adds the appropriate bases one at a time. The sequencing reaction includes bases that permit DNA extension (CEB) and bases that have been chemically modified to terminate DNA extension (CTB). Termination of DNA extension results in the generation of a DNA fragment. The sequencing reaction contains many copies of the DNA and is a dynamic system of DNA extension and DNA termination in which, at any given site on any strand of DNA, either a CEB or a CTB may be added. This results in a large number of pools of fragments, each pool differing in length from the next by a single base.
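The fragment pools produced by chain termination can be illustrated with a minimal Python sketch. It assumes an idealized reaction in which a terminated fragment is obtained at every position of the newly synthesized strand; the function name `termination_fragments`, the `primer_len` parameter and the example sequence are hypothetical and are used only for illustration.

```python
def termination_fragments(synthesized, primer_len=0):
    """Group idealized chain-terminated fragment lengths by terminating base.

    `synthesized` is the strand built by the polymerase (hypothetical input).
    Assumes a CTB is incorporated at every position in some copy of the DNA,
    so that one fragment pool exists for each length.
    """
    pools = {"A": [], "C": [], "G": [], "T": []}
    for position, base in enumerate(synthesized, start=1):
        # A fragment that terminates here has length primer + position.
        pools[base].append(primer_len + position)
    return pools


if __name__ == "__main__":
    print(termination_fragments("GATTACA"))
```

Each list holds the lengths of the fragments ending in one base type, and every length from 1 to 7 appears in exactly one pool, so consecutive pools differ by a single base.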
Once the generation of fragments has been completed, the resultant mixture of DNA fragments needs to be separated and analyzed. The task of separating the fragments by size to determine their order can be performed by a number of well-known techniques. The first methods of manual DNA sequencing utilized polyacrylamide gel electrophoresis to separate the fragments. Polyacrylamide gels can resolve fragments that differ in length by a single base, and that resolution is necessary for sequencing. Each fragment is labeled with a radioactive element that typically gives off a beta particle, such as radioactive phosphorus (“32P”). Each of the four samples (one per terminating base) is then separated by size in its own lane in the gel. The four lanes are typically side by side. After electrophoresis, a piece of x-ray film is placed next to the gel for a number of hours, often a couple of days, to expose the film to the radioactive emissions from the 32P. When the film is developed, the fragments show up as dark bands, and the sequence can then be read from the order in which the bands appear, from the bottom to the top of the film.
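The bottom-to-top reading of the four lanes can be sketched as follows. This is a minimal illustration, not a description of any particular instrument: it assumes that each band is already known by its fragment length and that shorter fragments migrate farther and therefore lie nearer the bottom of the film. The function name `read_gel` and the input layout are hypothetical.

```python
def read_gel(lanes):
    """Read a base sequence from four lanes of band positions.

    `lanes` maps each terminating base to the fragment lengths seen in
    that lane (hypothetical input format). Sorting by length corresponds
    to reading the film from the bottom (shortest) to the top (longest).
    """
    bands = [(length, base)
             for base, lengths in lanes.items()
             for length in lengths]
    bands.sort()  # shortest fragment first
    return "".join(base for _, base in bands)


if __name__ == "__main__":
    lanes = {"A": [2, 5, 7], "C": [6], "G": [1], "T": [3, 4]}
    print(read_gel(lanes))
```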
Automating DNAS involves automating the process of detecting the fragments on the electrophoresis medium (e.g., a gel) and then automatically determining the DNA base sequence from the sequence of detected fragments, using the band-reading procedure described above implemented in a microprocessor. Because of the time needed to expose the x-ray film to the β radiation of the 32P, and other considerations involving the use of radioisotopes, new methods of tagging and sequencing based on fluorescence were developed. See, for example, Biophysical and Biochemical Aspects of Fluorescence Spectroscopy, edited by T. Gregory Dewey, Plenum Press, 1997; “Large Scale and Automated Sequence Determination,” by T. Hunkapiller et al., (1991), Science, Vol. 254, pages 59-67; and “DNA Sequencing: Present Limitations and Prospects for the Future,” Barrell, (1991), FASEB Journal, Vol. 5, pages 40-45.
Fluorescence tagging of the fragments involves the attachment of a fluorescent compound, or fluorophore, to each fragment, analogously to the attachment of the radioactive label to each fragment. These fluorescent labels were found not to adversely affect gel electrophoresis or the determination of the sequence.
Fluorescence is an optical method that involves stimulating the fluorescent molecule by shining light on it at an optical wavelength that is optimum for that fluorescent molecule. Fluorescent light is then given off by the molecule at a characteristic wavelength that is typically slightly longer than the stimulation wavelength. By focusing the light at the stimulating wavelength down to a point on the gel and then detecting the presence of any optical radiation at the characteristic wavelength of light from the fluorescent molecule, the presence at that point of fragments of DNA tagged with that fluorescent molecule may be determined.
Two methods of implementing an automated DNA sequencing instrument are known in the art. One, reported by Smith et al., (1986), Nature, Vol. 321, pages 674-679, puts a different fluorescent tag on each of the four samples of fragments described above. Thus, the sample of fragments that end in the base A is tagged with one fluorophore, the sample of fragments that end in the base G is tagged with another fluorophore, and so on for the other two samples. Each fluorophore can be distinguished by its own stimulation and emission wavelengths of light.
In the Smith et al. method, all four samples are electrophoresed in the same lane together and the differences in their tags are used to distinguish them. That has the advantage that four separate lanes are not used, since the progression of fragments in different lanes is often not consistent with one another and difficulties often arise in determining the sequence as a result.
Another method, reported by Ansorge et al., (1986), J. Biochem. Biophys. Methods, Vol. 13, pages 315-323 and Nucleic Acids Res., Vol. 15(11), pages 4593-4602 (1987), uses one fluorescent tag for all fragments, but employs four separate lanes of gel electrophoresis in a manner similar to radioactively labeled sequencing. That approach has the potential disadvantage that the use of four lanes, with different fragment migration rates caused by local temperature variations and other inconsistencies within the gel, could limit the reliability of the sequence determination.
Fluorescence tagging and the detection of natural fluorescence in molecules are methods of analytical chemistry and biology that are well known in the art. The methods described above have been developed for DNA sequencing by the creation of fluorescent tags that can be bound to fragments of DNA. The instruments used to detect fluorescence consist of the following parts. A light source, either one with a broad optical bandwidth, such as a light bulb, or a laser, is used as the source of the stimulating light. An optical filter is used to select the light at the desired stimulation wavelength and direct it onto the sample. Optical filters are available at essentially any wavelength and are typically constructed by depositing thin-film layers whose thicknesses are a fraction of the desired transmission wavelength. The light that exits the optical filter is then applied to the sample to stimulate the fluorescent molecule.
The molecule then emits light at its characteristic fluorescent wavelength. This light is collected by a suitable lens and is then passed through a second optical filter centered at the characteristic wavelength before being brought to a detection device such as a photomultiplier tube, a photoconductive cell, or a semiconductor optical detector. Therefore, only light at the desired characteristic wavelength is detected to determine the presence of the fluorescent molecule.
Whichever automatic DNAS system is used, the data generated are analyzed by the computer software of the DNA sequencer to produce a signal, which takes the form of a series of peaks for each of the four different colors, where each color represents a particular nucleotide base type. The heights of the peaks are rarely uniform and are proportional to the number of fragments in the DNA fragment pool. This number is in turn proportional to the amount of DNA that is being sequenced and to the rate at which unlabeled nucleotides are incorporated into the extending DNA chain relative to the rate at which labeled nucleotides are incorporated. The scientist or technician has the choice of checking these data to ensure that the base calling by the automated sequencer has been performed correctly.
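The conversion of four color traces into a series of called bases can be illustrated with a minimal sketch. It is not the software of any actual sequencer: it assumes four traces of equal length, already free of spectral overlap and mobility differences, and simply calls a base wherever the strongest channel shows a local maximum. The function name `call_bases`, the `min_height` parameter and the sample intensities are hypothetical.

```python
def call_bases(traces, min_height=0.0):
    """Call bases from four intensity traces by simple peak picking.

    `traces` maps each base to a list of intensities of equal length
    (hypothetical input format). A call is made at index i when the
    strongest channel exceeds `min_height` and is a local maximum there.
    Returns a list of (index, base, height) tuples.
    """
    n = len(next(iter(traces.values())))
    calls = []
    for i in range(1, n - 1):
        # Strongest channel at this scan position.
        base, height = max(((b, t[i]) for b, t in traces.items()),
                           key=lambda item: item[1])
        trace = traces[base]
        if height > min_height and trace[i - 1] < height >= trace[i + 1]:
            calls.append((i, base, height))
    return calls


if __name__ == "__main__":
    traces = {
        "A": [0, 5, 0, 0, 0, 0, 0, 8, 0],
        "C": [0, 0, 0, 0, 0, 6, 0, 0, 0],
        "G": [0, 0, 0, 9, 0, 0, 0, 0, 0],
        "T": [0, 0, 0, 0, 0, 0, 0, 0, 0],
    }
    calls = call_bases(traces)
    print("".join(base for _, base, _ in calls))
```

The returned heights are kept alongside the calls because, as noted above, peak heights are rarely uniform and a reviewer may wish to inspect them.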
Most DNAS applications involve the identification of sequences of anonymous DNA, as in, for example, the Human Genome Project. DNAS has also been used to study evolution and population migration by studying sequence diversity of the same region within different individuals of the same or different species. Clinically, DNAS has been used for the detection of mutations in cancer studies and for the detection of viral mutations associated with resistance to anti-viral drugs. One of the most common applications of DNAS is tissue typing, where the genetic matching of tissue types between donors and recipients is critical to the success of transplantation.
For many sequencing-based typing applications, DNA from two chromosomes of an individual is sequenced together. At most positions, the sequence is identical on both chromosomes, resulting in a single peak (homozygous). At some positions, however, the sequence differs between the two chromosomes, resulting in two peaks at the same position (heterozygous). Each of these peaks is reduced in height compared with the peak obtained when the same base is present as homozygous. It is the accurate identification of both bases when they are present at the same position that remains the impediment to widespread use of DNAS for clinical application.
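The distinction between a homozygous and a heterozygous position can be illustrated with a minimal sketch that compares the two strongest peaks at a single position. This is a simple ratio test given for illustration only and is not the method of any particular sequencer; the function name `call_position` and the threshold value of 0.3 are assumptions. Heterozygous calls are reported with the standard IUPAC two-base ambiguity codes.

```python
# Standard IUPAC ambiguity codes for two-base combinations.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def call_position(heights, ratio_threshold=0.3):
    """Call one position from the peak heights of the four bases.

    `heights` maps each base to its peak height at this position
    (hypothetical input format). If the second-highest peak is at least
    `ratio_threshold` (an assumed, arbitrary value) times the highest,
    the position is called heterozygous.
    """
    ranked = sorted(heights.items(), key=lambda kv: kv[1], reverse=True)
    (base1, h1), (base2, h2) = ranked[0], ranked[1]
    if h1 <= 0:
        return "N"  # no signal at this position
    if h2 / h1 >= ratio_threshold:
        return IUPAC[frozenset((base1, base2))]
    return base1


if __name__ == "__main__":
    print(call_position({"A": 100, "C": 2, "G": 3, "T": 1}))
    print(call_position({"A": 52, "C": 2, "G": 48, "T": 1}))
```

A fixed threshold of this kind is sensitive to the non-uniform peak heights discussed above, which is one reason accurate heterozygous base calling is difficult.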
Consequently, there is a need for a method of discriminating between homozygous and heterozygous sequence generated by automatic sequencers. Moreover, there is a need for a method that increases the base calling accuracy for heterozygous sequence and improves the ability to detect low level mutations thereby enabling the quantitation of mutations.
The method of detecting DNA variation in sequence data described in WO/03102211 compares the sequence trace of a reference sequence with the traces of sample sequences, performs an analysis to identify the differences between the two, and provides a trace that contains only the difference between the two traces. A disadvantage of this method is that it requires a reference sequence trace and is often inaccurate.
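The general idea of a difference trace can be illustrated with a minimal sketch. It is not the algorithm of the cited publication, whose details are not reproduced here; it merely assumes two single-channel traces that are already aligned point for point, scales each to its own maximum, and subtracts the reference from the sample. The function names `normalize` and `trace_difference` are hypothetical.

```python
def normalize(trace):
    """Scale a trace so that its highest point equals 1.0."""
    peak = max(trace)
    return [value / peak for value in trace] if peak > 0 else list(trace)


def trace_difference(sample, reference):
    """Return the point-by-point difference of two aligned traces.

    Assumes both traces cover the same scan positions (an idealization;
    real traces would first need to be aligned to one another).
    """
    if len(sample) != len(reference):
        raise ValueError("traces must have the same length")
    return [s - r for s, r in zip(normalize(sample), normalize(reference))]


if __name__ == "__main__":
    sample = [0, 10, 0, 5, 0]
    reference = [0, 20, 0, 0, 0]
    print(trace_difference(sample, reference))
```

Even in this simplified form, the result depends entirely on the availability and quality of the reference trace.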