This invention relates to a method and system of nucleotide sequence determination and mutation detection in a subject nucleic acid molecule for use with automated electrophoresis detection apparatus.
One of the steps in nucleotide sequence determination of a subject nucleic acid polymer is interpretation of the pattern of oligonucleotide fragments which results from electrophoretic separation of fragments of the subject nucleic acid polymer (the "fragment pattern"). The interpretation of the fragment pattern, colloquially known as "base-calling," results in determination of the order of four nucleotide bases, A (adenine), C (cytosine), G (guanine) and T (thymine) for DNA or U (uracil) for RNA in the subject nucleic acid polymer.
In the earliest method of base-calling, a method which is still commonly employed, the subject nucleic acid polymer is labeled with a radioactive isotope and either Maxam and Gilbert chemical sequencing (Proc. Natl. Acad. Sci. USA, 74: 560-564 (1977)) or Sanger et al. chain termination sequencing (Proc. Natl. Acad. Sci. USA 74: 5463-5467 (1977)) is performed. The resulting four samples of nucleic acid fragments (terminating in A, C, G, or T(U) respectively in the Sanger et al. method) are loaded into separate loading sites at the top end of an electrophoresis gel. An electric field is applied across the gel, and the fragments migrate through the gel. During this electrophoresis, the gel acts as a separation matrix. The fragments, which in each sample are of an extended series of discrete sizes, separate into bands of discrete species in a channel along the length of the gel. Shorter fragments generally move more quickly than larger fragments. After a suitable separation period, the electrophoresis is stopped. The gel may now be exposed to radiation sensitive film for the generation of an autoradiograph. The pattern of radiation detected on the autoradiograph is a fixed representation of the fragment pattern. A researcher then manually base-calls the order of fragments from the fragment pattern by identifying the stepwise sequence of the order of bands across the four channels.
More recently, with the advent of the Human Genome Organization and its massive project to sequence the entire human genome, researchers have been turning to automated DNA sequencers to process vast amounts of DNA sequence information. Existing automated DNA sequencers are available from Applied Biosystems, Inc. (Foster City, Calif.), Pharmacia Biotech, Inc. (Piscataway, N.J.), Li-Cor, Inc. (Lincoln, Nebr.), Molecular Dynamics, Inc. (Sunnyvale, Calif.) and Visible Genetics Inc. (Toronto). Automated DNA sequencers are basically electrophoresis apparatuses with detection systems which detect the presence of a detectable molecule as it passes through a detection zone. Each of these apparatuses is therefore capable of real-time detection of migrating bands of oligonucleotide fragments; the fragment patterns consist of a time-based record of fluorescence emissions or other detectable signals from each individual electrophoresis channel. They do not require the cumbersome autoradiography methods of the earliest technologies to generate a fragment pattern.
The prior art techniques for computer-assisted base-calling for use in automated DNA sequencers are exemplified by the method of the Pharmacia A.L.F.(tm) sequencer. Oligonucleotide fragments are labeled with a fluorescent molecule such as fluorescein prior to the sequencing reactions. Sanger et al. sequencing is performed and samples are loaded into the top end of an electrophoresis gel. Under electrophoresis the bands of species separate, and a laser at the bottom end of the gel causes the fragments to fluoresce as they pass through a detection zone. The fragment patterns are a record of fluorescence emissions from each channel. In general, each fragment pattern includes a series of sharp peaks and low, flat plains, the peaks representing the passage of a band of oligonucleotide fragments and the plains representing the absence of such bands.
To perform computer-assisted base-calling, the A.L.F. system executes at least four discrete functions: 1) it smooths the raw data with a band-pass frequency filter; 2) it identifies successive maxima in each data stream; 3) it aligns the smoothed data from each of the four channels into an aligned data stream; and 4) it determines the order of the successive maxima with respect to the aligned data stream. The alignment process used in the apparatus depends on the existence of very little variability between the lanes of the gel. When that condition is met, the fragment patterns from each lane can be superimposed by alignment to a presumed starting point in each pattern to provide a record of a continuous, non-overlapping series of sharp peaks, each peak representing a one nucleotide step in the subject nucleic acid. Where a distinct ordering of peaks cannot be made, the computer identifies the presence of ambiguities and fails to identify a sequence.
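The four functions described above can be illustrated with a minimal sketch. It is not the A.L.F. implementation: a moving average stands in for the band-pass frequency filter, and the names smooth, find_maxima, call_bases and make_lane, together with the threshold and min_gap parameters, are hypothetical choices made only for this illustration.

```python
from typing import Dict, List, Optional


def smooth(signal: List[float], window: int = 3) -> List[float]:
    """Step 1: moving average (an assumed stand-in for the band-pass filter)."""
    half = window // 2
    out = []
    for i in range(len(signal)):
        lo, hi = max(0, i - half), min(len(signal), i + half + 1)
        out.append(sum(signal[lo:hi]) / (hi - lo))
    return out


def find_maxima(signal: List[float], threshold: float) -> List[int]:
    """Step 2: indices of local maxima at or above an assumed threshold."""
    return [
        i
        for i in range(1, len(signal) - 1)
        if signal[i] >= threshold
        and signal[i] > signal[i - 1]
        and signal[i] >= signal[i + 1]
    ]


def call_bases(
    lanes: Dict[str, List[float]],
    starts: Optional[Dict[str, int]] = None,
    threshold: float = 1.0,
    min_gap: int = 2,
) -> Optional[str]:
    """Steps 3 and 4: align lanes to presumed starting points, then order peaks.

    Returns None when two peaks fall closer than min_gap samples, i.e. when
    no distinct ordering of peaks can be made.
    """
    starts = starts or {}
    events = []
    for base, raw in lanes.items():
        offset = starts.get(base, 0)  # presumed starting point of this lane
        for i in find_maxima(smooth(raw), threshold):
            events.append((i - offset, base))
    events.sort()
    for (t1, _), (t2, _) in zip(events, events[1:]):
        if t2 - t1 < min_gap:
            return None  # ambiguity: no sequence is reported
    return "".join(base for _, base in events)


def make_lane(length: int, centers: List[int]) -> List[float]:
    """Synthetic lane with one small triangular peak per center (test data)."""
    lane = [0.0] * length
    for c in centers:
        lane[c - 1] += 1.0
        lane[c] += 4.0
        lane[c + 1] += 1.0
    return lane


lanes = {
    "A": make_lane(18, [2, 14]),
    "C": make_lane(18, [5]),
    "G": make_lane(18, [8]),
    "T": make_lane(18, [11]),
}
print(call_bases(lanes))  # ACGTA
```

If the C lane in this synthetic data is delayed by three samples and no starting point is supplied for it, its peak coincides with the G peak and the sketch returns None, which mirrors the dependence on low lane-to-lane variability noted above.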
Other published methods of computer-assisted base-calling include the methods disclosed by Tibbetts and Bowling (U.S. Pat. No. 5,365,455) and Dam et al. (U.S. Pat. No. 5,119,316), which patents are incorporated herein by reference. Tibbetts and Bowling disclose a method and system which relies on the second derivative of the peak slopes to smooth the data. The second derivative is used to provide an informative variable and an intensity variable to determine the nucleic acid sequence corresponding to the subject nucleic acid polymer. Dam et al. disclose a method of combining peak shapes from two signal spectra derived from the same electrophoresis channel to determine the order of nucleotides in the subject nucleic acid polymer.
Three practical problems face all existing methods and systems of base-calling. The first is the inability to align shifted lanes of data. If the signals from the related data streams do not begin at approximately the same time, it is difficult, if not impossible, for these techniques to determine the correct alignment. The second is the difficulty of resolving "compressions" in the fragment pattern: those anomalies wherein the signals from two or more nucleotides in a row are not distinguishably separated as compared to other nucleotides in the general vicinity. Compressions result most often from short hairpin loops at the end of a fragment which cause altered gel mobility features. The third problem is the inability to identify nucleotide sequences beyond the limits of single nucleotide resolution. Larger fragments tend to need longer electrophoresis runs to separate into discrete bands of fragments, in part because a one nucleotide addition to a 300 nt fragment is less significant than a one nucleotide addition to a 25 nt fragment. The limit of resolution is reached when individual bands cannot be usefully distinguished.
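The proportion behind the third problem can be made concrete with a short calculation. The sketch below computes only the fractional change in length caused by adding one nucleotide; it is not a model of gel mobility, and the function name relative_size_step is hypothetical.

```python
def relative_size_step(length_nt: int) -> float:
    """Fractional length change when one nucleotide is added to a fragment."""
    return 1.0 / length_nt


for n in (25, 300):
    print(f"{n} nt -> {relative_size_step(n):.2%}")
# 25 nt -> 4.00%
# 300 nt -> 0.33%
```

On this measure, the one nucleotide step at 300 nt is twelve times smaller than the step at 25 nt.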
All of these problems limit the most crucial aspects of base-calling, which are speed, read-length and accuracy. Read-length is the number of fragment bands which can be identified from the fragment pattern. Greater read-length provides greater information about the DNA sequence in question. Accuracy measures the number of base-calling errors. Frequent errors are unacceptable since they alter the biological meaning of the DNA sequence in question. And, as described below, if DNA sequence determination is to be used as a tool for diagnostic purposes, base-calling errors can lead to misdiagnosis.
The advent of DNA sequence-based diagnosis provides new opportunities for improved speed, accuracy and read-length in computer-assisted base-calling. DNA sequence-based diagnosis is the routine sequencing of patient DNA to identify genotype and/or specific gene sequences of the patient, wherein the DNA sequence is reported back to the physician and patient in order to assist in diagnosis and treatment of patient conditions. One of the great advantages of DNA sequence-based diagnosis is that the DNA sequence being examined is largely known. As demonstrated by the instant invention, it is possible to use the known fragment pattern for each DNA sequence to assist in the interpretation of the fragment pattern obtained from a patient sample to obtain improved read-length and accuracy. The known fragment pattern can also be used to increase the speed of sample analysis.
It is an object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection which can be used with DNA sequence-based diagnosis.
It is a further object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection when the fragment pattern demonstrates localized compressions.
It is a further object of the instant invention to provide a method and system for nucleotide sequence determination and mutation detection when the fragment pattern does not provide single nucleotide resolution.
It is a further object of the instant invention to provide a method and system of computer-assisted base-calling which can be used with fragment pattern records from high speed electrophoretic separations which demonstrate less than ideal separation characteristics.