The rate at which the sequence of the four nucleotides in nucleic acid samples can be determined is a major technical obstacle to further advancement of molecular biology, medicine, and biotechnology. Nucleic acid sequencing methods which involve separation of nucleic acid molecules in a gel have been in use since 1978.
The traditional method of determining a sequence of nucleotides (i.e., the order of the A, G, C and T nucleotides in a sample) is performed by preparing a mixture of randomly-terminated, differentially labelled nucleic acid fragments by degradation at specific nucleotides, or by dideoxy chain termination of replicating strands. Resulting nucleic acid fragments in the range of 1 to 500 bp are then separated on a gel to produce a ladder of bands wherein the adjacent samples differ in length by one nucleotide.
The present invention relates to an alternative methodology for sequencing a target nucleic acid known as sequencing by hybridization (SBH). The array-based approach of SBH does not require single base resolution in separation, degradation, synthesis or imaging of a nucleic acid molecule. Using mismatch-discriminative hybridization of short oligonucleotides K nucleotides in length, lists of constituent K-mer oligonucleotides may be determined for a target nucleic acid. The sequence of the target nucleic acid may then be assembled by uniquely overlapping the scored oligonucleotides.
Nucleic acid sequencing by hybridization shares interesting parallels with conducting a computer search of a text file for a particular word or a phrase. In each case, a large string of characters is probed with a specific shorter string to detect matching sequences. In a computer text search, the search string or strings (key words) are used to browse a large Internet or local data base to identify the subset of specific documents containing perfect sequence matches, which is then retrieved for further review or analysis. In SBH, oligonucleotide probes ranging from 4 to 25 characters in length are used to browse libraries of nucleic acid segments to identify nucleic acid molecules containing exact complementary sequences. These molecules may then be further analyzed by mapping or clustering, or by partial or full sequencing.
In the case of a hybridization search of four simple DNA samples with four different 5-mer probes (which could be called key words, or strings), each sample binds a different combination of probes, leading to a characteristic hybridization pattern. Each positive binding (or hybridization) event in a given DNA sample provides a discrete piece of information about its sequence. Neither the frequency nor location of the string within the DNA molecule is obtained from a hybridization search, as is also the case in most computer text searches. For example, a positive search result for the word “tag” in a set of document titles does not identify whether the word is positioned at the beginning, middle, or end of the selected titles, nor whether it occurs once, twice, or many times in any of these titles. Similarly, the entire DNA is sampled by random probe-binding trials, without determination of exactly where in the chain particular probes bind.
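The presence-only nature of a hybridization search can be illustrated with a minimal Python sketch. It assumes, for simplicity, that a probe "hybridizes" when its sequence occurs as a substring of the strand being read (real probes bind the complementary strand), and the sample sequences and probes are hypothetical.

```python
def hybridization_pattern(target, probes):
    """Score each probe as positive (True) if it occurs at least once in
    the target. Neither the position nor the number of occurrences is
    recorded, mirroring what a hybridization search reports."""
    return {probe: probe in target for probe in probes}


# Hypothetical 5-mer probes and samples, chosen only for illustration.
probes = ["ATGCC", "CGTAA", "CCTTA", "GGATG"]
samples = {"sample_1": "ATGCCGTAAC", "sample_2": "GGATGCCTTA"}

for name, sequence in samples.items():
    positives = [p for p, hit in hybridization_pattern(sequence, probes).items() if hit]
    print(name, positives)
```

Each sample yields a different set of positive probes, and a probe that occurs twice in a sample is scored the same as one that occurs once.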
In a computer search of English language text, the complexity of the English alphabet (26 letters) generally allows a meaningful search of a given text to be done with one or a few specific words. With a DNA search, the simple four-letter genetic alphabet requires use of either more or longer “words” (strings) to precisely identify a specific DNA. A simple word like “cat” might yield useful results in a computer search of the Internet, but the genetic triplet “CAT” occurs far too frequently (about once in every sixty-four triplets) to be of much use in DNA identification. The lengths of the DNA string (sequence) and the probe (interrogating string) are important parameters in devising a successful SBH experiment. By choosing appropriate probe and sample lengths, a researcher can obtain useful sequence data.
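The effect of probe length on specificity can be estimated with a short Python sketch. It assumes an idealized random sequence in which the four bases are independent and equally likely, which real DNA is not; the function name is hypothetical.

```python
def expected_occurrences(target_length, probe_length):
    """Expected number of times one specific probe sequence occurs in a
    random target, assuming independent, equally likely bases."""
    binding_sites = max(target_length - probe_length + 1, 0)
    return binding_sites / 4 ** probe_length


# A triplet such as CAT is expected about once per 64 binding sites.
print(expected_occurrences(66, 3))
# An 8-mer is expected far less than once in a 1000-base target.
print(expected_occurrences(1000, 8))
```

Under this assumption, lengthening the probe by one base reduces the expected number of chance occurrences roughly four-fold.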
The first potential probe binding site in a nucleotide sequence chain starts at the first base and extends for the length of the probe. The second probe binding site starts at the second base and overlaps the first probe binding site, less one base. This means that if a complete (or sufficient) set of probes is tested, the end of each positive probe overlaps with the beginning of another positive probe, except in the case of the last positive probe in the target. In each sequence assembly cycle, four potential overlap probes are checked. Starting with a positive probe AAATC, the next positive overlapping probe to the right may be AATCA, AATCC, AATCG or AATCT. Of these probes, only AATCG is found to be positive and is used for further assembly. The cycles are repeated in both directions until all positive probes are incorporated and the complete sequence is assembled. By extension, the same process applies to a longer target nucleic acid if enough probes of appropriate length are used to identify uniquely overlapped strings within it.
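The assembly cycle described above can be sketched in Python as follows. This is an illustrative simplification, not a definitive implementation: it assumes error-free scores, probes of length two or more, and a target in which every overlap occurs only once. The 12-base target and the helper names are hypothetical.

```python
BASES = "ACGT"


def spectrum(target, k):
    """All k-mers present in the target (the set of positive probes)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def assemble(positive, seed):
    """Extend the seed probe in both directions, one base per cycle,
    while exactly one of the four overlapping probes is positive."""
    k = len(seed)
    sequence, used = seed, {seed}
    while True:  # extend to the right
        overlap = sequence[-(k - 1):]
        hits = [overlap + b for b in BASES
                if overlap + b in positive and overlap + b not in used]
        if len(hits) != 1:
            break
        sequence += hits[0][-1]
        used.add(hits[0])
    while True:  # extend to the left
        overlap = sequence[:k - 1]
        hits = [b + overlap for b in BASES
                if b + overlap in positive and b + overlap not in used]
        if len(hits) != 1:
            break
        sequence = hits[0][0] + sequence
        used.add(hits[0])
    return sequence


positive_probes = spectrum("GCAAATCGTTAC", 5)
print(assemble(positive_probes, "AAATC"))
```

Starting from AAATC, the only positive right-hand overlap is AATCG, as in the example in the text, and the cycles continue until no further overlapping probe is found.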
The use of overlapping positive probes is a key aspect of SBH methods. This “overlap principle” allows the identification of sequences within a target DNA that are longer than any of the probes used in the assembly process. Probe overlap allows indirect assignment of one out of four bases for each position in the analyzed DNA chain without performing any actual positional measurements on the sample. The base/position information is in fact derived from the known sequences of the oligonucleotide probes obtained by accurate chemical synthesis.
Thus, a DNA hybridization search is effectively a highly parallel molecular computation process with fully random access to the “input data,” in this case a polynucleotide chain that may be thousands of bases long. These fundamental characteristics of the SBH process confer unique opportunities for miniaturization and parallel analyses, leading to speed and cost efficiencies not available with other sequencing methods.
Because the sequences of DNA molecules are non-random and irregular, statistical artifacts arise that must be addressed in SBH experiments. Even when the lengths of DNA targets and probes are selected to achieve a statistical expectation that each probe sequence occurs no more than once in the target, so-called “branching ambiguities” can occur. (Drmanac et al., Yugoslav Patent Application 570/87 (1987) issued as U.S. Pat. No. 5,202,231 (1993); Drmanac et al., “Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method,” Genomics, 4:114-128 (1989).) Take the case of three probes that positively hybridize to a target DNA: TAGA, AGAC and AGAT. Both the second and the third probes overlap with the first probe, sharing the bases AGA and giving extended sequences TAGAC and TAGAT, respectively. Due to the occurrence of the sequence AGA in both the second and third probes (e.g. due to double AGA occurrence in the target), there is not enough information available to decide which of the two probes is actually the one that overlaps with the first probe in the sample. Sequence assembly can thus proceed along either of the two branches, only one of which may be correct. Branching ambiguities may be resolved if a reference sequence for the target is known.
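A branching ambiguity of this kind can be detected mechanically. The following Python sketch groups positive probes by their leading K-1 bases and reports any overlap that can be extended by more than one positive probe; the function name is hypothetical.

```python
from collections import defaultdict


def right_branches(positive):
    """Map each (K-1)-base overlap to the positive probes that begin with
    it, keeping only overlaps with more than one possible extension."""
    by_overlap = defaultdict(list)
    for probe in sorted(positive):
        by_overlap[probe[:-1]].append(probe)
    return {overlap: probes
            for overlap, probes in by_overlap.items() if len(probes) > 1}


print(right_branches({"TAGA", "AGAC", "AGAT"}))
```

For the three probes in the example, the overlap AGA is reported with both AGAC and AGAT, so assembly from TAGA cannot choose between TAGAC and TAGAT without further information.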
By using all possible probes of a given length, a researcher can unambiguously determine a target nucleotide sequence, provided the target nucleic acid is short enough that most overlap sequences occur no more than once. The only other exception to this rule is tandem repeat regions (e.g., AAAAAAAAAA (SEQ ID NO: 1), ACACACACAC (SEQ ID NO: 2)) that are longer than the probe length. In such cases, the exact length of these repeats may be determined by use of a special subset of longer probes. Longer targets may require longer probes for unambiguous sequence determination. A variety of ways have been proposed to increase the read length with a given set of probes, or to reduce the number of experimental probe/target scores needed to sequence a target nucleic acid. These include the use of redundant combinations of probes, competitive hybridization and overlapped clones (Drmanac et al., Yugoslav Patent Application 570/87 (1987) issued as U.S. Pat. No. 5,202,231 (1993); Drmanac et al., “Sequencing of Megabase Plus DNA by Hybridization: Theory of the Method,” Genomics, 4:114-128 (1989)), gapped probes (Bains et al., “A Novel Method for Nucleic Acid Sequencing,” J. Theor. Biol., 135:303-307 (1988)), binary probes (Pevzner et al., “Towards DNA Sequencing Chips,” Mathematical Foundations of Computer Science 1994 (Eds. I. Privara, B. Rovan, P. Ruzicka), pp. 143-158, The Proceedings of 19th International Symposium, MFCS '94, Kosice, Slovakia, Springer-Verlag, Berlin (1995)), continuous stacking hybridization (Khrapko et al., “An Oligonucleotide Hybridization Approach to DNA Sequencing,” FEBS Letters, 256:118-122 (1989)), and the simultaneous sequencing of similar genomes (Drmanac et al., “Sequencing by Hybridization (SBH) With Oligonucleotide Probes as an Integral Approach for the Analysis of Complex Genomes,” International Journal of Genomic Research, 1(1): 59-79 (1992)).
There are several approaches available to achieve sequencing by hybridization. In a process called SBH Format 1, nucleic acid samples are arrayed, and labeled probes are hybridized with the samples. Replica membranes with the same sets of sample nucleic acids may be used for parallel scoring of several probes and/or probes may be multiplexed (i.e., probes containing different labels). Nucleic acid samples may be arrayed and hybridized on nylon membranes or other suitable supports. Each membrane array may be reused many times. Format 1 is especially efficient for batch processing large numbers of samples.
In SBH Format 2, probes are arrayed at locations on a substrate which correspond to their respective sequences, and a labelled nucleic acid sample fragment is hybridized to the arrayed probes. In this case, sequence information about a fragment may be determined in a simultaneous hybridization reaction with all of the arrayed probes. For sequencing other nucleic acid fragments, the same oligonucleotide array may be reused. The arrays may be produced by spotting or by in situ synthesis of probes.
In Format 3 SBH, two sets of probes are used. In one embodiment, one set may be in the form of arrays of probes with known positions in the array, and another, labelled set may be stored in multiwell plates. In this case, target nucleic acid need not be labelled. Target nucleic acid and one or more labelled probes are added to the arrayed sets of probes. If one attached probe and one labelled probe both hybridize contiguously on the target nucleic acid, they can be covalently ligated, producing a detected sequence whose length equals the sum of the lengths of the ligated probes. The process allows for sequencing long nucleic acid fragments, e.g., a complete bacterial genome, without subcloning the nucleic acid into smaller pieces.
However, to sequence long nucleic acids unambiguously, SBH involves the use of long probes. As the length of the probes increases, so does the number of probes required to generate sequence information. Each 2-fold increase in length of the target requires a one-nucleotide increase in the length of the probe, resulting in a four-fold increase in the number of probes required (the complete set of probes of length K contains 4^K probes). For example, de novo sequencing without additional mapping information of 100 nucleotides of DNA requires 16,384 7-mers; sequencing 200 nucleotides requires 65,536 8-mers; 400 nucleotides, 262,144 9-mers; 800 nucleotides, 1,048,576 10-mers; 1600 nucleotides, 4,194,304 11-mers; 3200 nucleotides, 16,777,216 12-mers; 6400 nucleotides, 67,108,864 13-mers; and 12,800 nucleotides requires 268,435,456 14-mers.
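The figures in this series follow from two simple rules: a complete set of K-mers contains four to the power K probes, and each additional probe base doubles the target length that can be handled. A short Python sketch reproduces them; the function names are hypothetical, and the starting point of 100 nucleotides for 7-mers is taken from the example above.

```python
def complete_probe_set_size(k):
    """Number of distinct probes of length k over the four bases."""
    return 4 ** k


def target_length_for(k, base_k=7, base_length=100):
    """Target length from the series in the text: 100 nucleotides for
    7-mers, doubling with each additional probe base."""
    return base_length * 2 ** (k - base_k)


for k in range(7, 15):
    print(f"{k}-mers: {complete_probe_set_size(k):>11,} probes "
          f"for {target_length_for(k):>6,} nucleotides")
```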
From any given sequence, however, most of the probes will be negative, and thus much of the information is redundant. For sequencing a 200 bp target nucleic acid with 65,536 8-mers, for example, about 330 measurements (positive and negative) are made for each base pair (65,536 probe measurements/200 bp). For sequencing a 6400 bp sequence with 67,108,864 13-mer probes, the measurement redundancy increases to about 10,500. An improvement in SBH that increases the efficiency and reduces the number of necessary measurements would greatly enhance the practical ability to sequence long pieces of DNA de novo. Such an improvement would, of course, also enhance resequencing and other applications of SBH.
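The measurement redundancy quoted above is the size of the complete probe set divided by the target length. A minimal Python sketch of that arithmetic, with a hypothetical function name:

```python
def measurements_per_base(probe_length, target_length):
    """Probe measurements (positive and negative) made per base pair
    when a complete set of probes is scored against one target."""
    return 4 ** probe_length / target_length


print(measurements_per_base(8, 200))    # 8-mers on a 200 bp target
print(measurements_per_base(13, 6400))  # 13-mers on a 6400 bp target
```

The two results, roughly 328 and 10,486, correspond to the approximate figures of 330 and 10,500 given in the text.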
Of interest are disclosures of the use of “binary” pools [see Pevzner and Lipschutz, in Mathematical Foundations of Computer Science 1994, Springer-Verlag, Berlin, pages 143-158 (1995)], “alternating” probes [Pevzner and Lipschutz, supra], “gapped” probes [Pevzner and Lipschutz, supra; Bains and Smith, J. Theor. Biol., 135:303-307 (1988)], redundant combinations (pools) of probes [Drmanac et al., U.S. Pat. No. 5,202,231], and probes with degenerate ends in SBH [Bains, Genomics, 11:294-301 (1991)]. See also pools of multiplexed probes [Drmanac and Crkvenjakov, Scientia Yugoslavica, 16(1-2):97-107 (1990)].
Also of interest is the suggestion in WO 95/09248 that extension of the sequence of probe X may be carried out by comparing signals of (a) the four possible overlapping probes generated by a one base extension of the sequence of X and (b) the three single mismatch probes wherein the mismatch position is the first position of X, and adding a base extension only if probe X and the probe created by the base extension have a significantly positive signal compared to the other six probes.
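One possible reading of this comparison is sketched below in Python. It is not taken from WO 95/09248: the signal values are hypothetical, the overlapping probes are assumed to be X shifted by one base with each of the four bases appended, and "significantly positive" is modelled by an assumed ratio threshold over the strongest of the other six probes.

```python
BASES = "ACGT"


def extend_by_one_base(x, signals, ratio=2.0):
    """Return the base to append to probe x, or None if no extension is
    clearly supported. `signals` maps probe sequences to measured
    intensities; `ratio` is an assumed significance threshold."""
    # (a) four overlapping probes produced by a one-base extension of x
    extensions = [x[1:] + b for b in BASES]
    # (b) three probes mismatched with x at its first position
    mismatches = [b + x[1:] for b in BASES if b != x[0]]
    best = max(extensions, key=lambda p: signals.get(p, 0.0))
    others = [p for p in extensions + mismatches if p != best]
    background = max(signals.get(p, 0.0) for p in others)
    # Both x and the chosen extension must stand out from the others.
    support = min(signals.get(x, 0.0), signals.get(best, 0.0))
    if support > 0 and support > ratio * background:
        return best[-1]
    return None


signals = {"AAATC": 10.0, "AATCG": 9.0, "AATCA": 1.0, "CAATC": 2.0}
print(extend_by_one_base("AAATC", signals))
```

With these hypothetical signals, the extension G is accepted because both AAATC and AATCG exceed twice the strongest competing signal; raising the signal of AATCA close to that of AATCG would cause the extension to be withheld.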