The invention pertains to methods for determining the order of a set of subsequences, and more particularly, a method for determining the sequence of a series of nucleic acids by ordering a collection of probes.
The ability to determine nucleic acid sequences is critical for understanding the function and control of genes and for applying many of the basic techniques of molecular biology. Sequencing of the human genome and the genomes of model organisms was first made possible by the inventions of Sanger et al., PNAS 74: 5463-5467 (1977) and Maxam et al., PNAS 74: 560-564 (1977). The Sanger method has seen great advances, including automation, but still only 300 to 500 bases can be sequenced under optimum conditions.
Sequencing by hybridization (SBH) is a new and promising approach to DNA sequencing which offers the potential of reduced cost and higher throughput over traditional gel-based approaches. Strezoska et al., PNAS USA 88: 10089-10093 (1991) first accurately sequenced 100 base pairs of a known sequence using hybridization techniques, although the approach was proposed independently by several groups, including Bains and Smith, Journal of Theoretical Biology 135:303-307 (1988); Drmanac and Crkvenjakov, U.S. Pat. No. 5,202,231; Fodor et al., U.S. Pat. No. 5,424,186; Lysov et al., Dokl. Acad. Sci. USSR 303: 1508- (1988); Macevicz, U.S. Pat. No. 5,002,867; and Southern, European Patent EP 0 373 203 B1 and IPN WO 93/22480. More recently, Crkvenjakov's and Drmanac's laboratories report sequencing a 340 base-pair fragment in a blind experiment (Pevzner and Lipshutz, 19th Int. Conf. Mathematical Foundations of Computer Science, Springer-Verlag LNCS 841 143-158 (1994)). All of the above articles and patents are incorporated herein in their entirety.
The classical sequencing by hybridization (SBH) procedure attaches a large set of single-stranded fragments or probes to a substrate, forming a sequencing chip. A solution of labeled single-stranded target DNA fragments is exposed to the chip. These fragments hybridize with complementary fragments on the chip, and the hybridized fragments can be identified using a nuclear detector or a fluorescent/phosphorescent dye, depending on the selected label. Each hybridization, or the lack thereof, determines whether the string represented by the fragment is or is not a substring of the target. The target DNA can then be sequenced based on the constraints of which strings are and are not substrings of the target. The surveys of Pevzner and Lipshutz, 19th Int. Conf. Mathematical Foundations of Computer Science, Springer-Verlag LNCS 841 143-158 (1994) and Chetverin and Kramer, Bio/Technology 12: 1093-1099 (1994) give an excellent overview of the current state of the art in sequencing by hybridization, biologically, technologically, and algorithmically.
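The substring constraint described above can be illustrated with a minimal sketch. It assumes an idealized, error-free experiment and, for simplicity, writes each probe as the target substring it detects rather than as its complement; the function names are hypothetical and chosen only for illustration.

```python
def classical_spectrum(target: str, k: int) -> set:
    """Return every distinct length-k substring of the target.

    Idealized assumption: a probe hybridizes exactly when the string it
    represents occurs somewhere in the target, with no experimental error.
    """
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def is_substring_probe(probe: str, target: str) -> bool:
    """True when the string represented by the probe is a substring of the target."""
    return probe in target


if __name__ == "__main__":
    print(sorted(classical_spectrum("ATGCA", 3)))
```

Because the result is a set, a substring that occurs several times in the target is recorded only once, which is one reason repeats are hard to position from hybridization data alone.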
Sequencing by hybridization is a useful technique for general sequencing, and for rapidly sequencing variants of previously sequenced molecules. Furthermore, hybridization can provide an inexpensive procedure to confirm sequences derived using other methods.
The most widely used sequencing chip design, the classical sequencing chip C(k), contains all 4^k single-stranded oligonucleotides of length k. In C(8), all 4^8 = 65,536 octamers are used. The classical chip C(8) suffices to reconstruct 200 nucleotide-long sequences in only 94 of 100 cases (Pevzner et al., J. Biomolecular Structure and Dynamics 9: 399-410 (1991)), even in error-free experiments. Unfortunately, the length of unambiguously reconstructible sequences grows slower than the area of the chip. Thus, such exponential growth of the area inherently limits the length of the longest reconstructible sequence by classical SBH, and the chip area required by any single, fixed sequencing array on moderate length sequences will overwhelm the economies of scale and parallelism implicit in performing thousands of hybridization experiments simultaneously when using classical SBH methods.
Other variants of SBH (including nested-strand SBH (Rubinov and Gelfand, J. Computational Biology (1995)) and positional SBH (Broude, Sano, Smith and Cantor, PNAS (1994))) have been proposed to increase the resolving power of classical SBH, but these methods still require large arrays to sequence relatively few nucleotides.
The algorithmic aspect of sequencing by hybridization arises in the reconstruction of the test sequence from the hybridization data. The outcome of an experiment with a classical sequencing chip C(k) assigns to each of the 4^k strings a probability that it is a substring of the test sequence. In an experiment without error, these probabilities will all be 0 or 1, so each k-nucleotide fragment of the test sequence is unambiguously identified.
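The size of the classical chip follows directly from the four-letter alphabet. The short sketch below, with hypothetical function names, enumerates the probes of C(k) and confirms the count quoted for C(8).

```python
from itertools import product

BASES = "ACGT"


def chip_size(k: int) -> int:
    """Number of probes on the classical chip C(k): one per string of length k."""
    return 4 ** k


def classical_chip(k: int) -> list:
    """Enumerate every oligonucleotide of length k in lexicographic order."""
    return ["".join(p) for p in product(BASES, repeat=k)]


if __name__ == "__main__":
    for k in (4, 6, 8, 10):
        print(k, chip_size(k))
```

Each additional nucleotide of probe length multiplies the number of probes by four, which is the exponential growth in chip area discussed above.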
Although efficient algorithms do exist for finding the shortest string consistent with the results of a classical sequencing chip experiment, these algorithms have not proven useful in practice because previous SBH methods do not return sufficient information to sequence long fragments. One particular obstacle inherent in this method is the inability to accurately position repetitive sequences in DNA fragments. Furthermore, this method cannot determine the length of tandem short repeats, which are associated with several human genetic diseases (Warren S T, Science 1996; 271:1374-1375). These limitations have prevented its use as a primary sequencing method.
Additionally, sequencing by hybridization has so far failed to perform near the theoretical maximum efficiency. For example, the classical probing scheme uses a complete set of all 4^k k-nucleotide probes, wherein k is the length of each probe sequence. The set of hybridized probes is then used to construct a directed graph, in which the sequence is sought as either a Hamiltonian path or its equivalent Eulerian path. Probabilistic analysis and empirical evidence confirmed that using this method, k-nucleotide probes were adequate to reliably reconstruct sequences of length proportional only to the square root of 4^k, rather than to 4^k, as information theory predicts. Improvements to this algorithm (e.g., Skiena, U.S. Pat. No. 5,683,881, incorporated herein by reference) have been reported, but the maximum efficiency has been elusive.
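The Eulerian-path formulation mentioned above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the method of any cited reference: every k-nucleotide string is taken to occur exactly once in the target, the experiment is error-free, and the function name reconstruct_eulerian is hypothetical. Each hybridized probe becomes an edge from its first k-1 letters to its last k-1 letters, and a path using every edge once spells a candidate sequence.

```python
from collections import defaultdict


def reconstruct_eulerian(spectrum):
    """Spell one sequence that uses every k-mer of the spectrum exactly once.

    Returns None when no such sequence exists. Assumes an error-free
    spectrum in which no k-mer is repeated in the target.
    """
    spectrum = set(spectrum)
    if not spectrum:
        return None
    k = len(next(iter(spectrum)))
    graph = defaultdict(list)   # (k-1)-mer -> list of successor (k-1)-mers
    indeg = defaultdict(int)
    outdeg = defaultdict(int)
    for kmer in sorted(spectrum):
        u, v = kmer[:-1], kmer[1:]
        graph[u].append(v)
        outdeg[u] += 1
        indeg[v] += 1
    nodes = sorted(set(indeg) | set(outdeg))
    # An Eulerian path starts at the node with one more outgoing than incoming edge, if any.
    start = next((n for n in nodes if outdeg[n] - indeg[n] == 1), nodes[0])
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if graph[node]:
            stack.append(graph[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    seq = path[0] + "".join(n[-1] for n in path[1:])
    # Accept only if the spelled sequence reproduces the spectrum exactly.
    windows = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    if len(path) != len(spectrum) + 1 or windows != spectrum:
        return None
    return seq
```

The 3-nucleotide spectrum of ATGGCGTGCA is also the spectrum of ATGCGTGGCA, so the function can return only one of two equally consistent sequences. This is the kind of ambiguity that limits the length reconstructible with classical probes.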
A more efficient strategy for sequencing genes by hybridization would be a tremendous boon to the biotechnology industry. For example, the tremendous potential utility of genomic sequencing projects is directly restrained by the speed of the sequencing process itself. Methods which increase the speed and efficiency of DNA sequencing proportionally increase the speed at which such projects can unlock the secrets of evolution and molecular biology.
The systems and methods described herein relate to the sequencing of nucleotide sequences using probes comprising a pattern of universal and designate nucleotides. Such probes are referred to herein as ‘gapped probes’ to reflect the sequence gaps created by the universal nucleotides. A universal nucleotide, as the term is used herein, describes a chemical entity which, when present in the probe, will engage in a base-pairing relationship with any natural nucleotide. Exemplary universal nucleotides include 5-nitroindole and 3-nitropyrrole, although other universal nucleotides useful for the systems and methods described herein will be known to those of skill in the art. A universal nucleotide is represented herein as U, and a designate nucleotide, e.g., A, C, G, or T, is represented as X.
Although the pattern may comprise any sequence of designate and universal nucleotides, in certain systems, the pattern is an iterative pattern, i.e., a pattern which alternates a predetermined number of universal nucleotides with a predetermined number of designate nucleotides. Exemplary gapped probes may be defined in terms of the two variables s and r, wherein s represents the number of nucleotides in a designate nucleotide sequence of the probe, and r represents the number of iterations in the pattern, each iteration of length s and comprising a string of (s-1) universal nucleotides followed by a single designate nucleotide. For example, an (s,r)-probe wherein s is 2 and r is 3, i.e., a (2,3)-probe, would comprise the pattern XXUXUXUX. The contiguous sequence of designate nucleotides in a gapped probe as described herein is referred to as the root. In the exemplary probe above, the root is XX. The length of the root of a gapped probe as described herein is represented by the variable s. A designate nucleotide, or sequence of designate nucleotides, following the first string of one or more universal nucleotides following the root is referred to herein as the first segment. In the exemplary probe above, the first segment is the X at the fourth position. A designate nucleotide, or sequence of designate nucleotides, following a string of one or more universal nucleotides following the first segment is referred to herein as the second segment. In the exemplary probe above, the second segment is the X at the sixth position. Further segments are numbered in an analogous manner. The last designate nucleotide in the probe, typically the last nucleotide in the probe, is referred to herein as the last segment. The terms employed herein are provided to describe with clarity the exemplary gapped probe XXUXUXUX, given above, wherein the root is followed by a first and last segment.
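The iterative pattern defined above can be generated mechanically from s and r. The following sketch uses hypothetical function names and assumes the exemplary layout in which the root comes first and each segment is a single designate nucleotide.

```python
def gapped_pattern(s: int, r: int) -> str:
    """Build the generic (s,r)-probe pattern: a root of s designate nucleotides
    followed by r iterations of (s-1) universal nucleotides and one designate."""
    if s < 1 or r < 0:
        raise ValueError("s must be at least 1 and r must be non-negative")
    return "X" * s + ("U" * (s - 1) + "X") * r


def segment_positions(pattern: str, s: int) -> list:
    """1-based positions of the designate nucleotides that follow the root.

    Assumes single-nucleotide segments, as in the iterative pattern."""
    return [i + 1 for i, c in enumerate(pattern) if c == "X" and i >= s]


if __name__ == "__main__":
    p = gapped_pattern(2, 3)
    print(p, len(p), p.count("X"), segment_positions(p, 2))
```

An (s,r)-probe built this way has length s(r+1) but only s+r designate positions, so a complete set contains 4^(s+r) probes rather than 4^(s(r+1)).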
However, it will be understood that in other embodiments, the contiguous sequence that forms the probe can have an alternate pattern, for example, one wherein the root occurs in the middle, or generally the middle, of the sequence, or one wherein the root occurs at the end of the sequence. These alternate probe embodiments can similarly be employed for sequencing, and the techniques disclosed herein for employing these probes to order a spectrum of hybridized probes can be practiced with any of these probe embodiments.
The systems and methods described herein further pertain to sequencing chips carrying a set of gapped probes. A set of gapped probes, as the term is used herein, refers to a collection of probes having the same generic probe sequence, e.g., at least ten instances of the generic probe sequence. A generic probe sequence describes a pattern of designate and universal nucleotides, e.g., XXXXUUXUXX. An instance of a generic probe sequence is a sequence of designate and universal nucleotides which conforms to the pattern of the generic probe sequence, e.g., TCTAUUGUCG and GTATUUCUAG are instances of the generic probe sequence XXXXUUXUXX. In certain embodiments, a set of gapped probes comprises probes representing every instance of the designate nucleotides of the generic probe sequence.
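The relationship between a generic probe sequence and its instances can be checked with a short sketch. The function names are hypothetical, and U denotes a universal nucleotide as defined above, not uracil.

```python
from itertools import product

BASES = "ACGT"


def is_instance(probe: str, generic: str) -> bool:
    """True when the probe has U wherever the generic sequence has U and a
    designate nucleotide (A, C, G, or T) wherever it has X."""
    if len(probe) != len(generic):
        return False
    return all(
        (g == "U" and p == "U") or (g == "X" and p in BASES)
        for p, g in zip(probe, generic)
    )


def all_instances(generic: str):
    """Yield every instance of the generic probe sequence, i.e. a complete set."""
    n = generic.count("X")
    for combo in product(BASES, repeat=n):
        letters = iter(combo)
        yield "".join(next(letters) if g == "X" else "U" for g in generic)


if __name__ == "__main__":
    print(is_instance("TCTAUUGUCG", "XXXXUUXUXX"))
    print(sum(1 for _ in all_instances("XXUX")))
```

A complete set for a generic sequence with n designate positions contains 4^n probes, regardless of how many universal nucleotides the pattern includes.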
The systems and methods described herein also relate to a process for sequencing nucleic acid sequences using gapped probes. Such a process may include providing a set of gapped probes of length k wherein the designate nucleotides vary among the set in a predetermined fashion and wherein the generic probe sequence requires a designate nucleotide at the mth position and the kth position, determining the spectrum of probes in the set of probes which hybridize with a test sequence, analyzing the spectrum of probes, and determining the sequence of the test sequence. The process may further include attaching a primer to the test sequence. Analyzing the spectrum of probes may comprise selecting probes from the spectrum whose first k-1 designate nucleotides correspond to the last k-1 designate nucleotides of the probing pattern positioned at the end of the growing sequence, matching these probes with the growing sequence to determine the next nucleotide in the growing sequence, and repeating the steps of selecting and matching until matching is no longer possible. Analyzing the spectrum of probes may further comprise selecting probes from the spectrum whose first m-1 nucleotides correspond to the last m-1 nucleotides of the growing sequence, matching these probes with the growing sequence to determine the next nucleotide, and repeating the steps of selecting and matching until conclusive matching is no longer possible. Analyzing the spectrum of probes may further comprise selecting a first probe, selecting probes from the spectrum which have a root of length s whose first s-1 nucleotides correspond to the last s-1 nucleotides of the first probe, matching these probes with the growing sequence to determine the next nucleotide, and repeating the steps of selecting and matching until conclusive matching is no longer possible.
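The select-and-match loop described above can be sketched in simplified form. This sketch makes several assumptions that go beyond the text: the spectrum is error-free, the generic probe sequence ends in a designate nucleotide, a primer of at least k-1 known nucleotides is available, and extension simply stops at the first position where the next nucleotide is not uniquely determined. All names are hypothetical.

```python
BASES = "ACGT"


def mask(window: str, pattern: str) -> str:
    """Keep the window's letters at designate (X) positions; write U elsewhere."""
    return "".join(c if p == "X" else "U" for c, p in zip(window, pattern))


def gapped_spectrum(target: str, pattern: str) -> set:
    """Idealized spectrum: every gapped probe that matches some window of the target."""
    k = len(pattern)
    return {mask(target[i:i + k], pattern) for i in range(len(target) - k + 1)}


def extend(primer: str, spectrum: set, pattern: str, max_length: int = 10_000):
    """Grow the sequence one nucleotide at a time while exactly one probe supports it.

    Returns (sequence, candidates): candidates is empty when no probe supports
    any extension (or max_length was reached) and lists two or more
    nucleotides when the next position is ambiguous.
    """
    k = len(pattern)
    if k < 2 or pattern[-1] != "X" or len(primer) < k - 1:
        raise ValueError("need a pattern ending in X and a primer of length >= k-1")
    seq = primer
    while len(seq) < max_length:
        tail = seq[-(k - 1):]
        # Position the probing pattern at the end of the growing sequence and
        # test each possible next nucleotide against the spectrum.
        candidates = [b for b in BASES if mask(tail + b, pattern) in spectrum]
        if len(candidates) != 1:
            return seq, candidates
        seq += candidates[0]
    return seq, []
```

For example, with the pattern XXUX and the target ACGTAGGC, starting from the primer ACG recovers the full target and then stops because no probe supports a further nucleotide.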
Optionally, if a step of matching provides two or more possibilities for the next nucleotide, two or more growing sequences may be established corresponding to each of the possibilities for the next nucleotide. These alternate sequences may then be subjected to the above analysis, whereby the incorrect sequences may terminate rapidly as being unsupported by the spectrum.
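The branching strategy described above can be sketched as a search over alternate growing sequences. This is a minimal sketch, assuming an error-free spectrum and a known target length (the target_length parameter is a hypothetical convenience, not something the text requires); the worst-case number of branches is exponential, although unsupported branches usually die out quickly.

```python
BASES = "ACGT"


def mask(window: str, pattern: str) -> str:
    """Keep the window's letters at designate (X) positions; write U elsewhere."""
    return "".join(c if p == "X" else "U" for c, p in zip(window, pattern))


def gapped_spectrum(target: str, pattern: str) -> set:
    """Idealized spectrum: every gapped probe that matches some window of the target."""
    k = len(pattern)
    return {mask(target[i:i + k], pattern) for i in range(len(target) - k + 1)}


def branch_extend(primer: str, spectrum: set, pattern: str, target_length: int) -> list:
    """Follow every supported extension and return the sequences of the given
    length whose own spectrum equals the observed spectrum."""
    k = len(pattern)
    spectrum = set(spectrum)
    growing = [primer]          # alternate growing sequences still being extended
    supported = []
    while growing:
        seq = growing.pop()
        if len(seq) >= target_length:
            # A finished branch is kept only if it accounts for every probe.
            if len(seq) == target_length and gapped_spectrum(seq, pattern) == spectrum:
                supported.append(seq)
            continue
        tail = seq[-(k - 1):]
        for base in BASES:
            if mask(tail + base, pattern) in spectrum:
                growing.append(seq + base)
        # A branch with no supported extension is simply dropped here.
    return sorted(supported)
```

With the pattern XXUX and the target ACGTACTAG, the primer ACG has two supported continuations (A and T). The branch beginning ACGA terminates after one further step, and a later spurious branch is rejected because it does not account for every probe in the spectrum.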
The systems and methods described herein further comprise a computer program capable of analyzing a spectrum of probes comprising a natural nucleotide sequence and a pattern of universal and natural nucleotides to determine the sequence of the test sequence, e.g., by the method described above, and a disk, CD, or other storage device which contains such a program.