The ability to determine nucleic acid sequences is critical for understanding the function and control of genes and for applying many of the basic techniques of molecular biology. Sequencing the genomes of humans and of model organisms was first made possible by the inventions of Sanger et al. PNAS 74: 5463–5467 (1977) and Maxam et al. PNAS 74: 560–564 (1977). The Sanger method has seen great advances, including automation, but still only 300 to 500 bases can be sequenced under optimum conditions.
Sequencing by hybridization (SBH) is a new and promising approach to DNA sequencing which offers the potential of reduced cost and higher throughput over traditional gel-based approaches. Strezoska, et al. PNAS USA 88: 10089–10093 (1991) first accurately sequenced 100 base pairs of a known sequence using hybridization techniques, although the approach was proposed independently by several groups, including Bains and Smith, Journal of Theoretical Biology 135:303–307 (1988); Drmanac and Crkvenjakov U.S. Pat. No. 5,202,231; Fodor et al. U.S. Pat. No. 5,424,186; Lysov, et al. Dokl. Acad. Sci. USSR 303: 1508-(1988); Macevicz, U.S. Pat. No. 5,002,867; and Southern, European Patent EP 0 373 203 B1 and IPN WO 93/22480. More recently, Crkvenjakov's and Drmanac's laboratories reported sequencing a 340 base-pair fragment in a blind experiment (Pevzner and Lipshutz, 19th Int. Conf. Mathematical Foundations of Computer Science, Springer-Verlag LNCS 841 143–158 (1994)). All of the above articles and patents are incorporated herein in their entirety.
The classical sequencing by hybridization (SBH) procedure attaches a large set of single-stranded fragments or probes to a substrate, forming a sequencing chip. A solution of labeled single-stranded target DNA fragments is exposed to the chip. These fragments hybridize with complementary fragments on the chip, and the hybridized fragments can be identified using a nuclear detector or a fluorescent/phosphorescent dye, depending on the selected label. Each hybridization, or the lack thereof, determines whether the string represented by the fragment is or is not a substring of the target. The target DNA can then be sequenced based on the constraints of which strings are and are not substrings of the target. The surveys Pevzner and Lipshutz, 19th Int. Conf. Mathematical Foundations of Computer Science, Springer-Verlag LNCS 841 143–158 (1994) and Chetverin and Kramer Bio/Technology 12: 1093–1099 (1994) give an excellent overview of the current state of the art in sequencing by hybridization, biologically, technologically, and algorithmically.
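The substring constraints described above can be illustrated with a minimal Python sketch. It assumes an idealized, error-free experiment and, as in the text, treats each probe as the string it represents rather than modeling strand complementarity. The function names `classical_chip`, `ideal_spectrum`, and `hybridization_pattern` are hypothetical and are not taken from any of the cited references.

```python
from itertools import product


def classical_chip(k: int) -> list[str]:
    """All strings of length k over the DNA alphabet (one per chip position)."""
    return ["".join(p) for p in product("ACGT", repeat=k)]


def ideal_spectrum(target: str, k: int) -> set[str]:
    """Set of length-k substrings of the target (idealized, error-free)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def hybridization_pattern(target: str, k: int) -> dict[str, bool]:
    """Map each probe to True if it is a substring of the target, else False."""
    spectrum = ideal_spectrum(target, k)
    return {probe: probe in spectrum for probe in classical_chip(k)}


if __name__ == "__main__":
    pattern = hybridization_pattern("ATGCAT", 3)
    positives = sorted(p for p, hit in pattern.items() if hit)
    print(positives)
```

Only the set of positive probes is observed in this model; the positions and multiplicities of the substrings within the target are not.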
Sequencing by hybridization is a useful technique for general sequencing, and for rapidly sequencing variants of previously sequenced molecules. Furthermore, hybridization can provide an inexpensive procedure to confirm sequences derived using other methods.
The most widely used sequencing chip design, the classical sequencing chip C(k), contains all 4^k single-stranded oligonucleotides of length k. In C(8) all 4^8 = 65,536 octamers are used. The classical chip C(8) suffices to reconstruct 200 nucleotide-long sequences in only 94 of 100 cases (Pevzner, et al. J. Biomolecular Structure and Dynamics 9: 399–410 (1991)), even in error-free experiments. Unfortunately, the length of unambiguously reconstructible sequences grows more slowly than the area of the chip. Thus, such exponential growth of the area inherently limits the length of the longest reconstructible sequence by classical SBH, and the chip area required by any single, fixed sequencing array on moderate length sequences will overwhelm the economies of scale and parallelism implicit in performing thousands of hybridization experiments simultaneously when using classical SBH methods.
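As a rough illustration of this exponential growth, the number of probes on C(k) can be tabulated for a few probe lengths. The helper name `chip_probe_count` is hypothetical and is used here only for illustration.

```python
def chip_probe_count(k: int) -> int:
    """Number of probes on the classical chip C(k): one per length-k DNA string."""
    return 4 ** k


if __name__ == "__main__":
    for k in range(6, 13, 2):
        print(f"C({k}): {chip_probe_count(k):,} probes")
```

Each additional probe nucleotide multiplies the number of probes, and hence the chip area, by four.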
Other variants of SBH, including nested-strand SBH (Rubinov and Gelfand, J. Computational Biology (1995)) and positional SBH (Broude, Sano, Smith and Cantor, PNAS (1994)), have been proposed to increase the resolving power of classical SBH, but these methods still require large arrays to sequence relatively few nucleotides.
The algorithmic aspect of sequencing by hybridization arises in the reconstruction of the test sequence from the hybridization data. The outcome of an experiment with a classical sequencing chip C(k) assigns to each of the 4^k strings a probability that it is a substring of the test sequence. In an experiment without error, these probabilities will all be 0 or 1, so each k-nucleotide fragment of the test sequence is unambiguously identified.
Although efficient algorithms do exist for finding the shortest string consistent with the results of a classical sequencing chip experiment, these algorithms have not proven useful in practice because previous SBH methods do not return sufficient information to sequence long fragments. One particular obstacle inherent in this method is the inability to accurately position repetitive sequences in DNA fragments. Furthermore, this method cannot determine the length of tandem short repeats, which are associated with several human genetic diseases (Warren S T, Science 1996; 271:1374–1375). These limitations have prevented its use as a primary sequencing method.
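The tandem-repeat limitation can be seen in a small Python sketch. It assumes an idealized, error-free experiment in which only the presence or absence of each length-k substring is observed. The function names and the example sequences are hypothetical and chosen only for illustration.

```python
def ideal_spectrum(target: str, k: int) -> set[str]:
    """Set of length-k substrings of the target (idealized, error-free)."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def tandem_repeat_target(left: str, unit: str, copies: int, right: str) -> str:
    """Build a sequence with `copies` tandem copies of `unit` between two flanks."""
    return left + unit * copies + right


if __name__ == "__main__":
    k = 3
    two_copies = tandem_repeat_target("G", "AC", 2, "T")   # GACACT
    five_copies = tandem_repeat_target("G", "AC", 5, "T")  # GACACACACACT
    # Both targets give the same set of positive probes, so the
    # repeat count cannot be recovered from presence/absence data alone.
    print(ideal_spectrum(two_copies, k) == ideal_spectrum(five_copies, k))
```

In this example the two targets differ in length by six nucleotides, yet every probe of length 3 gives the same answer for both.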
Additionally, sequencing by hybridization has so far failed to perform near the theoretical maximum efficiency. For example, the classical probing scheme uses a complete set of all 4^k k-nucleotide probes, wherein k is the length of each probe sequence. The set of hybridized probes is then used to construct a directed graph, and the sequence is reconstructed as either a Hamiltonian path or its equivalent Eulerian path through that graph. Probabilistic analysis and empirical evidence confirmed that, using this method, k-nucleotide probes were adequate to reliably reconstruct sequences of length proportional only to the square root of 4^k, rather than to 4^k, as information theory predicts. Improvements to this algorithm (e.g., Skiena, U.S. Pat. No. 5,683,881, incorporated herein by reference) have been reported, but the maximum efficiency has been elusive.
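The Eulerian-path formulation can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method of any cited reference: the experiment is error-free, every k-nucleotide substring occurs at most once in the target, and the path is found with Hierholzer's algorithm, which is one standard choice that the text does not itself specify. The function name `reconstruct_from_spectrum` is hypothetical.

```python
from collections import defaultdict


def reconstruct_from_spectrum(spectrum: set[str]) -> str:
    """Return one sequence whose k-mers, read in order, use each probe once.

    Nodes are (k-1)-mers; each probe is a directed edge from its prefix
    to its suffix. An Eulerian path through the edges spells a sequence.
    """
    if not spectrum:
        raise ValueError("empty spectrum")
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree per node
    for probe in sorted(spectrum):
        prefix, suffix = probe[:-1], probe[1:]
        out_edges[prefix].append(suffix)
        balance[prefix] += 1
        balance[suffix] -= 1
    # Start at the node with one surplus outgoing edge, if there is one.
    start = next((n for n in sorted(balance) if balance[n] == 1), min(out_edges))
    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    if len(path) != len(spectrum) + 1:
        raise ValueError("probes do not form a single Eulerian path")
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    probes = {"ACGT", "CGTT", "GTTG", "TTGC", "TGCA"}
    print(reconstruct_from_spectrum(probes))
```

When the graph admits more than one Eulerian path, this sketch returns only one of them, which is the ambiguity that limits the length of reliably reconstructible sequences.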
A more efficient strategy for sequencing genes by hybridization would be a tremendous boon to the biotechnology industry. For example, the potential utility of genomic sequencing projects is directly constrained by the speed of the sequencing process itself. Methods which increase the speed and efficiency of DNA sequencing proportionally increase the speed at which such projects can unlock the secrets of evolution and molecular biology.