Sequencing by hybridization (SBH) is a method for sequencing a polynucleotide such as a DNA molecule (Bains & Smith 1988, Lysov et al. 1988, Southern 1988, Drmanac and Crkvenjakov 1987, Macevicz 1989). In this method, a chip, or microarray, is used that consists of a surface upon which all possible oligonucleotide probes of a particular length k (referred to herein as “k-mers”) are immobilized (Southern 1996). The DNA molecule whose sequence is to be determined, referred to as the “target molecule”, is allowed to hybridize to the k-mers on the chip. The target molecule and the k-mers on the chip may all be single stranded molecules. Alternatively, a double stranded target may first be cut into fragments having single stranded “sticky ends”, and the k-mers on the chip may be the sticky ends of double stranded molecules. Ideally, a single stranded target or the sticky end of a double stranded target hybridizes to a k-mer on the chip if and only if the sequence complementary to the k-mer occurs somewhere in the target sequence or the sticky end. Thus, in principle, it is possible to determine experimentally the “k-spectrum” of the target (the set of all k-long substrings present in the target). In practice, however, the data are ambiguous because the target can also bind to k-mers that are only partially complementary to one of its substrings. Thus, any binarization of the hybridization signal will contain errors.
The goal of SBH is to determine the target sequence from the target spectrum. However, even if the target spectrum were error free, the target sequence is not uniquely determined by the spectrum. If the number of sequences consistent with the spectrum is large, there is no satisfactory method to select the true sequence. Theoretical analysis and simulations (Southern et al., 1992, Pevzner and Lipshutz 1994) have shown that even when the spectrum is errorless and the correct multiplicity of each k-mer in the target sequence is known, the average length of a uniquely reconstructible target sequence using a chip of 8-mers is only about two hundred nucleotides, far below the length of a DNA molecule that may be sequenced by electrophoresis.
Let $\Sigma=\{A,C,G,T\}$ designate the set of nucleotides composing a DNA molecule. $M=4$ is the “alphabet size”. A DNA sequence is a string over $\Sigma$, which is denoted herein between angle brackets ($\langle\ \rangle$). The k-spectrum of a target sequence $T$ of length $L$, $T=\langle t_1, t_2, \ldots, t_L\rangle$, is the set of all k-long substrings (k-mers) of $T$. For each k-mer $\bar{x}=\langle x_1, x_2, \ldots, x_k\rangle \in \Sigma^k$, we define $T(\bar{x})$ to be 1 if $\bar{x}$ is a substring of $T$, and 0 otherwise. We denote by $K=M^k$ the number of k-mers. A hybridization experiment measures, for each k-mer $\bar{x} \in \Sigma^k$, an intensity of its hybridization with the target.
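The definitions above can be illustrated with a minimal Python sketch, assuming an error-free (idealized) experiment. The names `SIGMA`, `k_spectrum`, and `indicator` are hypothetical and chosen only for illustration; a real experiment yields noisy intensities rather than the exact 0/1 values computed here.

```python
from itertools import product

SIGMA = "ACGT"  # the alphabet; M = len(SIGMA) = 4


def k_spectrum(target, k):
    """Return the set of all k-long substrings (k-mers) of target."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


def indicator(target, k):
    """Map each of the K = M**k possible k-mers x to T(x):
    1 if x is a substring of target, 0 otherwise (idealized, no noise)."""
    spectrum = k_spectrum(target, k)
    all_kmers = ("".join(p) for p in product(SIGMA, repeat=k))
    return {x: int(x in spectrum) for x in all_kmers}


if __name__ == "__main__":
    t = "ACGTAC"
    print(sorted(k_spectrum(t, 3)))
    print(sum(indicator(t, 3).values()), "of", len(SIGMA) ** 3, "k-mers present")
```

Note that the spectrum is a set, so it records which k-mers occur but not how many times each occurs.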
The result of an SBH experiment may be described by a graph in which each candidate target sequence is represented as a path (Pevzner et al., 1989). The graph is a directed de Bruijn graph $G(V,E)$ whose vertices are labeled by all the (k−1)-mers (the set of vertices is $V=\Sigma^{k-1}$), and whose edges are labeled by k-mers (the set of edges is $E=\Sigma^k$). The edge labeled $\langle x_1, x_2, \ldots, x_k\rangle$ connects the vertex $\langle x_1, x_2, \ldots, x_{k-1}\rangle$ to the vertex $\langle x_2, \ldots, x_k\rangle$. There is a one-to-one correspondence between L-long candidate target sequences and (L−k+1)-long paths in $G$ whose edge labels comprise the target spectrum. Hereafter, we interchangeably refer to edges and their labels, and also to sequences and their corresponding paths.
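The correspondence between sequences and paths can be sketched in Python as follows. This is an illustrative sketch only; the function names `edge_endpoints`, `sequence_to_path`, and `path_to_sequence` are hypothetical, and a path is represented simply as the list of its k-mer edge labels.

```python
def edge_endpoints(kmer):
    """Return the (k-1)-mer vertices joined by the edge labeled kmer:
    <x1..xk> goes from <x1..x(k-1)> to <x2..xk>."""
    return kmer[:-1], kmer[1:]


def sequence_to_path(seq, k):
    """Return the path of an L-long sequence as its L-k+1 consecutive k-mer edges."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def path_to_sequence(edges):
    """Spell out the sequence of a path given as consecutive k-mer edge labels."""
    for a, b in zip(edges, edges[1:]):
        # Consecutive edges must share a vertex: head of a == tail of b.
        if edge_endpoints(a)[1] != edge_endpoints(b)[0]:
            raise ValueError(f"edges {a} and {b} are not consecutive")
    return edges[0] + "".join(e[-1] for e in edges[1:])


if __name__ == "__main__":
    path = sequence_to_path("ACGTAC", 3)
    print(path, "->", path_to_sequence(path))
```

Each edge after the first contributes exactly one new nucleotide, which is why an L-long sequence corresponds to a path of L−k+1 edges.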
Since k-mers may reoccur in the target sequence, the paths do not have to be simple. When the spectrum is perfect and the multiplicities of the k-mers in the spectrum are known, every solution is an Eulerian path (Pevzner et al. 1989). In practice, however, the spectrum is not perfect and the multiplicities are not known. Alternative chip designs (Bains and Smith 1988, Khrapko et al. 1989, Pevzner et al. 1991, Preparata et al. 1999, Ben-Dor et al. 1999), as well as interactive protocols (Skiena and Sundaram 1995), have been suggested, often assuming additional information, in order to reduce the ambiguity of the hybridization-based reconstruction.
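For the idealized case of a perfect spectrum with known multiplicities, one Eulerian path can be found with Hierholzer's algorithm. The sketch below is an illustration under those assumptions, not the method of any cited reference; the function name `eulerian_sequence` is hypothetical. When the graph admits several Eulerian paths, the function returns just one of them, which is exactly the ambiguity discussed above.

```python
from collections import Counter, defaultdict


def eulerian_sequence(kmers):
    """Given a non-empty list of k-mers WITH multiplicity (error-free),
    return one sequence that uses each k-mer exactly as often as listed,
    or None if no Eulerian path exists."""
    if not kmers:
        return None
    out_edges = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree per (k-1)-mer vertex
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    if any(abs(b) > 1 for b in balance.values()) or len(starts) > 1:
        return None  # degree conditions for an Eulerian path are violated
    start = starts[0] if starts else kmers[0][:-1]
    # Hierholzer's algorithm, iterative form.
    stack, walk = [start], []
    while stack:
        u = stack[-1]
        if out_edges[u]:
            stack.append(out_edges[u].pop())
        else:
            walk.append(stack.pop())
    walk.reverse()
    if len(walk) != len(kmers) + 1:
        return None  # some edges were unreachable (graph not connected)
    return walk[0] + "".join(v[-1] for v in walk[1:])


if __name__ == "__main__":
    print(eulerian_sequence(["ACG", "CGT", "GTT", "TTG", "TGC", "GCA"]))
```

A caller that needs every consistent sequence, rather than one, would have to enumerate all Eulerian paths, whose number can grow quickly with the number of repeated (k−1)-mers.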
Nucleotide sequences from different sources may resemble each other because they derive from a common ancestral gene. This phenomenon is encountered between species, between duplicated regions within a genome, and between individuals within a population. Small differences in sequences, referred to as “Single Nucleotide Polymorphisms” or SNPs, efficiently serve as genetic markers that are useful in medicine. Thus the detection and genotyping of SNPs have become an important task for human geneticists. The evolution of homologous sequences from a common ancestral gene is mainly due to nucleotide substitution. Insertions and deletions of nucleotides are also known to have occurred during the evolution of homologous sequences, though at lower rates.
A DNA molecule having a known sequence and known to be homologous to a target molecule has not yet been used to reduce the ambiguity of SBH data in order to determine the target sequence.