The goal set by the National Human Genome Research Institute to promote the development of technology for sequencing mammalian-sized genomes for under $1000 was a dramatic acknowledgement of the tremendous value that nucleic acid sequence data has in virtually every area of the life sciences, Collins et al (2003), Nature, 422: 835-847. This challenge has spurred interest in many different sequencing approaches as alternatives to, or complements of, Sanger-based sequencing, which has been the work-horse sequencing technology for the last two decades, e.g. Margulies et al (2005), Nature, 437: 376-380; Shendure et al (2005), Science, 309: 1728-1732; Kartalov et al (2004), Nucleic Acids Research, 32: 2873-2879; Mitra et al (2003), Anal. Biochem., 320: 55-65; Metzker (2005), Genome Research, 15: 1767-1776; Shendure et al (2004), Nature Reviews Genetics, 5: 335-344; Balasubramanian et al, U.S. Pat. No. 6,787,308; and the like. A common attribute of many of these new approaches is the acquisition of sequence information from many short, randomly selected fragments in a highly parallel manner. Massive amounts of sequence information are generated that must be processed to reconstruct the sequence of the larger polynucleotide from which the fragments originated. Unfortunately, such processing presents a significant hurdle to many genome sequencing projects because of the well-known difficulties of reconstructing long polynucleotides from short sequences, e.g. Drmanac et al, Advances in Biochem. Engineering, 77: 75-101 (2002).
Another difficulty faced by current and developing sequencing technologies arises from the diploid nature of many organisms of interest. That is, the cells of all mammals and many other organisms of interest contain two copies of every genomic sequence, and the two sequences of each pair differ from one another to a small but significant degree due to natural allelic variation, mutations, and the like. Thus, when diploid genomes are reconstructed from shorter sequences, it is very difficult to determine which difference should be allocated to which sequence of the pair. A similar difficulty arises when sequencing populations of organisms, e.g. Tringe et al (2005), Nature Reviews Genetics, 6: 805-814. In the latter case, the sample may be a mixture of pathogens (for example, HIV or other viruses), where complete viral or bacterial strain or haplotype determination is critical for identifying an emerging resistant organism or a man-modified organism mixed with non-virulent natural strains.
In view of the above, it would be highly useful, particularly to many sequencing technologies under development, to have available a technique that would allow the generation of additional information about the location of short sequence reads in a genome.