Completion of the human genome sequence has paved the way for important insights into biological structure and function. Knowledge of the human genome has given rise to inquiry into differences between individuals, as well as differences within an individual, as the basis for differences in biological function and dysfunction. For example, single nucleotide differences between individuals, called single nucleotide polymorphisms (SNPs), can be responsible for dramatic phenotypic differences. Those differences can be outward expressions of phenotype or can involve the likelihood that an individual will develop a specific disease or how that individual will respond to treatment. Moreover, subtle genomic changes have been shown to be responsible for the manifestation of genetic diseases, such as cancer. A true understanding of the complexities in either normal or abnormal function will require large amounts of specific sequence information.
An understanding of cancer also requires an understanding of genomic sequence complexity. Cancer is a disease that is rooted in heterogeneous genomic instability. Most cancers develop from a series of genomic changes, some subtle and some significant, that occur in a small subpopulation of cells. Knowledge of the sequence variations that lead to cancer will lead to an understanding of the etiology of the disease, as well as ways to treat and prevent it. An essential first step in understanding genomic complexity is the ability to perform high-resolution sequencing.
Various approaches to nucleic acid sequencing exist. One conventional way to do bulk sequencing is by chain termination and gel separation, essentially as described by Sanger et al., Proc. Natl. Acad. Sci., 74(12): 5463-67 (1977). That method relies on the generation of a mixed population of nucleic acid fragments representing terminations at each base in a sequence. The fragments are then run on an electrophoretic gel and the sequence is revealed by the order of fragments in the gel. Another conventional bulk sequencing method relies on chemical degradation of nucleic acid fragments. See, Maxam et al., Proc. Natl. Acad. Sci., 74: 560-564 (1977). Finally, methods have been developed based upon sequencing by hybridization. See, e.g., Drmanac, et al., Nature Biotech., 16: 54-58 (1998).
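The chain-termination readout described above can be illustrated with a minimal sketch. It is a toy model rather than the procedure of Sanger et al.: it assumes every fragment carries a label identifying its terminal base, that the gel resolves fragments perfectly by length, and that a terminated fragment exists for every position. The function names and parameters are hypothetical.

```python
import random


def chain_termination_fragments(synthesized, copies_per_length=3, seed=0):
    """Return a shuffled pool of (length, terminal_base) fragments.

    `synthesized` is the strand being extended from the primer. The pool
    mimics a mixed population with terminations at each base (toy model).
    """
    rng = random.Random(seed)
    pool = [
        (position + 1, base)
        for position, base in enumerate(synthesized)
        for _ in range(copies_per_length)
    ]
    rng.shuffle(pool)  # the population is mixed before separation
    return pool


def read_gel(pool):
    """Order fragments by length and read the terminal base at each length."""
    terminal_base_by_length = {}
    for length, base in pool:
        # All fragments of one length end in the same base in this model.
        terminal_base_by_length[length] = base
    return "".join(
        terminal_base_by_length[n] for n in sorted(terminal_base_by_length)
    )


if __name__ == "__main__":
    pool = chain_termination_fragments("GATTACA")
    print(read_gel(pool))
```

Because the sequence is recovered from the order of fragment lengths, the read can be no better than the separation step, which is the resolving-power limit that electrophoresis imposes on this approach.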
Bulk sequencing techniques are poorly suited to identifying subtle or rare nucleotide changes because the many cloning, amplification, and electrophoresis steps complicate the process of gaining useful information regarding individual nucleotides. The ability to sequence and gain information from single molecules obtained from an individual patient is the next milestone for genomic sequencing. Accordingly, research has evolved toward methods for rapid sequencing, such as single molecule sequencing technologies.
There have been many proposals to develop new sequencing technologies based on single-molecule measurements, generally either by observing the interaction of particular proteins with DNA or by using ultra-high-resolution scanned probe microscopy. See, e.g., Rigler, et al., DNA-Sequencing at the Single Molecule Level, Journal of Biotechnology, 86(3): 161 (2001); Goodwin, P. M., et al., Application of Single Molecule Detection to DNA Sequencing, Nucleosides & Nucleotides, 16(5-6): 543-550 (1997); Howorka, S., et al., Sequence-Specific Detection of Individual DNA Strands using Engineered Nanopores, Nature Biotechnology, 19(7): 636-639 (2001); Meller, A., et al., Rapid Nanopore Discrimination Between Single Polynucleotide Molecules, Proceedings of the National Academy of Sciences of the United States of America, 97(3): 1079-1084 (2000); Driscoll, R. J., et al., Atomic-Scale Imaging of DNA Using Scanning Tunneling Microscopy, Nature, 346(6281): 294-296 (1990). Unlike conventional sequencing technologies, the speed and read length of these single-molecule approaches would not be inherently limited by the resolving power of electrophoretic separation. Other methods proposed for single molecule sequencing include detecting individual nucleotides as they are incorporated into a primed template, i.e., sequencing by synthesis.
A significant issue in single molecule sequencing techniques is the presence of homopolymeric regions in sample nucleic acids. Homopolymers are stretches of an identical base that are found throughout the genomes of most organisms. In single molecule experiments, one or more nucleotides are introduced to a template/primer complex in the presence of polymerase. Template-dependent nucleotide incorporation takes place as the primer is elongated. However, when the template contains a homopolymer, it is often difficult to determine whether a signal indicating nucleotide incorporation is due to the incorporation of a single nucleotide or of multiple nucleotides of the same species (e.g., adenine, guanine, thymine, cytosine, uracil, or their analogs) across a homopolymeric stretch on the template. The inability to resolve homopolymers can lead to problems in correctly analyzing and positioning sequence fragments in the genome. The invention addresses this and other problems associated with nucleic acid sequence information processing.
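The homopolymer ambiguity can be made concrete with a minimal sketch. It assumes a simplified sequencing-by-synthesis scheme in which one nucleotide species is introduced at a time in a fixed repeating order, and a detector that reports only whether incorporation occurred, not how many nucleotides were added. The flow order, the binary detector, and the function names are illustrative assumptions, and the input is the strand being synthesized (the complement of the template).

```python
from itertools import cycle, groupby


def binary_flow_signals(strand, flow_order="ACGT"):
    """Return (nucleotide, incorporated) pairs for each introduction step.

    A whole homopolymeric run is extended in a single step, and the
    assumed detector records that step as one undifferentiated signal.
    """
    unknown = set(strand) - set(flow_order)
    if unknown:
        raise ValueError(f"bases not in flow order: {sorted(unknown)}")

    signals = []
    flows = cycle(flow_order)  # keeps its position between runs
    for base, _run in groupby(strand):
        for nucleotide in flows:
            incorporated = nucleotide == base
            signals.append((nucleotide, incorporated))
            if incorporated:
                break  # run fully extended; move on to the next run
    return signals


def apparent_sequence(signals):
    """Sequence inferred if every signal is read as a single base."""
    return "".join(nt for nt, incorporated in signals if incorporated)


if __name__ == "__main__":
    with_run = binary_flow_signals("GATTACA")
    without_run = binary_flow_signals("GATACA")
    print(with_run == without_run)
    print(apparent_sequence(with_run))
```

Under these assumptions, a strand containing the run TT and a strand containing a single T yield identical signal traces, so the length of the run cannot be recovered from the presence or absence of signal alone.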