Large-scale sequence analysis of genomic DNA is central to understanding biological phenomena in humans and in many economically important plants and animals. Sequence analysis of whole genomes, particularly of the three billion base pairs in the human genome, involves a level of complexity that is compounded by the requirement for accuracy and speed in applications such as clinical diagnostics. In general, 60 billion or more sequence data points must be analyzed to produce an accurate genome sequence.
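As a rough illustration of the scale described above, the figure of 60 billion data points corresponds to redundant "coverage" of the roughly three billion base pair genome; the arithmetic below is a minimal sketch using only the numbers given in the text (the notion of average coverage depth is an assumption, not stated in the source):

```python
# Illustrative only: relate the cited data volume to average coverage depth.
GENOME_SIZE = 3_000_000_000   # haploid human genome, base pairs (approx.)
DATA_POINTS = 60_000_000_000  # sequenced bases analyzed, per the text

coverage = DATA_POINTS / GENOME_SIZE
print(f"Average coverage depth: {coverage:.0f}x")  # prints "Average coverage depth: 20x"
```

Each base of the genome is thus read, on average, about 20 times, which is how redundant analysis compensates for errors in any individual measurement.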
Early sequencing methods generated sequence data from thousands of isolated, very long fragments of DNA to preserve the contextual integrity of the sequence information and reduce the need for redundant testing to obtain accurate results. However, such methods cost hundreds of millions of dollars per genome due to the complexity of preparing the genome fragments and the relatively high cost of the individual biochemistry tests used to generate sequence data from those fragments.
Advancements in fixed array technologies reduced the complexity of preparing genomic fragments by providing the means to fragment a genome into millions of short pieces and computationally weave together the genome sequence through deep, redundant sequence analysis. Such advancements reduced the cost of genome sequencing from hundreds of millions to hundreds of thousands of dollars. However, these array technologies can be limited in applicability because they cannot provide contextual information, particularly the contextual information inherent in the fact that there are two distinct copies of the genome in each human cell. Accurate sequence analysis, particularly for clinical analysis and diagnosis, requires the ability to distinguish sequence differences between the two unique copies of the three billion DNA bases, which are interspersed with millions of inherited single nucleotide polymorphisms, hundreds of thousands of short insertions and deletions, and hundreds of spontaneous mutations.

Many methods for applying long-read strategies to single molecules that could provide this contextual information are not compatible with the processivity scale-up required to ensure accurate sequencing in clinically relevant time frames and at a clinically amenable cost. In addition, many conventional sequencing techniques are not effective in the analysis of arrays of single molecules, because the signals associated with single molecules are often not intense enough to overcome the noise inherent in such systems. A cost-effective and highly accurate sequencing technology that can read long single nucleic acid molecules is therefore desirable.
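The value of the contextual information discussed above can be sketched with a small example. A read long enough to span two heterozygous sites observes which alleles lie together on the same chromosome copy, whereas a read covering only one site cannot link them. The data, positions, and function below are all hypothetical, chosen only to illustrate this standard phasing idea:

```python
from collections import Counter

def allele_pairings(reads, het_positions):
    """Count allele pairs observed at two heterozygous sites.

    reads: list of (start_position, sequence) tuples (hypothetical data)
    het_positions: (p1, p2), two positions known to be heterozygous
    Only reads that span BOTH positions contribute linkage information.
    """
    p1, p2 = het_positions
    pairings = Counter()
    for start, seq in reads:
        end = start + len(seq)
        if start <= p1 < end and start <= p2 < end:
            pairings[(seq[p1 - start], seq[p2 - start])] += 1
    return pairings

# Hypothetical long reads spanning heterozygous sites at positions 2 and 7:
long_reads = [(0, "AACGTTGCAA"), (1, "ACGTTGCAAT"), (0, "AATGTTGTAA")]
print(allele_pairings(long_reads, (2, 7)))
# The C allele at position 2 co-occurs with C at position 7 (one copy),
# and T co-occurs with T (the other copy); short reads covering only one
# site would leave the two copies indistinguishable.
```

This is why read length, not just total data volume, determines whether the two copies of a genome can be resolved.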