Recent advances in genome technology have uncovered many variations among healthy and diseased cells, leading to breakthroughs in health care, such as new disease markers and drug targets, new tools for early diagnosis, and personalized therapeutics. Several studies using second-generation whole-genome DNA sequencing and re-sequencing have compared DNA from healthy and cancerous cells, and the results have revealed tremendous structural variation, single-nucleotide polymorphisms, and multiple new mutations acquired by tumor cells, which distinguished healthy individuals from others (1-3). Information at the RNA level is also important for understanding mechanisms in the development of disease (4,5). For example, parallel DNA/transcriptome sequencing from a single cell has recently revealed links between gene variation and expression dynamics (6). A third layer of information lies within dynamic changes to the DNA and RNA bases during a cell's life; epigenetic modifications and DNA damage have an impact on cell development, disease, and aging (7,8). These and other studies have played a major role in shaping our current vision of genomics, and have been enabled by major advances in high-throughput second-generation sequencing and by concentrated efforts and consortia such as the ENCODE project (9).
Third-generation sequencing methods, by offering long read lengths and the ability to probe epigenetic states on single, native DNA molecules, have become indispensable tools in genomics (10,11). Probing individual molecules can enable future analysis of sequence and epigenetic information in DNA and RNA, ultimately from as little as a single living cell. However, several significant challenges remain.
Single-molecule, real-time (SMRT) sequencing and nanopore-based DNA strand sequencing, by virtue of long read lengths, can resolve complex repeated elements and structural variants (12,13) that are difficult to assemble using second-generation sequencing tools. High sequence coverage and extended contig lengths facilitate gene discovery and whole-genome assembly with unprecedented quality (14,15). Recently, Pacific Biosciences released a human genome assembly (60× coverage, read length N50 of 19 kb, and contig N50 of 26.9 Mb), which was critical for filling sequence gaps and uncovering human genome structural variation (14). These assays typically require making libraries from μg amounts of DNA. Similarly, targeted sequencing studies typically require substantial amounts of DNA as starting material, and where that cannot be accommodated, amplification is used to generate enough DNA for library preparation and sequencing.
Third-generation sequencing methods enable quantitative transcriptome analysis by sequencing cDNA libraries with longer reads than second-generation methods. While a recent comparison of second- and third-generation RNA sequencing methods found comparable performance in terms of bias in gene expression levels (16), long-read third-generation sequencing allows full-length transcripts to be decoded, facilitating assembly-free isoform reconstruction (17). Recently, direct SMRT sequencing of RNA was demonstrated using a reverse transcriptase (RT), showing RNA base modification detection; however, several significant limitations were noted in this feasibility study, including slow speed, short read lengths, inability to discriminate base repeats, and insensitivity to RNA secondary structures (18).
SMRT sequencing has greatly impacted microbial epigenetics by allowing resolution of methylation patterns in adenines and cytosines, predominantly in prokaryotic DNA (19-21). Oxidative damage and other base lesions in mitochondrial DNA (mtDNA) have a profound impact on understanding disease and aging (22), and SMRT sequencing has been applied to detect mtDNA lesions in single DNA molecules (sequence variation, indels, and damaged bases) (23). Nanopores have also demonstrated methylation detection (24,25), although the discrimination accuracy is sequence-specific, and a general detection platform is not available to date. RNA epigenetic modifications are also common and are thought to play important roles (26), though less is known about them because RNA is typically converted to cDNA, in which modifications are lost. The impact of third-generation sequencing applications on understanding the role of epigenetics in mammalian diseases is therefore significantly restricted by prohibitively high input sample requirements or by chemical conversion requirements (e.g., amplification and bisulfite treatment) prior to sequencing.
A major challenge common to third-generation sequencing methods is the inefficiency with which sub-ng input libraries are sequenced. Both SMRT sequencing and nanopore sequencing rely on capture of DNA/RNA into a nanoscale detector. For SMRT sequencing, DNA/polymerase complexes need to be chemically tethered at the bottom of 100 nm diameter nanowells called zero-mode waveguides (ZMWs). Due to geometric constraints, the efficiency of DNA diffusion and binding to the ZMW base sharply decreases for DNA fragments longer than 2 kb (27). Use of magnetic beads provides an approximately 10-fold increase in loading efficiency, although this is still insufficient for sub-ng-level sequencing. Nanopore sequencing relies on threading a single-stranded tail into a 1.5 nm diameter pore, a process that is inherently improbable due to DNA entropy and the small nanopore constriction (28). The amounts of DNA required for current nanopore sequencing methods are orders of magnitude higher than the amounts in a human cell (6 pg of DNA and comparable amounts of RNA (30)). Therefore, while library preparation from sub-ng DNA is available (31), sample loss in library preparation steps and DNA loading inefficiency have called for sample amplification in both nanopore-based (32) and SMRT sequencing (27) platforms for very low-input samples. Efficient loading and sequencing of native picogram-level DNA/RNA libraries would constitute a major milestone in genomics by providing a multidimensional palette of genomic, transcriptomic, and epigenomic data from small samples, including single cells.