Macromolecules are long polymer chains composed of many chemical units bonded to one another. Polynucleotides are a class of macromolecules that include, for example, DNA and RNA. Polynucleotides are composed of long sequences of nucleotides.
The sequence of nucleotides is directly related to the genomic and post-genomic gene expression information of the organism. Direct sequencing and mapping of sequence regions, motifs, and functional units such as open reading frames (ORFs), untranslated regions (UTRs), exons, introns, protein factor binding sites, epigenomic sites such as CpG clusters, microRNA sites, small interfering RNA (siRNA) sites, large intervening non-coding RNA (lincRNA) sites, and other functional units are all important in assessing the genomic composition of individuals.
In many cases, complex rearrangements of the nucleotide sequence, such as insertions, deletions, inversions and translocations, during an individual's life span lead to disease states such as genetic abnormalities or cell malignancy. In other cases, sequence differences such as Copy Number Variations (CNVs) among individuals reflect the diversity of the genetic makeup of the population and their differential responses to environmental stimuli and signals such as drug treatments. In still other cases, processes such as DNA methylation, histone modification, chromatin folding or other changes that modify DNA or DNA-protein interactions influence gene regulation, gene expression and ultimately cellular function, resulting in diseases such as cancer.
It has been found that genomic structural variations (SVs) are much more widespread than previously thought, even among healthy individuals. The importance of understanding genome sequence together with structural variation information for human health and common genetic disease has thus become increasingly apparent.
Functional units and common structural variations are thought to span from tens of bases to more than a megabase. Accordingly, a direct, inexpensive and yet flexible method of revealing sequence information and SVs across the resolution scale from sub-kilobase to megabase along large native genomic molecules is highly desirable for sequencing and fine-scale mapping projects covering more individuals, in order to catalog previously uncharacterized genomic features.
Furthermore, phenotypical polymorphisms or disease states of biological systems, particularly in multiploid organisms such as humans, are a consequence of the interplay between the two haploid genomes inherited from the maternal and paternal lineages. Cancer, in particular, is often the result of the loss of heterozygosity among diploid chromosomal lesions.
Conventional cytogenetic methods such as karyotyping and FISH (Fluorescent in situ Hybridization) provide a global view of the genomic composition in as few as a single cell. They are effective in revealing gross changes of the genome such as aneuploidy, or the gain, loss or rearrangement of large fragments of thousands to millions of base pairs. These methods, however, suffer from relatively low sensitivity and resolution in detecting medium to small sequence motifs or lesions. The methods are also laborious, which limits speed and consistency.
More recent methods for detecting sequence regions, sequence motifs of interest and SVs, such as aCGH (array Comparative Genomic Hybridization), fiberFISH or massive paired-end sequencing, have improved resolution and throughput. These methods are nonetheless indirect, laborious and expensive, and they rely on existing reference databases. Further, the methods may have limited, fixed resolution, and they provide either inferred positional information that relies on mapping back to a reference genome for reassembly, or comparative intensity ratio information. Such methods are thus unable to reveal balanced lesion events such as inversions or translocations.
Current sequencing analysis approaches are limited by available technology and are largely based on samples derived from averaged multiploid genomic material with very limited haplotype information. The front-end sample preparation methods currently employed to extract the mixed diploid genomic material from a heterogeneous cell population effectively shred the material into smaller pieces, which destroys the crucially important native structural information of the diploid genome.
Even the more recently developed second-generation methods, though having improved throughput, further complicate the delineation of complex genomic information because of more difficult assembly from much shorter sequencing reads.
In general, short reads are more difficult to align uniquely within complex genomes, and additional sequence information is needed to decipher the linear order of the short target regions.
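The effect of read length on unique alignment can be illustrated with a minimal sketch. The function name `unique_read_fraction` and the toy sequence below are hypothetical and chosen only for illustration; real genomes and real aligners (which tolerate mismatches and use both strands) are considerably more involved. The sketch simply counts, for a given read length, the fraction of start positions whose exact substring occurs only once in the sequence.

```python
from collections import Counter


def unique_read_fraction(genome: str, read_length: int) -> float:
    """Fraction of start positions whose read of the given length occurs exactly once.

    Illustrative only: exact matching on one strand, no sequencing errors.
    """
    n_positions = len(genome) - read_length + 1
    if read_length <= 0 or n_positions <= 0:
        raise ValueError("read_length must be between 1 and len(genome)")
    reads = [genome[i:i + read_length] for i in range(n_positions)]
    counts = Counter(reads)
    unique = sum(1 for read in reads if counts[read] == 1)
    return unique / n_positions


if __name__ == "__main__":
    # Hypothetical toy sequence: one repeated element flanked by distinct spacers.
    repeat = "ACGTACGTGG"
    toy_genome = "TTGCA" + repeat + "CCATG" + repeat + "GATTC"
    for length in (3, 6, 12):
        fraction = unique_read_fraction(toy_genome, length)
        print(f"read length {length:2d}: {fraction:.2f} of positions map uniquely")
```

Reads that fall entirely within a repeated element cannot be placed uniquely; only reads long enough to span the repeat and reach flanking sequence resolve its position.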
Sequencing coverage on the order of 25-fold is needed to reach an assembly confidence similar to that obtained with the 8-10 fold coverage of conventional BAC and so-called shotgun Sanger sequencing (Wendl M C, Wilson R K, Aspects of coverage in medical DNA sequencing, BMC Bioinformatics, 16 May 2008; 9:239). This multi-fold increase in sequencing coverage imposes high costs, effectively defeating the overarching goal in the field of reducing sequencing cost below the $1,000 mark.
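The scale of the coverage figures above can be made concrete with simple arithmetic. The sketch below is illustrative only: the genome size of roughly 3.1 billion bases is an assumption not stated in this section, the function names are hypothetical, and raw base counts are used as a stand-in for sequencing effort without any claim about actual per-base cost.

```python
# Assumption for illustration: approximate haploid human genome size in bases.
GENOME_SIZE_BP = 3.1e9


def bases_required(coverage_fold: float, genome_size_bp: float = GENOME_SIZE_BP) -> float:
    """Total sequenced bases needed to reach the given fold coverage."""
    return coverage_fold * genome_size_bp


def coverage_ratio(short_read_fold: float, sanger_fold: float) -> float:
    """How many times more raw sequence the short-read approach requires."""
    return short_read_fold / sanger_fold


if __name__ == "__main__":
    short_read_fold = 25            # figure quoted in the text
    sanger_low, sanger_high = 8, 10  # range quoted in the text

    print(f"short-read bases: {bases_required(short_read_fold):.3g}")
    print(f"Sanger bases:     {bases_required(sanger_low):.3g} to "
          f"{bases_required(sanger_high):.3g}")
    print(f"ratio:            {coverage_ratio(short_read_fold, sanger_high):.2f}x to "
          f"{coverage_ratio(short_read_fold, sanger_low):.2f}x")
```

Under these assumptions, the short-read approach requires roughly 2.5 to 3 times as much raw sequence as the conventional approach for comparable assembly confidence.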
Single-molecule analysis of large, intact genomic molecules thus offers the possibility of preserving accurate native genomic structures by fine-mapping sequence motifs in situ, without cloning or amplification. The larger the genomic fragments, the less complex the population of molecules in a genomic sample. In an ideal scenario, for example, only 46 chromosome-length fragments would need to be analyzed at the single-molecule level to cover the entire normal diploid human genome, and the sequence derived from such an approach would by nature retain intact haplotype information. Further, megabase-scale genomic fragments can be extracted from cells and preserved for direct analysis, which dramatically reduces the burden of complex algorithms and assembly and also relates genomic and/or epigenomic information in its original context more directly to individual cellular phenotypes.
In addition to genomics, the field of epigenomics has been increasingly recognized in the past 20 years or so as being of singular importance for its roles in human diseases such as cancer. With the accumulation of knowledge in both genomics and epigenomics, a major challenge is to understand how genomic and epigenomic factors correlate, directly or indirectly, in giving rise to polymorphisms or pathophysiological conditions in human diseases and malignancies. The concept of whole genome analysis has evolved from a compartmentalized approach, in which genomic sequencing, epigenetic methylation analysis and functional genomics were studied largely in isolation, to an increasingly multi-faceted, holistic approach. DNA sequencing, structural variation mapping, CpG island methylation patterns, histone modifications, nucleosomal remodeling, microRNA function and transcription profiling are increasingly examined together in a systematic way. However, the technologies that examine each of these aspects of the molecular state of the cell are often isolated, tedious and mutually incompatible, which severely hinders holistic analysis based on coherent experimental data.
Accordingly, there is a need in the art for methods and devices that enable single molecule level analysis of large intact native biological samples so as to enable determination of genomic and epigenomic information of a target sample. Such methods and devices would provide a very powerful tool to researchers and clinicians alike.