Next-generation sequencing (NGS) is currently the most comprehensive analysis method. NGS is a generic term for high-throughput DNA sequencing methods that sequence in parallel, typically by polymerization. NGS reads up to many millions of fragments, each typically between 10 and several hundred base pairs long. The complete sequence is obtained by aligning those reads, which is challenging because of the sheer number of short reads that have to be assembled into a complete sequence. Some NGS methods rely on a consensus blueprint held in genomic and/or transcriptomic databases. The quality of the results depends on the length and number of reads, the reading accuracy, the quality of the information in the reference database and the bioinformatics algorithms applied. To date, many reads provide only limited information. For instance, many reads cannot be assigned uniquely and are therefore discarded.
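The loss of reads that cannot be assigned uniquely can be illustrated with a minimal sketch. It assumes exact string matching against a short reference, which is a deliberate simplification; real aligners tolerate mismatches and use indexed data structures. The function names and the toy sequences are hypothetical.

```python
def find_positions(reference: str, read: str) -> list[int]:
    """Return every 0-based start position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Continue one base further so overlapping matches are also found.
        start = reference.find(read, start + 1)
    return positions


def classify_reads(reference: str, reads: list[str]):
    """Split reads into uniquely placed, ambiguous (multi-hit) and unmapped reads."""
    unique, ambiguous, unmapped = {}, [], []
    for read in reads:
        hits = find_positions(reference, read)
        if len(hits) == 1:
            unique[read] = hits[0]   # exactly one placement: keep it
        elif hits:
            ambiguous.append(read)   # several placements: typically discarded
        else:
            unmapped.append(read)    # no placement in this reference
    return unique, ambiguous, unmapped


# Toy data (hypothetical): "ACGT" occurs twice in the reference.
reference = "ACGTACGTTTGA"
unique, ambiguous, unmapped = classify_reads(reference, ["ACGT", "TTGA", "GGGG"])
```

In this toy case the short read that occurs twice cannot be placed, which is the effect that grows with sample complexity and shrinks with read length.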
In more detail, most NGS approaches must amplify individual DNA molecules to generate detectable signals. Emulsion polymerase chain reaction (PCR) isolates individual DNA molecules using primer-coated beads in aqueous bubbles within an oil phase. Singularizing DNA molecules, e.g. by rigorous dilution, is another option. Another method for in vitro clonal amplification is bridge PCR, where fragments are amplified upon primers attached to a solid surface. Alternatively, the amplification step can be skipped and DNA molecules fixed directly to a surface. Such DNA molecules, or the above-mentioned DNA-coated beads, are immobilized on a surface and sequenced in parallel. Sequencing by synthesis, like the “old style” dye-termination electrophoretic sequencing, uses a DNA polymerase to determine the base sequence. Reversible terminator methods use reversible versions of dye-terminators, adding one nucleotide at a time and detecting fluorescence at each position; the blocking group is then removed to allow polymerization of the next nucleotide. Pyrosequencing also uses DNA polymerization, adding one nucleotide species at a time and detecting and quantifying the number of nucleotides added at a given location through the light emitted upon release of the attached pyrophosphates. The sequencing by ligation method uses a DNA ligase to determine the target sequence. Used in the polony method and in the SOLiD® technology, it employs a partition of all possible oligonucleotides of a fixed length, labeled according to the sequenced position. Oligonucleotides are annealed and ligated. The preferential ligation by DNA ligase for matching sequences results in a dinucleotide-encoded colour space signal at that position.
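The dinucleotide-encoded colour space can be sketched as follows. The text above does not specify the encoding table; the sketch assumes the commonly described four-colour scheme in which each base is given a 2-bit code (A=0, C=1, G=2, T=3) and the colour of a dinucleotide is the XOR of its two base codes. Function names are hypothetical.

```python
# Assumed 2-bit base codes; the colour of a dinucleotide is the XOR of both codes.
BASE_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_BASE = "ACGT"


def encode_color_space(sequence: str) -> list[int]:
    """Translate a base sequence into one colour (0-3) per overlapping dinucleotide."""
    return [BASE_CODE[a] ^ BASE_CODE[b] for a, b in zip(sequence, sequence[1:])]


def decode_color_space(first_base: str, colors: list[int]) -> str:
    """Rebuild the base sequence from a known first base and the colour calls."""
    code = BASE_CODE[first_base]
    bases = [first_base]
    for color in colors:
        code ^= color  # XOR with the colour yields the code of the next base
        bases.append(CODE_BASE[code])
    return "".join(bases)


colors = encode_color_space("ATGGC")       # one colour per adjacent base pair
decoded = decode_color_space("A", colors)  # decoding needs the first base
```

Because each colour depends on two adjacent bases, decoding requires one known base, and a single wrong colour call alters every base decoded after it.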
NGS technologies are essentially based on random amplification of input DNA fragments. This simplifies preparation, but the sequencing remains undirected. The sheer complexity of the complete sample information obtained simultaneously is the key hindrance to unambiguous alignment of the reads. Therefore, complexity reduction is essential for increasing the quality of the results.
The classical route for genomic complexity reduction, employed inter alia during the Human Genome Project, is to create BAC (bacterial artificial chromosome) clones prior to sequencing. Distinct stretches of genomic DNA are cloned into bacterial host cells, amplified, extracted and used as templates for Sanger sequencing. Production, maintenance and verification of large BAC libraries are laborious processes associated with appreciable costs. Due to these impracticalities and the incompatibility with existing NGS platforms, it is generally sought to avoid bacterial cloning.
Another option to reduce complexity is to first select polynucleic acids based on their respective sizes. Different approaches include, but are not limited to, agarose gel electrophoresis or size exclusion chromatography for fractionation.
Small RNA sequencing approaches employ this method in order to obtain, e.g., the fraction of RNA molecules called microRNA (miRNA), which are between 15 and 30 nucleotides long.
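An in silico analogue of such size selection is a simple length filter. The sketch below is illustrative only; the function name is hypothetical, and the default window merely mirrors the 15 to 30 nucleotide range mentioned above.

```python
def select_by_length(sequences: list[str], min_len: int = 15, max_len: int = 30) -> list[str]:
    """Keep only sequences whose length lies within [min_len, max_len], inclusive."""
    return [seq for seq in sequences if min_len <= len(seq) <= max_len]


# Toy molecules (hypothetical) of 10, 20, 30 and 31 nucleotides.
sample = ["A" * 10, "ACGU" * 5, "G" * 30, "C" * 31]
small_rna_fraction = select_by_length(sample)
```

The filter selects on size alone and carries no sequence information, which is why size fractionation does not depend on prior knowledge of the sample.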
Probably the most straightforward approach to complexity reduction is to limit the input nucleic acid sample to the genomic DNA of a single cell. Single-cell sequencing approaches rely on amplification reactions from highly dilute solutions, cannot actually reduce the complexity inherent to the cellular content, and are based solely on a selection of the input cells.
A different method, which reduces the amount of input nucleic acid to below the amount contained within a single cell, is sometimes termed limited dilution. A genomic nucleic acid sample is sheared and diluted to an extent where the spatial distribution of the nucleic acid fragments within the sample volume becomes significant. Subpools are then created by taking volumes from the total sample that are so small that most subpools contain no nucleic acid, a few subpools contain one nucleic acid each and even fewer subpools contain two nucleic acids. This leads to a singularization of nucleic acids and therefore to complexity reduction compared to the full-length genome, as each singularized nucleic acid is a fragment of a genome. An increased sequence assembly efficiency is therefore gained for the subpools containing individual nucleic acid fragments, and assembly and scaffold building for large genomes are thereby facilitated. In transcription analysis, such a limited dilution approach will not reduce the complexity introduced through variations in the expression of different genes, as each transcript molecule will occupy one subpool; as many subpools as there are molecules in the sample are therefore needed to display the entire transcriptome of a sample.
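The occupancy pattern described for limited dilution (mostly empty subpools, a few with one fragment, fewer still with two) can be illustrated under the assumption that fragments distribute independently and uniformly over the subpools, so that the number of fragments per subpool follows a Poisson distribution. This model and the mean of 0.1 fragments per subpool are assumptions made for illustration; neither is stated in the text.

```python
import math


def poisson_pmf(k: int, mean: float) -> float:
    """Probability of observing exactly k fragments in a subpool (Poisson model)."""
    return math.exp(-mean) * mean ** k / math.factorial(k)


def subpool_occupancy(mean_fragments_per_subpool: float) -> dict[str, float]:
    """Expected fraction of subpools holding zero, one, two or more than two fragments."""
    p0 = poisson_pmf(0, mean_fragments_per_subpool)
    p1 = poisson_pmf(1, mean_fragments_per_subpool)
    p2 = poisson_pmf(2, mean_fragments_per_subpool)
    return {"empty": p0, "one": p1, "two": p2, "more": 1.0 - p0 - p1 - p2}


# Hypothetical dilution: on average 0.1 fragments per subpool.
occupancy = subpool_occupancy(0.1)
```

Under these assumptions roughly nine in ten subpools are empty, and subpools with a single fragment outnumber those with two by about twenty to one. This also shows the cost of singularization: most subpools carry no template at all.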
A further option is to reject RNA sequence-specifically, e.g. in a hybridization-based approach that removes ribosomal RNA from the entire RNA sample. As opposed to other fractionation methods that either rely on prior sequence information or are directed towards a certain RNA fraction (e.g. polyA selection), removal of rRNA does not bias the sequencing sample. However, the mere removal of ribosomal RNA is restricted to RNA samples and cannot be scaled in terms of complexity reduction.
The duplex-specific nuclease (DSN) method can be used for selectively removing double-stranded DNA from the sample solution. This is achieved by letting the single-stranded sample interact with excess driver DNA. Driver DNA is made up of sequences designed to remove their targets from the original sample. Upon interaction, duplexes are formed and degraded by DSN, and the remaining sample may be used for subsequent sequencing. Normalization of sample concentrations may be achieved by amplification using “partial PCR suppression”. This method is not “hypothesis neutral”, as it requires the preparation of PCR fragments as driver DNA and therefore prior sequence information.
It is also possible to employ sequence-specific selection methods, e.g. targeted sequencing of genomic regions such as particular exons using capture arrays. The idea behind such capture arrays is to insert a selection step prior to sequencing. The arrays are programmed to capture only the genomic regions of interest, thus enabling users to utilize the full capacity of the NGS machines for sequencing those specific regions. Low-density, on-array capture hybridization is used for sequencing approaches. Such technology is not hypothesis neutral, as specific sequence information is required for the selection process.
A similar positive selection can be used for targeted resequencing. For example, biotinylated RNA strands of high specificity for their complementary genomic targets can be used to extract DNA fragments for subsequent amplification and sequence determination. This form of complexity reduction is necessarily based on available sequence information and is therefore not hypothesis neutral.
Sequencing of 16S rDNA or 16S rRNA sequences from mixed samples of microorganisms is employed, inter alia, for the detection of rare species within these samples. By restricting the sequencing approach to a specific signature of microorganisms, both complexity and information content are reduced. Frequently, only phylogenetic information is obtained.
Tag-based identification of transcripts includes SAGE (Serial Analysis of Gene Expression), wherein sequence tags of defined length are extracted and sequenced. Since the initial creation of tag concatemers is a disadvantage for NGS, derived protocols that omit this step are used.
A related method is CAGE (Cap Analysis of Gene Expression). CAGE is intended to yield information on the 5′ ends of transcripts and therefore on their respective transcription start sites. 5′ cap carrying RNA molecules are selected before end-tags are extracted and sequenced.
Although only defined parts of the transcriptome are extracted for analysis, SAGE and CAGE have their limitations because they do not allow for comprehensive segregation.
Several methods for interaction-specific enrichment of the genome exist. ChIP-Seq® is one of several approaches to extract sequences based on their respective affinities to specific proteins (frequently transcription factors). The associated DNA is immuno-precipitated, purified and sequenced. Only a very limited number of questions are amenable to this approach.
Amplification-driven selection methods (like PCR and isothermal amplification) rely on the specific interaction of DNA oligonucleotides with their respective target DNA. For example, bioinformatics-selected hexamers that serve as primers can be used for competitive amplification procedures. Such an approach neither covers the full genome nor is scalable in terms of complexity reduction.
Another possibility is selective amplification of a subset of genomic DNA using a circularization approach. In this case, a construct is used that includes a general primer pair motif flanked by two target-specific ends. After hybridization to the single-stranded target sequence and ligation, the selected polynucleotide can be amplified using a single primer. Molecular Inversion Probe Capture (derived from what were initially termed “Padlock Probes”) is used to select subsets of genomic DNA. This approach is not hypothesis neutral and is limited in scalability.
Hypothesis-neutral preparations of genomes that reduce the complexity of the sample have been disclosed in WO 2006/137734 and are based on AFLP technology (EP 0534858). To cover the whole genome, a multitude of restriction enzymes must be used. This is laborious, introduces redundancy and still covers the genome only statistically, as the pool of restriction fragments may or may not be completely sequenced due to the variability in restriction site distribution.
WO 2007/073171 A2 relates to a method of sequencing cDNA comprising a complexity reduction step by fragmenting cDNA by controlled endonuclease restriction enzymes. Thus, this method is dependent on the presence of proper endonuclease restriction sites in the cDNA sequence and always yields the same fragments for a given cDNA.
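Why restriction-based fragmentation always yields the same fragments for a given sequence can be shown with a small in silico digest. This is an illustrative sketch and not the method of the cited documents; the function name and the toy sequence are hypothetical, and the default site is that of EcoRI (GAATTC, cutting after the first G), chosen only as a familiar example.

```python
def digest(sequence: str, site: str = "GAATTC", cut_offset: int = 1) -> list[str]:
    """Cut `sequence` at every occurrence of `site`, `cut_offset` bases into the site."""
    fragments = []
    start = 0
    pos = sequence.find(site)
    while pos != -1:
        cut = pos + cut_offset
        fragments.append(sequence[start:cut])  # fragment up to the cut position
        start = cut
        pos = sequence.find(site, pos + 1)     # look for the next recognition site
    fragments.append(sequence[start:])         # remainder after the last cut
    return fragments


# Toy sequence (hypothetical) with two recognition sites.
toy_cdna = "AAGAATTCCCGAATTCTT"
fragments = digest(toy_cdna)
uncut = digest("ACGTACGT")  # no recognition site: the sequence stays intact
```

The cut positions are fixed by the sequence itself, so repeated digests give identical fragments, and a sequence lacking the recognition site is not fragmented at all.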
WO 2009/073629 A2 describes a shotgun sequencing method to reduce redundancy in high genome coverage. Nucleic acids are fragmented mechanically or by ultrasound to produce a first shotgun library. The fragments of the first shotgun library are sequenced and the sequence reads are assembled. In a second step, target-specific oligonucleotides are synthesized, specific for regions of interest such as locations of single nucleotide polymorphisms, and complexed with the target nucleic acids.
WO 2008/093098 A2 relates to a method for sequencing nucleic acids of at least two samples comprising randomly fragmenting the nucleic acids, ligating universal adaptors to the fragments and amplifying all nucleic acids for sequencing.
WO 2009/116863 A2 describes a method for identifying genomic DNA comprising the steps of generating a cDNA, an optional complexity reduction step, fragmenting the cDNA, optional size selection of the fragments, adaptor ligation, further size and fragment selection steps, and binding to beads, among many further mandatory steps. This method is labor-intensive, and a simplification of the complexity reduction for specific uses would be beneficial.
Therefore, there is a need for methods that provide defined fractions of a nucleic acid sample and provide means to improve sequencing processes, in particular to improve the assembly of sequences and the detection of rare nucleic acids, e.g. in pools stemming from many organisms or from genomes present at high concentrations, which reduce the chance of obtaining sequences of rare nucleic acids.