Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product. These products are functional RNAs, of which one major class consists of the protein-coding messenger RNAs, mRNAs, which are subsequently translated into all kinds of proteins such as enzymes, transport molecules, and others. Knowledge of the mRNA content and its processing stage in cells and tissues is important for understanding cell genesis, the development of diseases, the drug response of organisms and other biological processes.
Biological cellular processes are affected by numerous internal and external parameters. Herein the entire RNA pool, and in particular the mRNA pool (transcriptome), plays a central role. Typical mammalian cells contain between 10 and 30 pg total RNA, which corresponds to 3.6·10^5 mRNA molecules on average. Current human genome databases contain 20,769 coding gene annotations and 48,461 Genscan gene predictions. While the numbers of gene annotations and gene predictions are quite stable, the number of annotated transcripts (now 195,565 transcripts) continuously increases due to improvements in RNA analytics [Ensembl release 73, September 2013]. The main focus of many investigations is the quantification of protein-coding RNA, the mRNA or transcripts. Individual genes can express numerous different transcripts, so-called splice variants, which are characterized by differences in their exon regions and/or differences in the start and end sites of the untranslated regions, which are important for regulatory processes.
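The relation between total RNA mass and mRNA copy number can be illustrated with a back-of-the-envelope calculation. The sketch below is not taken from the cited sources. The mRNA mass fraction (2%), the mean mRNA length (2000 nt) and the average nucleotide mass (340 g/mol) are illustrative assumptions, and the function name is hypothetical.

```python
AVOGADRO = 6.022e23  # molecules per mole


def estimate_mrna_copies(total_rna_pg, mrna_mass_fraction, mean_length_nt,
                         nt_mass_g_per_mol=340.0):
    """Rough estimate of mRNA molecules per cell from the total RNA mass.

    All parameters other than the total RNA mass are assumptions made
    for illustration; they are not values stated in the text.
    """
    # Mass of the mRNA fraction in grams (1 pg = 1e-12 g).
    mrna_mass_g = total_rna_pg * 1e-12 * mrna_mass_fraction
    # Molar mass of an average mRNA molecule in g/mol.
    molar_mass = mean_length_nt * nt_mass_g_per_mol
    # Moles of mRNA multiplied by Avogadro's number gives molecules.
    return mrna_mass_g / molar_mass * AVOGADRO


if __name__ == "__main__":
    copies = estimate_mrna_copies(total_rna_pg=20.0,
                                  mrna_mass_fraction=0.02,
                                  mean_length_nt=2000)
    print(f"estimated mRNA molecules per cell: {copies:.2e}")
```

Under these assumptions a cell with 20 pg total RNA contains a few hundred thousand mRNA molecules, which is the same order of magnitude as the average quoted above.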
Different methods have been developed to measure either mRNA or gene expression levels with different degrees of accuracy.
Expressed sequence tags, EST, are short sub-sequences of cDNA and result from one-shot sequencing of a cloned cDNA. They were used in the past to identify gene transcripts. Millions of ESTs are available in public databases and provide information on the conditions in which the corresponding genes are expressed. The ESTs enable the design of probes for DNA microarrays to measure gene expression.
Classical methods for gene expression measurements, such as microarray hybridization assays, and more recent methods, such as mRNA sequencing by massively parallel sequencing or next-generation sequencing, NGS, are limited by the inherent inaccuracy of the methods. At present this inaccuracy can be compensated only to some extent by more measurements, such as deeper sequencing, which inevitably increases the costs to such an extent that analyses cannot be carried out at large sample throughput. However, accuracy of the measurements and low costs are the utmost requirements in pharmacological research and large, clinical-scale studies. Microarrays can only detect genes on the exon or sequence level for which predetermined sequence probes have been designed before the experiments. The limited number of such hybridization probes and mis-hybridization have often led to ambiguous results in high-resolution gene expression experiments. Microarrays are limited by design because they can cover only a certain number of different 3′UTRs (3′ untranslated regions) and cannot identify new 3′UTRs.
At the end of 1996 new high-throughput sequencing technologies [WO 98/44151] started to emerge and became known as next-generation sequencing, NGS, in contrast to the dideoxy method of Sanger that had been common until then. The development of new sequencing technologies made it possible to attempt the sequencing of entire transcriptomes. NGS uses miniaturized and parallelized flow cells for sequencing millions of short single or paired-end reads, between 50 and 400 bases long. Spatially separated, clonally amplified DNA templates are sequenced by synthesis in such a way that decoding occurs while individual nucleotides are added to the complementary strands. Optical scanning (Illumina systems from Illumina, Inc., US; SOLiD systems from Life Technologies, US; Roche 454 from 454 Life Sciences, Roche Diagnostics Corp., US) and the detection of tiny pH changes through arrayed microchip field effect transistors (Ion Torrent from Life Technologies, US) are used in different microfluidic platforms. The millions of short reads must either be aligned to known sequences or be assembled de novo. For RNA research, however, the situation is more complex because sequences of transcripts from individual genes overlap to a large extent. Annotations of previously found transcript variants provide frameworks to guide the subsequent transcript assembly on the basis of the discovery of individual exons, exon-exon junctions and coverage probabilities. Only the correct transcript assembly allows reads to be assigned to their parental RNA molecules and, further, the respective copy numbers to be calculated.
Independent of the NGS technology, the simultaneous determination of sequence and frequency information is one major problem in researching complex sequence mixtures. Because only its sequence determines the nature of a molecule, it seems inevitable to sequence identical molecules repeatedly, in proportion to their abundance, in order to count their copy numbers. A dynamic range of six orders of magnitude requires repetitive sequencing of millions of identical, highly abundant molecules before statistically sound values are reached for low abundant molecules. Such approaches are resource and time consuming during sequencing and subsequent data analysis. The required read depth depends heavily on the complexity of the sample [Hopper, 2010; Wendl, 2009]. Above all, one major challenge is the entanglement of aligning overlapping reads to multiple overlapping transcript annotations within individual genes. The efforts and costs in read depth and computation are enormous. Therefore, different approaches have been developed which eliminate the need for aligning overlapping reads by producing just one read per mRNA molecule. Grouping and counting such reads simplifies the mRNA and gene expression measurements [WO02/059357].
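The dependence of read depth on dynamic range can be made concrete with a simple sampling model. The following sketch assumes that each read is drawn independently from the pool and that the rarest transcript makes up a fixed fraction of all sequenceable molecules. Both are simplifying assumptions, and the function names are hypothetical.

```python
import math


def reads_for_detection(fraction, confidence=0.95):
    """Reads needed to sample a transcript at least once.

    `fraction` is the share of the transcript among all sequenceable
    molecules. Solves 1 - (1 - fraction)**N >= confidence for N under
    the assumption of independent draws.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must lie strictly between 0 and 1")
    # log1p(-fraction) is a numerically stable form of log(1 - fraction).
    return math.ceil(math.log(1.0 - confidence) / math.log1p(-fraction))


def expected_reads(fraction, depth):
    """Expected number of reads for a transcript at a given read depth."""
    return fraction * depth


if __name__ == "__main__":
    rare = 1e-6  # assumed share of the rarest transcript (six orders below 1)
    depth = reads_for_detection(rare)
    print("reads needed for one hit at 95% confidence:", depth)
    print("expected reads for the rare transcript:", expected_reads(rare, depth))
```

In this model a transcript that accounts for one millionth of the pool needs roughly three million reads to be seen even once with 95% confidence, and nearly all of those reads fall on the abundant species.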
Polyadenylation of pre-mRNA is one important step of eukaryotic gene expression and regulation. Many genes produce mRNAs with alternative polyadenylation sites, APA, and distinct 3′UTRs, which can be differently regulated or which can also encode different protein isoforms. Therefore, methods that exclusively target those APAs were developed in order to combine the simplicity of determining gene expression values from just one read per mRNA with the precise identification of polyadenylation sites.
One such method identifies polyA-sites in a genome-wide and strand-specific manner [Wilkening, 2013]. Here, libraries for NGS sequencing are prepared through: heat fragmentation of the RNA sample; solid phase reversible immobilization, SPRI, purification to stop further fragmentation through buffer exchange; reverse transcription after priming with a biotinylated and anchored polyT (V)-primer-adaptor; SPRI purification to remove all non-polyA containing fragments and to exchange the solution; RNase H treatment to degrade the RNA and to use the smaller RNA fragments as random start sequences for the second strand synthesis with DNA polymerase I, which generates the longest possible double strand because all other, inner extended priming sites are displaced through strand displacement; SPRI purification; Streptavidin affinity purification and binding, which enables the solution exchanges after each of the following 3 steps: enzymatic end repair, single dA tailing, and ligation of another adaptor; followed by an enrichment PCR and SPRI purification.
The resulting NGS libraries contain just one read per mRNA molecule, although one read per mRNA marks the theoretical maximum. In practical terms, because each of the many reaction steps of the library generation has an efficiency below 100%, the result is a distorted, and in the ideal case proportionally distorted, representation of the transcript abundances. It is important that the number of reads per transcript species is proportional to its copy number and not to its length or any other sequence-specific bias. Although labor, chemical and consumable intensive, the method is advantageous for gene expression measurements because only one read is produced from each transcript, so that RNA abundances can be quantified through simple read counting. The method continues with a particular NGS protocol which silently reads through the polyT-stretch of the primer-adaptor before the real sequencing starts. This part is termed the 3′T-fill method. In addition, expression levels of polyA-site isoforms can be detected and quantified at single-nucleotide resolution, or after merging polyA-sites in close proximity into respective clusters. Besides better quality in the read generation, the main improvement in the protocol was the introduction of said 3′T-fill, which enabled sequencing from the very end of the transcripts.
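Read counting with merging of nearby polyA-sites can be sketched as follows. This is an illustrative greedy clustering, not the procedure of the cited publication. The input layout (one tuple of chromosome, strand and 3′-end position per read), the function name and the `max_gap` merging distance are assumptions.

```python
from collections import Counter


def cluster_polya_sites(read_ends, max_gap=10):
    """Merge polyA-site positions in close proximity and count reads.

    read_ends: iterable of (chromosome, strand, position), one per read.
    max_gap:   assumed maximum distance between neighbouring sites that
               are still merged into one cluster.
    Returns a list of (chromosome, strand, start, end, read_count).
    """
    # Single-nucleotide resolution: reads per exact site.
    site_counts = Counter(read_ends)
    clusters = []
    # Sorting groups sites by chromosome and strand, then by position.
    for chrom, strand, pos in sorted(site_counts):
        n = site_counts[(chrom, strand, pos)]
        last = clusters[-1] if clusters else None
        same_locus = (last is not None and last[0] == chrom
                      and last[1] == strand and pos - last[3] <= max_gap)
        if same_locus:
            last[3] = pos   # extend the cluster to the new site
            last[4] += n    # one read per mRNA, so counts simply add up
        else:
            clusters.append([chrom, strand, pos, pos, n])
    return [tuple(c) for c in clusters]


if __name__ == "__main__":
    reads = [("chr1", "+", 100), ("chr1", "+", 100), ("chr1", "+", 104),
             ("chr1", "+", 500), ("chr1", "-", 102)]
    for cluster in cluster_polya_sites(reads):
        print(cluster)
```

Because each read stands for one mRNA molecule, the count per cluster can be used directly as the expression value of the corresponding polyA-site isoform, without normalizing for transcript length.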
Other polyA-site enrichment methods had been developed before, but without the aforementioned 3′T-fill. Because internal references of transcript variants are missing, it is hard to judge the relative quality of the methods. One simpler method is the multiplexed analysis of polyA-linked sequences, MAPS [Fox-Walsh, 2011]. Herein, a biotinylated oligo-dT (NV) containing an adaptor sequence is used to prime cDNA synthesis. Upon solid phase selection, second strand synthesis is initiated by using a random primer which is linked to another adaptor sequence. Finally, the library is released from Streptavidin-coated beads and amplified using a bar-coded primer together with a common primer. This method likewise has the ability to robustly detect gene expression. Although the read direction was originally directed towards the 3′-end of the mRNA, and only a very narrow size selection of the library would make it possible to read into the polyA-site, the exchange of the adaptor (primer) sequences and the combination with the above-described 3′T-fill method also allows the precise detection of the polyA-sites with all reads.
The method has several pitfalls. It aims to synthesize full-length cDNA, protects the ends of the cDNA with dideoxyribonucleoside triphosphate, ddNTP, before binding the cDNA to Streptavidin-bead surfaces, purifies the cDNA by these means, and primes and extends second strands with Taq DNA polymerase. Taq DNA polymerase degrades any encountered downstream strands via a 5′->3′ exonuclease activity and has been chosen to ensure that only one second strand per cDNA, the one which has been primed farthest from the polyA-site, is produced before the double-stranded product is purified through the mentioned affinity binding method. Because of the long cDNA, the NGS libraries tend to be long, which would lead to length biases in the later NGS cluster generation. Because the second strand synthesis occurs on the bead surfaces, it is hindered, in particular in the region of the interface towards the sequence of the first, biotinylated primer. The multiple purification steps, which are assisted by surface-confined reactions, introduce a series of length and sequence biases in the generation of authentic polyA-site reads.
Another deep-sequencing-based method is quantitative polyA-site sequencing, PAS-sequencing [Shepard, 2011]. This method starts with a fragmentation step to generate RNA fragments of the desired size range. Again, the first adapter sequence is part of an anchored oligo-dT (NV) primer. This method takes advantage of the terminal transferase activity of reverse transcriptases. Upon reaching the 5′-end of the mRNA fragment, the MMLV-V reverse transcriptase adds a few untemplated deoxycytidines to the 3′-ends of the cDNA. Those ends hybridize with a second adapter which contains a triple G sequence. The reverse transcriptase continues by switching the template and synthesizing a copy of the mRNA fragment which is now extended by both adapter sequences.
Major drawbacks of this very simple method are its low efficiency of only 1-10% and the bias and inaccuracy of the template switch. Low efficiency will result in losses of low abundant transcripts. Template switching is not exclusively coupled to the template switch primer, and artificial fusion transcripts may be generated by switching to different RNA templates. Also, the template switch primer has to be provided in large excess, making a purification step before the subsequent library amplification essential.
Another polyA-seq method has been described by Derti et al. [2012]. The protocol employs first strand synthesis with anchored polyT-primers containing the first adaptor sequence, RNase H treatment to digest the RNA before priming with a random primer which contains the second adaptor sequence, and Klenow extension for the second strand synthesis. Although the Klenow fragment of DNA polymerase I lacks 5′->3′ exonuclease activity, it has persistent strand displacement activity. Therefore, each first strand cDNA can generate several randomly primed second strands. An unambiguous, bijective correlation between mRNA abundance and read count is therefore not ensured.
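Why several second strands per cDNA break the one-to-one relation between copy number and read count can be illustrated with a toy model. The sketch below assumes that, with strand displacement, the expected number of second strands per cDNA grows linearly with template length. The linear model, the `priming_rate_per_nt` parameter and the function name are illustrative assumptions, not measured properties of any cited protocol.

```python
def expected_read_counts(transcripts, priming_rate_per_nt=None):
    """Expected reads per transcript species under two toy models.

    transcripts:          dict of name -> (copy_number, length_nt).
    priming_rate_per_nt:  None models exactly one read per molecule;
                          a number models displaced second strands whose
                          expected count per cDNA is rate * length
                          (assumed linear model).
    """
    counts = {}
    for name, (copies, length_nt) in transcripts.items():
        if priming_rate_per_nt is None:
            reads_per_molecule = 1.0
        else:
            reads_per_molecule = priming_rate_per_nt * length_nt
        counts[name] = copies * reads_per_molecule
    return counts


if __name__ == "__main__":
    # Two hypothetical transcripts with equal copy number, unequal length.
    pool = {"short": (100, 1000), "long": (100, 4000)}
    print("one read per mRNA:  ", expected_read_counts(pool))
    print("strand displacement:", expected_read_counts(pool, 0.002))
```

With one read per molecule both species receive equal counts. In the displacement model the longer species receives four times as many reads despite identical copy numbers, so read counts no longer map unambiguously onto abundances.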
U.S. Pat. No. 6,406,891 B1 relates to a method for generating full-length cDNA, comprising cycling back and forth between a processive RT and a thermostable RT enzyme during first strand synthesis.
EP 1371726 A1 relates to a first and second strand synthesis method. Bead-immobilized primers are used for first strand synthesis and random hexamers for second strand synthesis. Second strand synthesis is carried out with a mixture containing Klenow fragment, which has strand displacement activity.
Costa et al. [2010] relate to transcriptome studies using RNA-seq.
Mainul Hoque et al. [2012] relate to the analysis of alternative cleavage and polyadenylation by 3′ region extraction and deep sequencing.
For gene expression counting, there is a need for reliable, efficient, simple and cost-effective methods to produce NGS library amplicons which possess a bijective correlation between mRNA abundance and read count.