The recent advent of high-throughput sequencing has allowed a detailed profiling of the eukaryotic transcriptome in a genome-wide manner and, over the past few years, next-generation sequencing (NGS) has quickly replaced microarrays for the genome-wide analysis and quantification of RNA samples. In particular, NGS of RNA (“RNA-seq”) has played a central role in defining transcriptional units and evaluating their relative abundance.
For any type of quantification to be accurate, the library to be sequenced must faithfully reflect the starting pool. This fidelity, however, is especially challenging to achieve when working with RNA; to make a deep-sequencing library, all of the RNAs present must be captured and then accurately and efficiently reverse transcribed and amplified into dsDNA.
Eukaryotic mRNA transcripts, though, represent only about 5% of the total RNAs found within a cell; the rest are non-coding RNAs, the most abundant of which are ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs). As a consequence, RNA samples must be selectively depleted of non-coding RNAs before being prepared for sequencing, and this depletion can be incomplete or can introduce bias, the magnitude and type of which are variable. Significantly, because the bias introduced by each method is unique to that method, only libraries prepared in the same way are comparable; directly comparing libraries prepared using different methods can lead to inaccurate conclusions.
Further, while RNA sequencing is able to determine whether a particular genomic locus is transcribed, the resulting information often lacks context. That is, because current deep-sequencing platforms cannot sequence beyond a few hundred base pairs, the sample RNAs must be fragmented, which results in a loss of important information such as the 5′ and 3′ end sequences or the arrangement of exonic sequences. Unfortunately, the methods that have been developed to address these problems have, in turn, their own limitations and biases.
Therefore, methods for generating cDNA libraries are provided herein that are effective for all types of RNAs and introduce minimal bias. In addition, methods that allow for reliably mapping the 5′ and 3′ ends of transcripts as well as mapping, to a single nucleotide, the length of the poly(A) tail are provided herein. The methods described herein do not possess the limitations and biases of current methods.