The transcriptome is the complete set of transcripts in a cell, and their quantity, for a specific developmental stage or physiological condition. The transcriptome is central to organismal function, development, and disease. The very nature and vitality of an organism arise from all of its transcripts (including mRNAs, non-coding RNAs, and small RNAs), their expression levels, gene splicing patterns, and post-transcriptional modifications. In fact, a much greater fraction of the human genome than was expected is now known to be transcribed. See Bertone, et al., 2004, Global identification of human transcribed sequences with genome tiling arrays, Science 306:2242-2246.
Methods for transcriptome analysis based on next-generation sequencing (NGS) technologies have been reported. See, e.g., Wang, et al., 2009, RNA-Seq: a revolutionary tool for transcriptomics, Nat Rev Genet 10(1):57-63. The RNA-Seq (RNA sequencing) method promises to rapidly generate high volumes of transcriptome data.
However, RNA-Seq faces informatics challenges, such as storing and analyzing large amounts of data, which must be overcome to make good use of the reads. The accepted approach to analyzing RNA-Seq reads involves mapping the short reads to a reference genome, or assembling them into contigs before aligning, using such programs as ELAND, SOAP, MAQ, and RMAP. Unfortunately, short transcriptomic reads present challenges that this approach is not suited to handle, for example, reads that span exon junctions or that contain poly(A) tails. Additionally, with larger transcriptomes, many RNA-Seq reads will match multiple locations in the genome. Further, reads with multiple base mismatches relative to a reference are difficult to align. Aligning and analyzing RNA-Seq reads thus presents problems not only in the nature of the information involved but also in its volume.
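The two alignment difficulties described above, junction-spanning reads and multiply matching reads, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the toy sequences, the hypothetical helper `find_all`, and the use of exact substring search in place of a real aligner. It does not represent how ELAND, SOAP, MAQ, or RMAP operate.

```python
from typing import List


def find_all(reference: str, read: str) -> List[int]:
    """Return every 0-based position where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        # Step by one so that overlapping matches are also reported.
        start = reference.find(read, start + 1)
    return positions


# Hypothetical toy sequences (not real genomic data).
EXON1 = "ATGGCCTTAG"
INTRON = "GTAAGTCCCCCCTTTCAG"
EXON2 = "CATTGACCGA"
REPEAT = "TTAGGGTTAGGG"

# Genome: exon 1, intron, exon 2, then two copies of a repeated element.
GENOME = EXON1 + INTRON + EXON2 + "ACGT" + REPEAT + "ACGT" + REPEAT

# Mature transcript: the intron has been spliced out.
TRANSCRIPT = EXON1 + EXON2

# A read drawn from the transcript that spans the exon 1 / exon 2 junction.
junction_read = EXON1[-5:] + EXON2[:5]

# A read drawn from the repeated element.
repeat_read = REPEAT

# The junction read is contiguous in the transcript but not in the genome,
# so a contiguous genome search finds no placement for it.
print("junction read vs genome:    ", find_all(GENOME, junction_read))
print("junction read vs transcript:", find_all(TRANSCRIPT, junction_read))

# The repeat read matches more than one genomic location, so its origin
# is ambiguous.
print("repeat read vs genome:      ", find_all(GENOME, repeat_read))
```

In this sketch the junction read has no contiguous match in the genome even though it is a valid fragment of the transcript, and the repeat read has more than one equally good placement. Real aligners tolerate mismatches and use indexed references, but they face the same two underlying problems.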
Some methods for read mapping, transcriptome reconstruction (i.e., identifying expressed genes and isoforms), and expression quantification (analysis of differential expression across samples) have been reported. See Garber, et al., 2011, Computational methods for transcriptome annotation and quantification using RNA-Seq, Nat Meth 8(6):469-477. However, due to the nature and volume of RNA-Seq data generation, existing methods may prove inadequate in many cases. Read mapping with large data sets is computationally challenging, and analytical methods for differential expression are only beginning to emerge. The computational demands of mapping the large number of reads from RNA-Seq are greatly multiplied by the large amount of reference data that is becoming available.
For example, the GENCODE Consortium seeks to identify all gene features in the human genome that have been reported. Harrow, et al., 2012, GENCODE: The reference human genome annotation for The ENCODE Project, Genome Res 22:1760-1774. The volume of material encompassed by the GENCODE project is formidable. Not only do new protein-coding loci continue to be added, but the number of annotated alternative splicing transcripts also steadily increases. The GENCODE 7 release includes more than 20,000 protein-coding loci and almost 10,000 long noncoding RNA (lncRNA) loci, as well as more than 33,000 coding transcripts not represented in other sources. GENCODE also includes other features such as untranslated regions (UTRs), long intergenic noncoding RNA (lincRNA) genes, short noncoding RNAs, and alternative splice patterns. Even with a resource like GENCODE, methods like RNA-Seq are revealing that differential transcript expression by cell type, tissue type, and developmental stage has yet to be fully understood, and also that many features, such as novel exons, have yet to be documented in the human genome.