The transcriptome is the complete set of transcripts in a cell, together with their quantities, at a specific developmental stage or under a specific physiological condition. The transcriptome is central to organismal function, development, and disease. The very nature and vitality of an organism arise from all of its transcripts (including mRNAs, non-coding RNAs, and small RNAs), their expression levels, splicing patterns, and post-transcriptional modifications. In fact, a much greater fraction of the human genome than was expected is now known to be transcribed. See Bertone, et al., 2004, Global identification of human transcribed sequences with genome tiling arrays, Science 306:2242-2246.
Methods for transcriptome analysis based on next-generation sequencing (NGS) technologies have been reported. See, e.g., Wang, et al., 2009, RNA-Seq: a revolutionary tool for transcriptomics, Nat Rev Genet 10(1):57-63. The RNA-Seq (RNA sequencing) method promises to rapidly generate high volumes of transcriptome data.
However, RNA-Seq faces informatics challenges, such as storing and analyzing large amounts of data, which must be overcome to make good use of the reads. Aligning and analyzing RNA-Seq reads is difficult because of both the nature and the volume of the information involved. Read mapping with large data sets is computationally challenging, and analytical methods for differential expression are only beginning to emerge. See Garber, et al., 2011, Computational methods for transcriptome annotation and quantification using RNA-Seq, Nat Meth 8(6):469-477. The computational demands of mapping the large number of reads from RNA-Seq are greatly multiplied by the large amount of reference data that is becoming available.
For example, the GENCODE Consortium seeks to identify all gene features in the human genome that have been reported. See Harrow, et al., 2012, GENCODE: The reference human genome annotation for The ENCODE Project, Genome Res 22:1760-1774. The volume of material encompassed by the GENCODE project is formidable. Not only do new protein-coding loci continue to be added, but the number of annotated alternatively spliced transcripts also steadily increases. The GENCODE 7 release includes more than 20,000 protein-coding loci and almost 10,000 long noncoding RNA (lncRNA) loci, as well as more than 33,000 coding transcripts not represented in other sources. GENCODE also includes other features such as untranslated regions (UTRs), long intergenic noncoding RNA (lincRNA) genes, short noncoding RNAs, and alternative splice patterns. Even if an RNA-Seq study starts with only a limited amount of new data, the volume of potential reference data from a source such as GENCODE is so large that making sense of the new data is computationally challenging.