An individual gene can often give rise to different proteins in different cells or stages of differentiation, including cells not normally encountered in the life cycle of the organism (e.g., cancer cells; cells in culture; cells in developmental neuro-anatomical anomalies). The different proteins arise from differential patterns of transcription activation and post-transcriptional RNA processing of the messenger RNA (mRNA) that specifies the protein in the expressing cell.
The population of mRNA “transcripts” that are found in a cell is referred to herein as the “transcriptome.” The state of the art for transcriptome sequencing is “RNA-Seq.” See Nature Methods (2008) 5, 621-628. In this approach, mRNAs isolated from a tissue or cell culture are reverse transcribed into complementary DNA (cDNA), and the cDNA is processed and amplified to produce a library of short fragments which are sequenced. The mRNAs in the cell cannot be profiled simply by overlapping the sequences of the cDNA fragments and aligning them to a sequence in the genome. The population of most likely mRNAs is, instead, assembled with the use of complex statistical algorithms, the validity of which is a subject of ongoing research. RNA-Seq does provide information regarding the tissue-specific ‘exome,’ comprising genomic sequences retained in messenger RNAs, including segments specifying protein coding domains.
RNA-Seq methods do not retain certain information about sequence variants largely because individual mRNA transcripts typically include several variable regions, usually separated by a distance far in excess of the sequencer cDNA read lengths. Which combinations of variable regions are found on the same mRNA transcript is thus unclear.
Consider for illustration a gene that encodes a protein with two “optional” domains separated by 1500 nucleotides: a calcium binding domain (C) near the amino terminus and a calmodulin-binding domain (M) near the carboxyl terminus. The transcripts of this gene may be alternatively spliced to retain both domains (CM), only one domain (cM or Cm) or neither (cm) in the final mRNA. The expressed protein may have any of four very different physiological behaviors depending on which domains are present. If an RNA-Seq experiment reveals both variations of both domains, there is no way to deduce which transcripts are actually present in the original mRNA pool. The data support any of the following sets of transcripts: {CM, cm}, {cM, Cm}, {CM, cm, cM, Cm}, etc. This is because the long region connecting domains C and M contains the same sequence in all transcript variants, so no short read can link the variant at one domain to the variant at the other.
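The ambiguity in this illustration can be made concrete with a short enumeration. The following Python sketch is offered only as an illustrative aid and is not part of any described method. It assumes that short reads reveal nothing more than which variants occur at each domain (presence or absence, ignoring abundance), and the names ISOFORMS, observed_variants and consistent_pools are hypothetical.

```python
from itertools import combinations

# The four possible isoforms: uppercase = domain retained, lowercase = domain absent.
ISOFORMS = ("CM", "Cm", "cM", "cm")


def observed_variants(transcripts):
    """Return the per-domain variants that unlinked short reads would reveal."""
    first_domain = {t[0] for t in transcripts}   # 'C' and/or 'c'
    second_domain = {t[1] for t in transcripts}  # 'M' and/or 'm'
    return first_domain, second_domain


def consistent_pools(first_seen, second_seen):
    """List every transcript pool that reproduces the observed per-domain variants."""
    pools = []
    for size in range(1, len(ISOFORMS) + 1):
        for pool in combinations(ISOFORMS, size):
            if observed_variants(pool) == (first_seen, second_seen):
                pools.append(set(pool))
    return pools


# Short reads show both variants of both domains.
pools = consistent_pools({"C", "c"}, {"M", "m"})
for pool in pools:
    print(sorted(pool))
print(len(pools), "pools fit the same short-read evidence")
```

Under these assumptions, several distinct pools, including {CM, cm}, {cM, Cm} and the full set of four isoforms, produce identical per-domain observations, which is the ambiguity described above.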
The challenge for large scale cDNA sequencing, as the preceding illustration shows, is intrinsically linked to the biology of genes of higher species. The uncertainty as to which messages will be expressed in a given cell or stage of cellular differentiation is matched by the uncertainty with which short reads from highly parallel cDNA sequencing can be assigned to particular transcripts. Thus, there is a need to capture more information in the biochemical conduit between genome and proteome.
Fu et al. report that molecular indexing enables quantitative targeted RNA sequencing and reveals poor efficiencies in standard library preparations. See Proc Natl Acad Sci USA, 2014, 111(5):1891-6.
Certain methods have been described as potentially providing large scale transcriptome sequencing. These are limited in their application. Zamore et al., PCT Publication WO 2011/049955 entitled “Deducing Exon Connectivity by RNA-Templated DNA Ligation/Sequencing,” provide certain sequencing methods including a method in which RNA is annealed to oligomers complementary to known alternative splice junctions each bearing a randomized bar code. This is followed by ligation and subsequent sequencing. The method is limited as it requires prior knowledge of the exon junctions and does not sequence each mRNA in its entirety.
Parallel tagged sequencing (PTS) is also a molecular bar-coding method. See Meyer et al., Nature Protocols, 2007, 3, 267-278. The method relies on attaching sample-specific barcoding adapters, which include sequence tags and a restriction site, to blunt-end repaired DNA samples by ligation and strand-displacement. Using the tag sequences, the sample source of each DNA sequence is traced.
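The tracing of reads to sample sources by tag can be sketched as follows. This Python example is a simplified illustration only and does not reproduce the PTS protocol. The 4-nucleotide tags, the sample names and the demultiplex function are hypothetical, and the sketch assumes the tag sits at the start of each read and is read without error.

```python
from collections import defaultdict

# Hypothetical 4-nt sample tags; real PTS tags and adapters differ.
SAMPLE_TAGS = {"ACGT": "sample_1", "TGCA": "sample_2"}
TAG_LENGTH = 4


def demultiplex(reads):
    """Group reads by the sample whose tag prefixes them; strip the tag."""
    by_sample = defaultdict(list)
    for read in reads:
        tag, insert = read[:TAG_LENGTH], read[TAG_LENGTH:]
        sample = SAMPLE_TAGS.get(tag, "unassigned")  # unknown tags are set aside
        by_sample[sample].append(insert)
    return dict(by_sample)


reads = ["ACGTGGGTTT", "TGCAAAACCC", "ACGTCCCAAA", "NNNNGGGGGG"]
result = demultiplex(reads)
for sample, inserts in result.items():
    print(sample, inserts)
```

In this sketch every read from a given sample carries the same tag, so the tag identifies the sample but not the individual molecule from which a read was derived.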
Parameswaran et al., Nucleic Acids Res., 2007, 35(19): e130, published a method to increase barcode diversity combinatorially to enable pooled sequencing of libraries from multiple sample sources. Only the sample-specific tags are used. Individual transcripts are neither distinguishable nor fully sequenced.
Craig et al., Nat Methods., 2008, 5(10): 887-893 describe a method for multiplexed sequencing of targeted regions of the human genome on the Illumina Genome Analyzer using degenerate indexed DNA sequence barcodes ligated to fragmented DNA prior to sequencing.
Halbritter et al. report high-throughput mutation analysis in patients with a nephronophthisis-associated ciliopathy applying multiplexed barcoded array-based PCR amplification and next-generation sequencing. See J Med Genet. 2012, 49:756-767.
Sharon et al. report a single-molecule long-read survey of the human transcriptome. Nat Biotechnol, 2013, 31:1009-14.
References cited herein are not an admission of prior art.