Current methods for RNA-seq library preparation attempt to uniformly sample all sequences across every RNA molecule, optimally with sufficient overlap to allow de novo reassembly of the RNA sequences from which they derive, or alternatively, to allow inference of RNA sequence by alignment with reference sequences. RNA-seq methods use various length nucleic acid molecules, and from these molecules, generate sequence reads of varying length. Some methods can generate short sequence reads of ˜125 nt in length, and can include almost every sequence in the transcriptome, including splice junctions (1, 2). These methods are referred to herein as “standard RNA-seq library preparation methods” or the “standard methods”. These standard methods have in common that the libraries of sequences that they contain correspond to the sequences of genes, and more directly, from the messenger RNAs (i.e. mRNAs) transcribed from genes. The libraries include RNA sequences from DNA regions that are not necessarily considered to be genes, including but not limited to microRNAs, short interfering RNAs, long non-coding RNAs, and others. Similar libraries contain sequences directly obtained from DNA, including but not limited to genomic DNA, organelle DNA, mitochondrial DNA, chloroplast DNA, pathogen DNA, commensal organism DNA, viral DNA, parasite DNA, and symbiotic organism DNA. The term “standard methods” will be used herein to refer to both RNA and DNA library preparation methods.
One common goal of nucleic acid sequencing is to quantify the presence of nucleic acid molecules, and another common goal is to reconstruct a partial, a nearly complete, or a complete representation of a target molecule. This is typically done by concatenating partial sequence reads according to shared homology. Alternatively, a partial, a nearly complete, or a complete representation of a target molecule can often but not always be assembled by homology to one or more reference sequences.
Some existing standard methods for high-throughput nucleic acid sequencing produce sequence “reads” that are short relative to the length of the molecules being sequenced and relative to molecular features of interest. The molecules being sequenced will be referred to herein as “target molecules”.
As an example of what may be encountered using other sequencing technologies, when a sequencing method gives reads of about 100 nt, alternative features that are greater than 100 nt apart can give rise to computational ambiguity.
As a further example of what may be encountered using other sequencing technologies, in certain cases, short sequence reads including sequences from one exon, sequences from two exons and an intervening splice junction, or sequences from two or more exons and two or more splice junctions can be insufficient to reconstruct or infer the structures of the molecules from which they derive. As a further example, target DNA molecules sometimes contain alternative regions that are flanked by shared regions, such that sequence reads are too short to reconstruct the molecules from which they derive.
Computational problems arise when using short sequence reads to reconstruct target molecule sequences. Complications in analysis of RNA arise due to differential promoter usage, splicing isoforms, alternative polyadenylation, and opposite strand transcription. Genes often encode complex patterns of isoforms due to alternative splicing (2-4). Algorithms developed to identify isoform models from short sequence reads produced by RNA-seq data alone are limited to cases in which the proportions of isoforms can be determined or estimated algebraically from exon and splice junction frequencies, and these limitations have been discussed (5-13). Other approaches for assembling isoform models exist, including pathway and probability models. In most cases, the correct isoform model is indistinguishable from incorrect models when linear equations that relate the frequencies of isoforms to the frequencies of exons and exon junctions are under-constrained, in which case, the correct model is considered unidentifiable. This constraint is not circumvented by pathway or probability approaches to isoform model identification. In the sense used here, unidentifiability is not related to errors or inadequacies in measurement. There exist cases in which a gene produces minor, or rare, isoforms which may or may not be of interest, and which may or may not be detected. In such cases, there may be a preferred model in which some minor isoforms are ignored and which comprises an acceptable approximation to the correct model. A preferred model is, therefore, an approximation to the correct model in which various isoforms, usually rare isoforms, may be excluded from analysis. A preferred model need not account for all isoforms, such as minor or rare isoforms. As is the case for the correct model in which all isoforms are accounted for, a preferred model is rendered unidentifiable when the equations that relate the frequencies of isoforms to the frequencies of exons and exon junctions are under-constrained. In such cases, both the correct and preferred models are unidentifiable. For example, when there are five exons, ABCDE, and the actual isoforms are ABCDE, ACDE, ABCE, and ACE, and are present in measurable quantities, neither the correct model, nor the preferred model can be identified from short sequence reads. FIG. 1 shows that an algebraic model based on read frequencies for a hypothetical gene with this isoform structure has an infinite number of solutions. Some RNA-seq methods produce possibly non-overlapping sequences from both ends of the library molecules, producing so called “paired end reads”. Paired end reads do not solve the isoform problem (13). In real data, measurement error can or will cause convergence on a fictitious model. FIG. 1 is generally representative of the state of the art prior to the present invention and problems associated therewith.
This is an important problem to be addressed, given the potential phenotypic consequences of changes in alternative splicing in normal development and pathology. As a non-limiting example, in normal cortical development, the Numb gene produces isoforms numb1 and numb3 in undifferentiated cortical progenitors, whereas isoforms numb2 and numb4 are expressed during differentiation; re-expression of numb1 or numb3 resulted in alteration of neuronal development (14). As another non-limiting example, alternative isoforms of androgen receptor, often caused by mutation, lead to constitutively active androgen receptor, which is thought to drive cancer cell proliferation in castration-resistant prostate cancer (15).
Hiller et al. (6) considered 2256 genes known to yield multiple isoforms and found that only 68 genes (3%) lead to unidentifiable models using these identifiability criteria. Hiller et al. (6) pointed out that most of the 68 failed models contained an isoform with an exon flanked by two alternative splicing events, i.e. isoforms such as ACE, and that this becomes a significant problem when there are 5 or more isoforms. At that time, RefSeq associated only 133 of 2256 genes with 5 or more isoforms, whereas, currently, three times as many genes, 405 of 2256 genes, are associated with 5 or more isoforms, suggesting that about 9% of genes have isoform structures that cannot be modeled, but this estimate could easily increase as knowledge of the transcriptome increases.