Prokaryotic genes differ from eukaryotic genes in that every base pair in a prokaryotic gene is reflected in the mRNA base sequence. Eukaryotic genes often contain intervening sequences that do not appear in the mRNA base sequence for the gene product. The DNA sequences that are expressed and retained in the mature mRNA are called “exons”. The intervening sequences that are not expressed are called “introns”.
Genomic DNA sequences, including both exons and introns, are transcribed to produce a precursor of the mature mRNA, or pre-mRNA. Genes from eukaryotic organisms contain a variable number of introns of varying sizes, which range from slightly more than 20 bp to 800 kb. For example, the mouse Tbc1d2 gene, encoding TBC1 domain family, member 2, contains 12 introns, and the mouse Col1a1 gene, encoding procollagen, type I, alpha 1, contains 50 introns.
During the processing of pre-mRNA, the introns are excised and the exons are joined together to generate a mature mRNA, which is exported into the cytoplasm for translation into protein. Aberrations in pre-mRNA splicing play an essential role in almost every known disease with a genetic aetiology, in disease susceptibility and severity, and perhaps in all aspects of life, including development, differentiation, aging and cancer. See Baralle, D., Lucassen, A., Buratti, E., Missed threads. The impact of pre-mRNA splicing defects on clinical practice. EMBO Rep. 2009:10(8):810-6 (“Baralle”); Cooper, T. A., Wan, L., Dreyfuss, G., RNA and disease. Cell 2009:136(4):777-93 (“Cooper”); Belfiore, A., Frasca, F., Pandini, G., et al., Insulin receptor isoforms and insulin receptor/insulin-like growth factor receptor hybrids in physiology and disease. Endocr. Rev. 2009:30(6):586-623 (“Belfiore”).
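The net effect of intron excision and exon joining can be pictured with a minimal Python sketch. It is only an illustration of the outcome, not of the biochemical mechanism: the function name `splice`, the 0-based half-open interval convention, and the toy sequence are all assumptions made for this example, and the intron coordinates are taken as already known.

```python
def splice(pre_mrna, introns):
    """Remove introns and join the flanking exons.

    `introns` is a list of (start, end) pairs using 0-based, half-open
    coordinates on `pre_mrna` (an assumed convention for this sketch).
    """
    exons = []
    cursor = 0  # start of the exon currently being collected
    for start, end in sorted(introns):
        if start < cursor or end < start or end > len(pre_mrna):
            raise ValueError("introns must be non-overlapping and inside the sequence")
        exons.append(pre_mrna[cursor:start])  # exon upstream of this intron
        cursor = end                          # skip over the intron
    exons.append(pre_mrna[cursor:])           # final exon
    return "".join(exons)


# Toy pre-mRNA: exon 1 + one intron (GU...AG) + exon 2.
toy_pre_mrna = "AUGGCC" + "GUAAGUCCCUAG" + "UUUGAA"
mature = splice(toy_pre_mrna, [(6, 18)])
print(mature)
```

With the toy input above, the intron occupying positions 6 to 17 is dropped and the two exons are concatenated. The hard part in practice is not this joining step but determining where the intron boundaries lie.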
Introns are removed from pre-mRNAs via two consecutive trans-esterification reactions before mature mRNAs are exported from the nucleus into the cytoplasm for translation into proteins. See Black, D. L., Mechanisms of alternative pre-messenger RNA splicing. Annu. Rev. Biochem. 2003:72:291-336 (“Black”). Intron removal from pre-mRNAs is mediated by spliceosomes, which are known to comprise several hundred proteins and five small nuclear RNAs (snRNAs) packaged as small nuclear ribonucleoprotein particles (snRNPs). See Black; Sanford, J. R., Gray, N. K., Beckmann, K., et al., A novel role for shuttling SR proteins in mRNA translation. Genes Dev. 2004:18(7):755-68 (“Sanford”); Moore, M. J., From birth to death: the complex lives of eukaryotic mRNAs. Science 2005:309(5740):1514-8 (“Moore”). In brief, the conserved 5′ intronic sequence of the pre-mRNA, GURAGU, base-pairs with the 5′ end of U1 snRNP, and the conserved branch-point and 3′ splice site of the pre-mRNA are recognized by U2 snRNP. See Black. The pre-assembled U4/U6.U5 tri-snRNP then associates with the pre-mRNA and the snRNPs already bound to it. The resulting dynamic rearrangement leads the 2′-hydroxyl of the branch-point adenosine to attack the 5′ splice site, immediately after the last nucleotide of the 5′ exon, producing the “free” 5′ exon and the lariat intron-3′ exon intermediate. In the second step, the 3′ hydroxyl of the 5′ exon attacks the 3′ splice site to generate the spliced mRNA and the lariat intron product.
Many approaches have been developed to predict pre-mRNA splicing and alternative splicing, with only limited success. Introns were first identified by sequences that are highly conserved among different eukaryotic organisms: they begin with GTRAGT and end with (C/T)AG. See Black. Traditionally, alternative splice variants were identified by aligning different cDNAs/ESTs to different regions of the same genomic sequence. See Zhuo, D., Zhao, W. D., Wright, F. A., Yang, H. Y., and Wang, J. P. et al., Assembly, annotation, and integration of UNIGENE clusters into the human genome draft. Genome Res. 11(5): 904-918 (2001) (“Zhuo I”); Brent, M. R., Steady progress and recent breakthroughs in the accuracy of automated genome annotation. Nat Rev Genet 9(1): 62-73 (2008) (“Brent”); Kim, N. and Lee, C., Bioinformatics detection of alternative splicing. Methods Mol Biol 452: 179-197 (2008) (“Kim”); Bonizzoni, P., Mauri, G., Pesole, G., Picardi, E., Pirola, Y. et al. Detecting alternative gene structures from spliced ESTs: a computational approach. J Comput Biol 16(1): 43-66 (2009) (“Bonizzoni”).
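The consensus-based identification described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a reconstruction of any published tool: the function name `candidate_introns` and the `min_len` cutoff of 20 bp are hypothetical choices for this example, and the sketch uses only the two motifs named in the text (GTRAGT at the start, where R is A or G, and (C/T)AG at the end).

```python
import re

# Lookahead patterns so that overlapping motif occurrences are all reported.
DONOR = re.compile(r"(?=GT[AG]AGT)")   # 5' end of the intron: GTRAGT
ACCEPTOR = re.compile(r"(?=[CT]AG)")   # 3' end of the intron: (C/T)AG


def candidate_introns(dna, min_len=20):
    """Return every (start, end) span that begins with GTRAGT and ends with (C/T)AG.

    Coordinates are 0-based and half-open. `min_len` is an assumed
    minimum intron length used only to discard implausibly short spans.
    """
    dna = dna.upper()
    donor_starts = [m.start() for m in DONOR.finditer(dna)]
    acceptor_ends = [m.start() + 3 for m in ACCEPTOR.finditer(dna)]
    return [
        (start, end)
        for start in donor_starts
        for end in acceptor_ends
        if end - start >= min_len
    ]


# Toy genomic fragment: exon + intron (GTAAGT ... CAG) + exon.
toy_dna = "ATGGCC" + "GTAAGT" + "ACGTACGTACGTAC" + "CAG" + "GGTTAA"
for start, end in candidate_introns(toy_dna):
    print(start, end, toy_dna[start:end])
```

Because the sketch pairs every donor-like motif with every downstream acceptor-like motif, the number of candidates grows quickly on real genomic sequence, where short motifs such as CAG and TAG occur frequently by chance. This is one way to see why consensus sequences alone give only limited predictive power.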
Comparative analyses exploit homology searches to identify highly conserved exon-intron boundaries. See Lee, C., Wang, Q., Bioinformatics analysis of alternative splicing. Brief Bioinform 6(1): 23-33 (2005) (“Lee and Wang”). Two approaches have been used: inter-genomic or cross-species comparisons. See Clark, A. G., Eisen, M. B., Smith, D. R., Bergman, C. M., Oliver, B. et al. Evolution of genes and genomes on the Drosophila phylogeny. Nature 450(7167): 203-218 (2007) (“Clark”). The effectiveness of both approaches is limited by the constraints of phylogenetic distance and by the homologies available within databases. Neural networks, Fourier transforms and Markov models have also been developed to predict gene structures. See Lu, D. V., Brown, R. H., Arumugam, M., Brent, M. R., Pairagon: a highly accurate, HMM-based cDNA-to-genome aligner. Bioinformatics 25(13): 1587-1593 (2009) (“Lu”). These statistical programs require a set of parameters, which are often estimated from training datasets of well-characterized sequences. See Brent.
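To make the dependence on training data concrete, the following Python sketch estimates the parameters of a first-order Markov chain over nucleotides from a set of training sequences and then scores a new sequence. It is a deliberately minimal stand-in and does not reproduce the models used by the cited programs. The function names `train_markov` and `log_likelihood`, the add-one pseudocount, and the toy training sequence are assumptions introduced for illustration.

```python
import math

ALPHABET = "ACGT"


def train_markov(sequences, pseudocount=1.0):
    """Estimate transition probabilities P(next base | current base).

    Sequences are assumed to contain only uppercase A, C, G, T.
    The pseudocount (an assumed smoothing choice) keeps unseen
    transitions from receiving zero probability.
    """
    counts = {a: {b: pseudocount for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for current, nxt in zip(seq, seq[1:]):
            counts[current][nxt] += 1
    model = {}
    for a in ALPHABET:
        total = sum(counts[a].values())
        model[a] = {b: counts[a][b] / total for b in ALPHABET}
    return model


def log_likelihood(seq, model):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(model[a][b]) for a, b in zip(seq, seq[1:]))


# Toy training set; a real program would use well-characterized sequences.
model = train_markov(["ACACACAC"])
print(log_likelihood("ACAC", model), log_likelihood("AAAA", model))
```

A sequence that resembles the training data scores higher than one that does not, which also shows the limitation noted above: the estimated parameters, and therefore the predictions, are only as good as the training set.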
Deep sequencing of the human transcriptome makes it possible to identify novel splice sites. See Hartmann, L., Theiss, S., Niederacher, D., et al., Diagnostics of pathogenic splicing mutations: does bioinformatics cover all bases? Front. Biosci. 2008:13:3252-72 (“Hartmann”); Sultan, M., Schulz, M. H., Richard, H., et al., A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science 2008:321(5891):956-60 (“Sultan”). Using polyA capture, RNA-seq and other methods, Mangone et al. identified large numbers of cis- and trans-alternative splicing isoforms originating from C. elegans 3′ UTRs. See Mangone, M., Manoharan, A. P., Thierry-Mieg, D., et al., The landscape of C. elegans 3′ UTRs. Science:329(5990):432-5 (“Mangone”). Surprisingly, more than 23,000 introns have been identified in D. melanogaster using paired-end RNA sequencing and RNA-seq. See Soller, M., Pre-messenger RNA processing and its regulation: a genomic perspective. Cell. Mol. Life Sci. 2006:63(7-8):796-819 (“Soller”); Chen, M., Manley, J. L., Mechanisms of alternative splicing regulation: insights from molecular and genomics approaches. Nat. Rev. Mol. Cell. Biol. 2009:10(11):741-54 (“Chen”).
To explain the diversity and specificity of pre-mRNA splicing and alternative splicing, exonic and intronic splicing enhancers and silencers have been suggested as potential candidates for splicing codes. See Fu, X. D., Towards a splicing code. Cell 2004:119(6):736-8 (“Fu”); Matlin, A. J., Clark, F., Smith, C. W., Understanding alternative splicing: towards a cellular code. Nat. Rev. Mol. Cell. Biol. 2005:6(5):386-98 (“Matlin”); Wang, G. S., Cooper, T. A., Splicing in disease: disruption of the splicing code and the decoding machinery. Nat. Rev. Genet. 2007:8(10):749-61 (“Wang”). More recently, Barash et al. used computational methods to assemble several hundred RNA features (the “splicing code”) to predict tissue-dependent changes in alternative splicing for thousands of exons. See Barash, Y., Calarco, J. A., Gao, W., et al., Deciphering the splicing code. Nature:465(7294):53-9 (“Barash”). Although this splicing code model may explain some tissue-dependent alternative splicing, unlike the genetic code it fails to explain the conundrums of universality, diversity, specificity and fidelity of pre-mRNA splicing, or the nature of splice site choice in alternative splicing. See Soller; Chen.
Ever since their discovery about 30 years ago, introns have intrigued the scientific community and stimulated debate about the nature and timing of their origin. See Black; Roy, S. W., Gilbert, W., The evolution of spliceosomal introns: patterns, puzzles and progress. Nat. Rev. Genet. 2006:7(3):211-21 (“Roy I”); Rodriguez-Trelles, F., Tarrio, R., Ayala, F. J., Origins and evolution of spliceosomal introns. Annu. Rev. Genet. 2006:40:47-76 (“Rodriguez-Trelles”). There has also been curiosity about the apparent recent explosion in intron number in mammals and its contribution to expanded protein diversity and regulation through alternative splicing pathways. See Pan, Q., Shai, O., Lee, L. J., et al., Deep surveying of alternative splicing complexity in the human transcriptome by high-throughput sequencing. Nat. Genet. 2008:40(12):1413-5 (“Pan”); Nilsen, T. W., Graveley, B. R., Expansion of the eukaryotic proteome by alternative splicing. Nature:463(7280):457-63 (“Nilsen”). Correct removal of introns from genes has become a central issue in medical research and the biological sciences. However, there are currently no known methods to accurately identify introns, that is, to accurately define exon/intron boundaries.