One of the most important goals of the human genome project is to provide complete lists of genes for the genomes of humans and model organisms. Complete genome annotation of genes relies on comprehensive transcriptome analysis by experimental and computational approaches. Ab initio predictions of genes must be validated by experimental data. An ideal solution is to clone all full-length transcripts and completely sequence them. This approach has gained recognition recently (Strausberg, R. L., et al., 1999, Science, 286, 455-457) and progress has been made (Jongeneel, C. V., et al., 2003, Proc Natl Acad Sci USA, 100, 4702-4705). However, due to the complexity and immense volume of transcripts expressed in the various developmental stages of an organism's life cycle, complete sequencing analysis of all the different transcriptomes remains unrealistic.
To get around such a dilemma, a cDNA tagging strategy that obtains partial sequences that represent full transcripts has been developed and widely applied in determining genes and characterizing transcriptomes in the past decade.
In the expressed sequence tag (EST) approach, cDNA clones are sequenced from the 5′ and/or 3′ ends (Adams, M., et al., 1991, Science, 252, 1651-1656). Each EST sequence read generates, on average, a 500 bp tag per transcript. The number of identical or overlapping ESTs indicates the relative level of gene expression activity. Though ESTs are effective in identifying genes, it is prohibitively expensive to tag every transcript in a transcriptome. In practice, sequencing usually ceases after 10,000 or fewer ESTs are obtained from a cDNA library in which millions of transcripts might be cloned.
To increase the efficiency of sequencing and counting large numbers of transcripts, Serial Analysis of Gene Expression (SAGE) (Velculescu, V. E., et al., 1995, Science, 270, 484-487; Saha, S., et al., 2002, Nature Biotechnology, 20, 508-512; U.S. Pat. Nos. 6,498,013; 6,383,743) and the more recent Massively Parallel Signature Sequencing (MPSS) technique (Mao, C., et al., 2000, Proc Natl Acad Sci USA, 97, 1665-1670; Brenner, S., et al., 2000, Nature Biotechnology, 18, 630-634) were developed based on the fact that a short signature sequence (14-20 bp) of a transcript can be sufficiently specific to represent that gene.
Experimentally, short tags can be extracted from cDNA (one tag per transcript). Such short tags can be efficiently sequenced either by a concatenation tactic (as in SAGE) or by a hybridization-based methodology (as in MPSS). For example, in SAGE, multiple tags are concatenated into long DNA fragments and cloned for sequencing. Each SAGE sequence read can usually reveal 20-30 SAGE tags, so a modest SAGE sequencing effort of fewer than 10,000 reads can provide significant coverage of a transcriptome. Transcript abundance is measured by simply counting the frequency of each SAGE tag.
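The counting step described above can be illustrated with a minimal sketch. It assumes the tags have already been extracted from the concatenated reads and are supplied as plain strings; the function names and the example tag sequences are hypothetical and are not part of any published SAGE software.

```python
from collections import Counter


def count_tags(tags):
    """Count how many times each tag sequence occurs (case-insensitive)."""
    return Counter(tag.upper() for tag in tags)


def relative_abundance(counts):
    """Convert raw tag counts into fractions of all tags observed."""
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}


def tags_from_reads(num_reads, tags_per_read):
    """Estimate the number of tags obtained from a sequencing effort."""
    return num_reads * tags_per_read


# Hypothetical 14 bp tags; in practice these come from sequenced concatemers.
tags = ["ACGTACGTACGTAC", "ACGTACGTACGTAC", "TTGCATTGCATTGC", "acgtacgtacgtac"]
counts = count_tags(tags)
abundance = relative_abundance(counts)

# Tag yield for 10,000 reads at 20 and 30 tags per read.
low_yield = tags_from_reads(10_000, 20)
high_yield = tags_from_reads(10_000, 30)
```

Under the figures quoted above, 10,000 reads at 20-30 tags per read correspond to roughly 200,000-300,000 tags, compared with at most 10,000 tags from the same number of EST reads.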
With the availability of many assembled genome sequences in public databases, the use of a short tag strategy for transcriptome characterization is becoming popular (Jongeneel, C. V., et al., 2003, Proc Natl Acad Sci USA, 100, 4702-4705). In theory, short DNA tags of about 20 bp can be specifically mapped to a single location within a complex mammalian genome and can uniquely represent a transcript in the context of the whole transcriptome. In reality, however, there is still a large number of “ambiguous” SAGE tags (14-21 bp) and MPSS tags (17 bp) that map to multiple locations in a genome and may be shared by many genes. Limited by the availability of type II restriction enzymes that can cut tags longer than 21 bp, the SAGE method currently cannot generate longer tags to improve specificity.
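The theoretical claim that a tag of about 20 bp can map to a single genomic location can be checked with a back-of-the-envelope calculation. The sketch below assumes a genome of about 3 billion bp with uniformly random, independent bases; both are simplifying assumptions introduced here for illustration, and the function names are hypothetical. Real genomes contain repeats and gene families, which is one reason ambiguous tags persist in practice even where this model predicts uniqueness.

```python
def expected_random_hits(tag_length, genome_size, both_strands=True):
    """Expected chance occurrences of one specific tag in a random genome.

    Assumes each base is drawn uniformly and independently from A, C, G, T,
    so a given tag matches any one position with probability 1 / 4**tag_length.
    """
    positions = genome_size - tag_length + 1
    strands = 2 if both_strands else 1
    return strands * positions / 4 ** tag_length


def min_length_for_unique(genome_size):
    """Smallest tag length whose sequence space exceeds the genome size."""
    k = 1
    while 4 ** k <= genome_size:
        k += 1
    return k


# Assumed approximate haploid genome size in base pairs.
GENOME_BP = 3_000_000_000

for k in (14, 17, 20, 21):
    print(k, expected_random_hits(k, GENOME_BP))

print(min_length_for_unique(GENOME_BP))
```

Under this idealized model, a 14 bp tag is expected to occur many times by chance, whereas a 20 bp tag is expected to occur far less than once, which is consistent with the tag lengths discussed above.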
Further, SAGE and MPSS methods only produce a single signature per transcript within the gene. In view of the “internal” nature of the tag in a transcript, these methods provide only limited positional and structural information.
Therefore, despite their usefulness in enhancing sequencing efficiency, the utility of methods such as SAGE or MPSS is severely undermined by their lack of specificity and consequent inconclusiveness.
There is a need in the art for methods that retain the sequencing efficiency of tag-based methods while improving the use of the tagging strategy for transcriptome characterization and facilitating the annotation of genomes.