One of the most burdensome tasks in the post-genome-sequencing era is the accurate and complete annotation of all genes of a sequenced genome and their products, primarily mRNA transcripts. Bioinformatics analyses of fragmentary experimental data have led to widely varying estimates of the number of human genes. Human EST assembly resulted in 89,000 Unigene clusters; two independent studies using ab initio genome annotation each identified approximately 30,000 genes; and the manually curated RefSeq database contains only 17,000 genes identified with stringent evidence. It is apparent that the technologies currently applied to genome annotation, including computational gene prediction, cDNA cloning and sequencing, and other new technologies, are inefficient, incomplete, and unconvincing.
Computational methods, including homology studies, domain searches, and ab initio gene predictions, have significant limitations and are fallible. Current prediction programs may perform adequately on many ‘internal’ exons, but perform particularly poorly on border exons in UTR regions. They need to be trained on more experimental data. The precise annotation of every gene in complex genomes by computational methods alone is still a distant goal.
Although cDNA cloning and sequencing approaches, including EST, full-length cDNA, and ORESTES, have generated immense amounts of EST and full-length cDNA sequence data, low-abundance and large transcripts are discriminated against during the cloning steps. Library-based cDNA approaches cannot identify all transcripts because of the high redundancy of abundant transcripts and the high percentage of truncated clones. A comprehensive cDNA library approach may be efficient for capturing the first 50-70% of all expressed transcripts, but it soon becomes prohibitively expensive and inefficient for capturing the rest, in particular the rare transcripts.
Genome-wide scans by oligonucleotide microarrays provide another strategy that has the potential to help annotate complex genomes. In this approach, oligo probes representing predicted exons are synthesized, arrayed, and subsequently hybridized to mRNA samples. The experimental data generated would validate true exons. This approach is expected to be efficient for examining expressed transcripts across many different biological stages and environmental conditions. However, a major limitation of this method is its limited ability to convincingly establish the existence of rarely expressed genes, because the signal detection sensitivity of probe hybridization is limited.
Serial Analysis of Gene Expression (SAGE) represents a unique strategy to identify the existence of transcripts and quantify them by counting a small tag for each transcript molecule in a complex transcriptome. Three principles underlie the SAGE methodology: (1) a short sequence tag (10-14 bp) contains sufficient information to uniquely identify a transcript, provided that the tag is obtained from a unique position within each transcript; (2) sequence tags can be linked together to form long serial molecules that can be cloned and sequenced; and (3) quantitation of the number of times a particular tag is observed provides the expression level of the corresponding transcript. The unique feature of SAGE is that a 14-bp sequence is enough to be transcript specific, and that small tags from each transcript can be extracted and concatenated into larger pieces for efficient sequencing analysis. Because all transcripts are represented by small tags of the same size, there is no discrimination in SAGE tag cloning. Essentially all transcripts should be represented in SAGE tags.
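Principle (3) above can be illustrated with a minimal sketch, assuming the tags have already been read from sequenced clones. The function name `tag_abundance` and the tag sequences are hypothetical and serve only to show how tag counts translate into relative expression levels.

```python
from collections import Counter

def tag_abundance(tags):
    """Count each tag and return {tag: (count, fraction of all tags)}."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: (n, n / total) for tag, n in counts.items()}

# Hypothetical 14-bp tags (made-up sequences, not experimental data).
observed = [
    "CATGAAACCCGGGT",
    "CATGTTTGGGCCCA",
    "CATGAAACCCGGGT",
    "CATGAAACCCGGGT",
]
for tag, (n, frac) in sorted(tag_abundance(observed).items()):
    print(tag, n, f"{frac:.2f}")
```

In this sketch a tag observed three times out of four accounts for 75% of the tags, which would be read as a proportionally higher expression level for the corresponding transcript.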
SAGE is described in U.S. Pat. Nos. 5,695,937, 5,866,330 and 6,383,743, and is illustrated in FIGS. 1A and 1B, and also in Velculescu, V. E., Zhang, L., Vogelstein, B., and Kinzler, K. W. (1995). Serial Analysis Of Gene Expression. Science 270, 484-487, as well as Velculescu, V. E., Zhang, L., Zhou, W., Vogelstein, J., Basrai, M. A., Bassett, D. E., Hieter, P., Vogelstein, B., and Kinzler, K. W. (1997). Characterization of the yeast transcriptome. Cell 88. A number of websites devoted to SAGE may also be consulted for teachings on this technique, including the Sagenet website and Sagemap on the National Center for Biotechnology Information's website (a public gene expression data repository and online data access and analysis site; see Lash, A. E., Tolstoshev, C. M., Wagner, L., Schuler, G. D., Strausberg, R. L., Riggins, G. J., and Altschul, S. F. (2000). SAGEmap: a public gene expression resource. Genome Res 10(7), 1051-60). Other websites describing SAGE and its uses may be found by conducting an internet search for Serial Analysis of Gene Expression.
http://www.google.com/search?sourceid=navclient&ie=UTF-8&oe=UTF-8&q=Serial+Analysis+of+Gene+Expression
In brief, mRNA is obtained from a cell or tissue, and reverse transcribed to obtain cDNA (see FIG. 1). The cDNA is then cleaved by a first restriction enzyme (the “Anchoring Enzyme”, typically a 4-base cutter), and the 3′ end of the cDNA is anchored to a bead. The beads are optionally divided into two pools, and the cDNA attached to the beads is ligated to two sets of adaptors or linkers. Each of these adaptors comprises a defined nucleotide sequence for PCR priming and amplification at its 5′ end, as well as a recognition site for a type IIS enzyme (the “Tagging Enzyme”, for example BsmFI or FokI), which directs cleavage by the enzyme at a defined position downstream (3′) of the recognition site. The tags are released by cleavage with the relevant Tagging Enzyme, and ligated together end to end to form ditags. The ditags are then amplified using PCR, digested with the Anchoring Enzyme, and ligated together to form concatamers. Sequencing of the concatamers reveals the identity and frequency of the tags, and provides expression data for the various genes that are transcribed in the cell or tissue. Due to the efficiency of sequencing small tags, SAGE has the potential to capture all expressed transcripts.
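The tag position that results from this procedure can be sketched in silico. The following is a simplified illustration rather than a description of the laboratory protocol: it assumes NlaIII (recognition site CATG) as the Anchoring Enzyme, assumes a tag of 10 bases downstream of that site, and ignores the ditag, PCR, and concatamer steps. The function name `extract_tag` and the example sequence are hypothetical.

```python
def extract_tag(cdna, anchor_site="CATG", tag_len=10):
    """Return the anchoring site plus the next tag_len bases, taken at the
    3'-most anchoring site of the cDNA, or None if no full tag is available.

    anchor_site and tag_len are illustrative assumptions (NlaIII, 10 bases).
    """
    cdna = cdna.upper()
    pos = cdna.rfind(anchor_site)  # 3'-most site: the fragment left on the bead
    if pos == -1:
        return None  # transcript has no anchoring site, so it yields no tag
    end = pos + len(anchor_site) + tag_len
    if end > len(cdna):
        return None  # simplification: skip tags that would run off the end
    return cdna[pos:end]

# Made-up cDNA with two CATG sites; only the 3'-most one defines the tag.
example = "GGG" + "CATG" + "AAAATTTT" + "CATG" + "TTTTGGGGCC" + "AAAAAAAA"
print(extract_tag(example))
```

Because the 3′ end is the part retained on the bead, only the anchoring site closest to that end contributes a tag, which is why each transcript is expected to yield a single tag at a defined position.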
Despite this promise, however, the original SAGE tags are too short for direct mapping to complex genomes. The 14-bp tags are only reliable for mapping onto existing EST or cDNA sequences in databases, or onto small genomes such as that of yeast. This shortcoming limits the application of SAGE to expression profiling and excludes its use for genome annotation. To overcome this problem, the developers of the original method made the SAGE tags longer, simply by taking advantage of a type IIS enzyme, MmeI, that cleaves DNA 20 base pairs away from its recognition site. The modified method is known as LongSAGE, and is described in WO 02/10438. This modification makes the LongSAGE tags specific enough to be directly mapped onto human chromosome sequences (Table 1 in Appendix B). This function is an important addition because new SAGE tags can now be mapped directly onto specific chromosome locations to identify potential new genes or exons, thereby facilitating genome annotation.
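The effect of tag length on mapping specificity can be illustrated with a rough calculation. The sketch below assumes independent, equally frequent bases, a genome of approximately 3 billion base pairs, and a LongSAGE tag length of 21 bp; these figures are assumptions made for illustration, since the text above states only that MmeI cleaves 20 base pairs from its recognition site. The function name `expected_chance_hits` is hypothetical.

```python
def expected_chance_hits(tag_len, genome_size):
    """Expected number of positions on one strand that match a given tag
    purely by chance, assuming independent, equally frequent bases."""
    return genome_size / 4 ** tag_len

GENOME_SIZE = 3_000_000_000  # assumed approximate human genome size in bp

for tag_len in (14, 21):  # 21 bp is an assumed LongSAGE tag length
    hits = expected_chance_hits(tag_len, GENOME_SIZE)
    print(tag_len, 4 ** tag_len, f"{hits:.2g}")
```

Under these assumptions a 14-bp tag is expected to match about 11 genomic positions by chance, whereas a 21-bp tag is expected to match far fewer than one, which is consistent with longer tags being suitable for direct genome mapping. Real genomes are repetitive and not uniformly composed, so the actual numbers differ.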
However, despite its advantages, LongSAGE still has limitations to its effective use. As in the original SAGE method, LongSAGE tags are extracted from essentially arbitrary positions that depend on where the NlaIII sites lie in a transcript sequence, and so provide only ‘internal’ sequence clues for new transcripts. Furthermore, newly identified SAGE tags must go through long and tedious processes such as 5′ and 3′ RACE to extend the information about the existence and characteristics of new genes. Finally, only one tag is generated for each expressed sequence, and the prior art methods do not readily yield further sequence information for the expressed gene.
The present invention seeks to solve these and other problems in the prior art techniques for expression analysis.
Citation or identification of any document in this application is not an admission that such document is available as prior art to the present invention.