Since the completion of the genome sequences for human and several other organisms, attention has been drawn towards annotation of genomes for functional elements including gene coding transcript units and regulatory cis-acting elements that modulate gene expression levels.
One of the major challenges is the identification of all genes, and of all transcripts expressed from those genes, in human and model organisms. In the annotation of genes, full-length cDNA cloning and sequencing is the most conclusive approach and is viewed as the gold standard for the analysis of transcripts. However, this approach is expensive and slow when applied to a large number of transcripts across a large number of species and biological conditions. Alternatives are short-tag-based approaches such as SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing). These methods extract a 14-20 bp signature to represent each transcript. The traditional SAGE approach, however, relies on the presence of restriction enzyme (RE) recognition sites, such as that of NlaIII, and lacks the capability of defining gene boundaries in the genome. Further, the specificity of the tags is often poor and the information yielded regarding transcript structures is frequently incomplete and ambiguous.
Gene Identification Signature (GIS) analysis, or Paired-End diTag (PET) analysis, is a new methodology which can precisely identify the transcription start sites (TSS) (also referred to as transcription initiation sites (TIS)) and polyadenylation sites (PAS) of expressed genes in the genome to facilitate genome-wide transcriptome profiling (US 2005/0059022). The GIS (or PET) analysis was developed as a modification of the 5′ LongSAGE (5′LS) and 3′ LongSAGE (3′LS) analysis method (Wei, C-L., Ng, P., Chiu, K. P., Wong, C. H., Ang, C. C., Lipovich, L., Liu, E., and Ruan Y., 2004, 5′ LongSAGE and 3′ LongSAGE for transcriptome characterization and genome annotation. Proc. Natl. Acad. Sci. USA 101, 11701-11706). Starting with full-length cDNA clones, GIS links the first ~18 bp (5′ tag) with the last ~18 bp (3′ tag) of each full-length cDNA molecule in such a way that the strand, order (5′ followed by 3′) and orientation are maintained; the tag size varies slightly because of the natural imprecision of Type II restriction enzyme digestion. In this way, libraries comprising GISditags (also referred to as PETs, GIS ditags or ditags) are prepared and sequenced. However, at present no efficient methods for the identification of GISditag sequences from these libraries, or for the construction of GISditag databases, have been disclosed.
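By way of illustration only, the pairing of a 5′ tag with a 3′ tag described above may be sketched in Python as follows. The function name make_ditag, the fixed tag length of 18 bp and the example sequence are assumptions introduced for this sketch and are not taken from the cited references; in practice the tag length varies by a few base pairs, which the sketch does not model.

```python
def make_ditag(cdna: str, tag_len: int = 18) -> str:
    """Join the first and last tag_len bases of a cDNA, 5' tag first.

    Hypothetical helper: a fixed tag length is assumed here, whereas real
    tags vary slightly in length because of enzyme digestion.
    """
    if len(cdna) < 2 * tag_len:
        raise ValueError("cDNA is too short to yield two non-overlapping tags")
    five_prime_tag = cdna[:tag_len]    # first tag_len bases (5' end)
    three_prime_tag = cdna[-tag_len:]  # last tag_len bases (3' end)
    # Strand, order (5' then 3') and orientation are preserved by plain
    # concatenation of the two substrings.
    return five_prime_tag + three_prime_tag


# Made-up sequence: 18 A's, a 10-base interior, then 18 G's.
example_cdna = "A" * 18 + "CCCCCCCCCC" + "G" * 18
ditag = make_ditag(example_cdna)
```

In this sketch the interior of the cDNA is discarded, so the resulting 36 bp ditag records only the two ends of the transcript.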
GISditags must be mapped to the genome in order to identify their corresponding genes. However, no mapping methods have been specifically disclosed for GISditags. Further, there are no existing computational algorithms that are readily applicable for mapping GISditag sequences to the genome. In the past, SAGE and MPSS tags were matched to tag-gene pairs in a virtual database generated from known sequences. With this approach, tags from novel transcripts that are absent from the virtual database cannot be mapped. The two most often used sequence alignment tools are BLAST (basic local alignment search tool) and BLAT (BLAST-like alignment tool). However, they are not designed for short tag sequences. Further, BLAT often gives poor or incorrect results for such tags, while BLAST requires a long time and is thus not suited for large-scale mapping.
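By way of illustration only, the virtual-database matching used for SAGE tags, and its inability to map novel transcripts, may be sketched in Python as follows. The names extract_tag and build_virtual_db, the 14 bp tag length (the CATG site of NlaIII plus 10 downstream bases) and the two example transcripts are assumptions made for this sketch; tag collisions between genes are ignored.

```python
from typing import Dict, Optional

ANCHOR_SITE = "CATG"  # recognition sequence of NlaIII


def extract_tag(transcript: str, tag_len: int = 14) -> Optional[str]:
    """Return the tag starting at the 3'-most anchor site, or None."""
    pos = transcript.rfind(ANCHOR_SITE)
    if pos == -1:
        return None  # no recognition site, so no tag can be derived
    tag = transcript[pos:pos + tag_len]
    return tag if len(tag) == tag_len else None


def build_virtual_db(known: Dict[str, str]) -> Dict[str, str]:
    """Map each derivable tag to the name of its known transcript."""
    db: Dict[str, str] = {}
    for gene, sequence in known.items():
        tag = extract_tag(sequence)
        if tag is not None:
            db[tag] = gene  # simplification: a later gene overwrites a collision
    return db


# Made-up known transcripts for the sketch.
known_transcripts = {
    "geneA": "TTTTCATGAAAAAAAAAACC",
    "geneB": "GGGGCATGCCCCCCCCCCTT",
}
virtual_db = build_virtual_db(known_transcripts)

known_hit = virtual_db.get("CATGAAAAAAAAAA")  # tag from a known transcript
novel_hit = virtual_db.get("CATGTTTTTTTTTT")  # tag from an unknown transcript
```

Because the lookup is an exact match against tags derived from known sequences, the second query returns no gene, which corresponds to the limitation noted above for novel transcripts.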
There is therefore a need in this field of technology for new methods and systems for the organization and analysis of GISditag data, as well as for efficient methods and systems for mapping ditags to the genome.