Since the completion of the genome sequences for human and several other organisms, attention has been drawn towards annotation of genomes for functional elements including gene coding transcript units and regulatory cis-acting elements that modulate gene expression levels.
Currently, there are three main approaches to genome annotation. The first uses existing transcript data to identify gene-coding regions in the genomes. The second uses computational algorithms to statistically predict genes and regulatory elements. The third compares genomic sequences from other vertebrates to find conserved regions, based on the view that functional elements in genomes are conserved during evolution.
Despite considerable success, these approaches are unsatisfactory for determining the complete and precise content of all functional elements in the human genome. As a result, a complete list of genes in the human genome is still unavailable. In particular, not all of the low-abundance and cell-specific genes have been identified. Many gene models suggest that the current genome annotation is incorrect, particularly regarding where transcription starts and ends.
All gene predictions have to be validated by experimental means, and the prospective genes must be cloned in full length for further functional studies. It is therefore clear that many challenges surround the field of human genome annotation.
One of the challenges is the identification of all genes, and all transcripts expressed from those genes, in human and model organisms. In the annotation of genes, full-length cDNA cloning and sequencing is the most conclusive approach and is viewed as the gold standard for the analysis of transcripts. However, this approach is expensive and slow when applied to a large number of transcripts across a large number of species and biological conditions. There are also short-tag-based approaches such as SAGE (serial analysis of gene expression) and MPSS (massively parallel signature sequencing). These short-tag-based methods extract a 14-20 bp signature to represent each transcript. Though these methods are efficient in tagging and counting transcripts in a given transcriptome, the specificity of the tags is often poor and the information yielded regarding transcript structures is frequently incomplete and ambiguous.
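The specificity problem with short tags can be illustrated with a rough calculation. The sketch below estimates how many times a tag of a given length would be expected to occur by chance in a genome. It rests on simplifying assumptions that are not part of the methods described above: a genome of about 3 billion base pairs, uniformly random and independent bases, and a search of both strands. The function name and parameters are hypothetical. Real genomes are repetitive, so actual ambiguity is typically worse than this estimate.

```python
def expected_random_hits(tag_length: int,
                         genome_size: int = 3_000_000_000,
                         both_strands: bool = True) -> float:
    """Expected number of chance occurrences of one specific tag.

    Assumes (illustration only) that the genome is a uniformly random
    sequence over A, C, G, T, so any given tag of length k matches at a
    given position with probability 1 / 4**k.
    """
    strands = 2 if both_strands else 1
    return strands * genome_size / 4 ** tag_length


if __name__ == "__main__":
    # Compare the two ends of the 14-20 bp range mentioned in the text.
    for k in (14, 20):
        print(f"{k} bp tag: ~{expected_random_hits(k):.3g} expected chance hits")
```

Under these assumptions a 14 bp tag is expected to match the genome at roughly twenty positions by chance alone, whereas a 20 bp tag is expected to match by chance far less than once.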
Gene Identification Signature (GIS) ditag sequences, obtained by extracting the interlinked 5′ and 3′ ends of full-length cDNA clones into a ditag structure, provide substantial tag specificity. However, there are no existing computer algorithms that are readily applicable for mapping GIS ditag sequences to the genome. In the past, SAGE and MPSS tags were analyzed using a two-step approach: the tags were first matched to cDNA sequences and then to the genome. In this approach, novel transcripts that do not exist in cDNA databases cannot be mapped. The two most often used sequence alignment tools, BLAST (basic local alignment search tool) and BLAT (BLAST-like alignment tool), are not designed for short tag sequences and often lead to poor or incorrect results.
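The limitation of the two-step approach can be shown with a toy example. The following is a minimal sketch, not the procedure used by any actual SAGE or MPSS pipeline: it uses exact substring matching, and the function name, the in-memory dictionaries, and the sequences and coordinates are all made up for illustration. A tag is assigned a genome location only if it first matches a known cDNA, so a tag from a transcript absent from the cDNA database is left unmapped.

```python
def two_step_map(tags, cdna_db, cdna_loci):
    """Map tags to genome loci by way of a cDNA database.

    tags      : list of tag sequences
    cdna_db   : dict of cDNA id -> cDNA sequence (hypothetical data)
    cdna_loci : dict of cDNA id -> known genome locus of that cDNA
    Returns (mapped, unmapped), where mapped is tag -> list of loci.
    """
    mapped, unmapped = {}, []
    for tag in tags:
        # Step 1: find the cDNAs that contain the tag (exact match only).
        hits = [cid for cid, seq in cdna_db.items() if tag in seq]
        if hits:
            # Step 2: carry the tag over to each matching cDNA's locus.
            mapped[tag] = [cdna_loci[cid] for cid in hits]
        else:
            # No cDNA match: the tag never reaches the genome step.
            unmapped.append(tag)
    return mapped, unmapped


# Made-up example data: one known cDNA and two 14 bp tags.
cdna_db = {"cdna_1": "ATGGCGTACGTTAGCCTAGGATCCAAA"}
cdna_loci = {"cdna_1": "chr1:1000-1027"}
tags = ["GCGTACGTTAGCCT", "TTTTGGGGCCCCAA"]

mapped, unmapped = two_step_map(tags, cdna_db, cdna_loci)
print(mapped)
print(unmapped)
```

Here the second tag stands in for a novel transcript. Because it has no counterpart in the cDNA database, it ends up in the unmapped list even if its sequence is present in the genome.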
Hence, there is a need for an improved transcript mapping method.