DNA sequences of entire genomes of different species are being determined at a rapid rate. It is incumbent on the bioinformatics community to understand the structural variations and functions of these genomes. In addition, some finished versions of genome data contain gaps where data could not be acquired, and draft versions of genomic sequence data may consist of pieces whose relative order and orientation are difficult to determine. Dealing with such incomplete data places new demands upon integrative systems tools, particularly when two or more genomes are being compared. The bioinformatics community needs to be able to handle gaps more effectively.
In conventional approaches, comparing sequences across genomes is a major problem. For extremely similar sequences, there exist so-called “greedy” alignment methods that compute optimal alignments. These algorithms allow gaps in the alignments and are extremely efficient, but they work well only for very simple alignment scoring schemes. For richer scoring schemes (such as those involved in aligning large stretches of a single genome and in comparing multiple genomes), these greedy methods lose their efficiency edge over dynamic programming.
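For orientation, the dynamic programming baseline referred to above can be sketched as a textbook global alignment (Needleman-Wunsch) score computation with a linear gap penalty. This is only an illustrative sketch and not a method described in this document: the function name `global_alignment_score` and the scoring values (`match`, `mismatch`, `gap`) are assumptions chosen for the example.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b.

    Textbook Needleman-Wunsch recurrence with a linear gap penalty.
    The scoring values are illustrative assumptions, not recommendations.
    """
    # prev[j] holds the best score for aligning a[:i-1] with b[:j].
    # Row 0: aligning the empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against the empty prefix of b costs i gaps.
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b (a[i-1] left unpaired)
                curr[j - 1] + gap,  # gap in a (b[j-1] left unpaired)
            )
        prev = curr
    return prev[len(b)]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

The running time is proportional to the product of the two sequence lengths regardless of how similar the sequences are. Greedy methods avoid this cost for near-identical inputs, but only under simple scoring schemes.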
Conventional alignment methods for three or more sequences are almost entirely geared toward comparison of protein sequences based on putative codons, that is, sets of three nucleic acid bases encoding a single amino acid. This may be because few examples exist of genomic sequence data from several similar species. Also, sequence comparisons and homology analyses are done on a binary basis. This conserves computational resources but ignores biochemical information.
There is a need for an improved solution that overcomes the shortcomings of conventional tools for sequence alignment, similarity analysis, and gene sequence comparison.