Over the past decade, genome projects have produced complete genome sequences for a growing number of organisms. Since 1999 many tools have proven effective in aligning large genomic sequences of two closely related organisms. Such tools include: MUMmer, which is described, e.g., in Delcher et al., “Alignment of whole genomes,” Nucleic Acids Res., pages 2369-2376 (1999); GLASS, which is described, e.g., in Pachter et al., “Human and mouse gene structure: Comparative analysis and application to exon prediction,” Genome Res., pages 950-958 (2000); AVID, which is described, e.g., in Bray et al., “AVID: A global alignment program,” Genome Res., pages 97-102 (2003); DIALIGN, which is described, e.g., in Morgenstern et al., “Exon discovery by genomic sequence alignment,” Bioinformatics, (6):777-787 (2002); LAGAN, which is described, e.g., in Brudno et al., “LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA,” Genome Res., pages 721-731 (2003); BLASTZ, which is described, e.g., in Schwartz et al., “Human-mouse alignments with BLASTZ,” Genome Res., pages 103-107 (2003); and BLAT, which is described, e.g., in Kent, “BLAT—the BLAST-like alignment tool,” Genome Res., (4):656-664 (2002).
Characteristics common to many of these programs include: (i) an assumption that conserved regions of the sequences being aligned appear in the same order and orientation, which may be particularly likely for closely related organisms; (ii) the construction of tables of scores for matches and mismatches between amino acids or nucleotides, which may incorporate penalties for insertions or deletions, and which may be used to obtain mathematically ‘optimal’ alignments; and (iii) the search for exact or spaced exact matches (e.g., in local alignment programs), and the extension of local similarities in both directions in passes directed by specified scoring functions.
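By way of illustration only, characteristic (ii) can be sketched as a small dynamic-programming routine that scores a global alignment from a table of match, mismatch, and gap values. The function name `global_alignment_score` and the default scores (+1, -1, -2) are assumptions chosen for this example and are not taken from any of the programs cited above, which generally use richer scoring matrices and affine gap costs.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the best global alignment score of sequences a and b.

    Illustrative Needleman-Wunsch recurrence with a linear gap penalty.
    The default scores are hypothetical values, not those of any cited tool.
    """
    # Row 0: aligning an empty prefix of a against b[:j] costs j gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        # Column 0: aligning a[:i] against an empty prefix of b costs i gaps.
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + pair,  # align a[i-1] with b[j-1]
                prev[j] + gap,       # a[i-1] aligned to a gap
                curr[j - 1] + gap,   # b[j-1] aligned to a gap
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "ACT"))
```

The score is "optimal" only with respect to the chosen table: changing the match, mismatch, or gap values can change which alignment is preferred, which is why the choice of scoring parameters matters for the programs discussed here.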
However, certain shortcomings may limit the use of many of these programs. First, genomic order and orientation may not be conserved between the species of interest. Second, the scoring matrix (e.g., a PAM or a BLOSUM matrix) that is most appropriate for aligning a set of sequences depends on how closely related the sequences are. Hence, a preliminary estimate of the percentage of similarity between two genomes may be required in order to choose a proper scoring matrix. Third, variation in the rate of evolution across the genome can make it impractical to pick a single universal scoring matrix or set of gap costs, as described, e.g., in Frazer et al., “Cross-species sequence comparisons: A review of methods and available resources,” Genome Res., pages 1-12 (2003). Finally, when using a “match and extend” strategy, many local alignment procedures can pay a steep computational cost in extending short matches in both directions.
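The first shortcoming can be illustrated with a toy example. The sketch below, which is an assumption-laden illustration and not part of any cited program, counts exact k-mer matches between two sequences. When one sequence is an inverted (reverse-complemented) copy of the other, a search that assumes a common orientation may find no seeds at all, whereas the same search on the re-oriented sequence finds every one. The helper names `reverse_complement` and `shared_kmers`, the example segment, and the choice of k = 6 are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of a DNA string over A, C, G, T."""
    comp = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(comp[base] for base in reversed(seq))


def shared_kmers(a, b, k):
    """Count positions j in b whose k-mer b[j:j+k] occurs somewhere in a."""
    kmers_a = {a[i:i + k] for i in range(len(a) - k + 1)}
    return sum(1 for j in range(len(b) - k + 1) if b[j:j + k] in kmers_a)


if __name__ == "__main__":
    segment = "ACGGTCATTGCA"                # hypothetical conserved region
    inverted = reverse_complement(segment)  # same region on the other strand
    # Seeds found when both sequences are assumed to share one orientation.
    print("same orientation:", shared_kmers(segment, inverted, 6))
    # Seeds found after also trying the reverse complement.
    print("reverse complement:",
          shared_kmers(segment, reverse_complement(inverted), 6))
```

Checking both orientations roughly doubles the seed search, which hints at why handling rearranged or inverted regions adds to the cost of cross-species comparison.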
Thus, comparing vertebrate genomes can require efficient cross-species sequence alignment programs. It may be desirable to have a system and method for cross-species genome alignment which reduces the above-mentioned deficiencies.