1. Field of the Invention
The present invention relates to the field of aligning data sequences, and, more specifically, to a method and system for aligning multiple data sequences to achieve the best commonality of the sequences.
2. Discussion of the Prior Art
The problem of aligning a large number of sequences arises frequently and in many different application areas (e.g., biology, data mining in databases, computer security, etc.). It has been observed that most traditional methods of aligning multiple sequences do not work well on a large number of sequences, on the order of N>10.
The multiple sequence alignment problem has been studied for at least the last fifteen years in the context of Computational Biology, and it is well known to be a rather difficult problem. Contributing to the difficulty is that, in its most general form, the problem is difficult to model to the satisfaction of Evolutionary Biologists, Geneticists and other users. The most popular and successful approach to date has been a dynamic programming technique that evaluates the similarity of the different sequences using various scoring schemes, each a function of the edit distance together with gap penalties. Dynamic programming relies on identifying good penalty scores for matches and mismatches, which is difficult to realize in real-world applications. The method is best suited to small numbers of sequences, on the order of N<6.
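By way of illustration only, the dynamic programming technique described above can be sketched for the two-sequence case as follows. The sketch assumes a simple linear gap penalty; the function name and the particular score values (match=1, mismatch=-1, gap=-2) are hypothetical choices made for the example and are not taken from any specific prior art method.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Score of the best global alignment of a and b (linear gap penalty).

    The score values are illustrative assumptions; choosing them well is
    the practical difficulty noted in the text.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        score[i][0] = i * gap  # a[:i] aligned entirely against gaps
    for j in range(1, cols):
        score[0][j] = j * gap  # b[:j] aligned entirely against gaps
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,      # gap inserted in b
                score[i][j - 1] + gap,      # gap inserted in a
            )
    return score[-1][-1]
```

Changing the match, mismatch and gap values can change which alignment is judged best, which is one reason suitable scores are hard to fix in advance.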
Where more than two sequences are to be aligned, one prior art approach has been to perform pairwise alignment, in which two of the N sequences are analyzed at a time and an N-wise alignment is built from the pairwise alignments. This approach works well for small values of N, on the order of N<6; for large values of N, however, additional constraints are required to give meaningful alignments.
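One well-known way of building an N-wise alignment from pairwise alignments is the center-star heuristic, sketched below purely as an illustration; it is not asserted to be the specific prior art approach referred to above. The function names, the score values and the use of "-" as the gap symbol are assumptions made for the example.

```python
def align_pair(a, b, match=1, mismatch=-1, gap=-2):
    """Global pairwise alignment; returns (score, gapped_a, gapped_b)."""
    n, m = len(a), len(b)
    s = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s[i][0] = i * gap
    for j in range(1, m + 1):
        s[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            s[i][j] = max(s[i - 1][j - 1] + sub, s[i - 1][j] + gap, s[i][j - 1] + gap)
    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and s[i][j] == s[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and s[i][j] == s[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return s[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def center_star(seqs):
    """Build an N-wise alignment from pairwise alignments to a center sequence."""
    n = len(seqs)
    # The center is the sequence with the highest total score against the others.
    totals = [sum(align_pair(seqs[i], seqs[j])[0] for j in range(n) if j != i)
              for i in range(n)]
    c = max(range(n), key=totals.__getitem__)
    order = [c] + [k for k in range(n) if k != c]
    rows = [seqs[c]]  # rows[0] is always the (gapped) center
    for k in order[1:]:
        _, ca, sa = align_pair(seqs[c], seqs[k])
        cur = rows[0]
        merged = [[] for _ in rows]
        new = []
        p = q = 0
        while p < len(cur) or q < len(ca):
            if p < len(cur) and cur[p] == "-":
                # Gap already present in the center: keep the column, pad the new row.
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new.append("-"); p += 1
            elif q < len(ca) and ca[q] == "-":
                # New gap in the center: pad every existing row.
                for r in merged:
                    r.append("-")
                new.append(sa[q]); q += 1
            else:
                # Same center character in both alignments: columns line up.
                for r, row in zip(merged, rows):
                    r.append(row[p])
                new.append(sa[q]); p += 1; q += 1
        rows = ["".join(r) for r in merged] + ["".join(new)]
    result = [None] * n
    for idx, row in zip(order, rows):
        result[idx] = row
    return result  # gapped rows, in the same order as the input
```

For example, `center_star(["ACGT", "AGT", "ACT", "ACGGT"])` returns four gapped rows of equal length. Because every row is fixed relative to the center only, errors made in an individual pairwise alignment are carried into the N-wise result, which is consistent with the need for additional constraints as N grows.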
It is apparent that aligning multiple sequences is computationally very demanding when the number of sequences is large.