Currently, three general algorithms are widely used for nucleotide and amino acid sequence similarity searching: Smith-Waterman, FASTA and BLAST. BLAST is a heuristic simplification of the Smith-Waterman algorithm and is also known as the maximal segment pairs algorithm. The first step in FASTA and BLAST is a “word” search with a specific word size (usually two for proteins and six for nucleic acids). BLAST compares the “words” of the two sequences and assigns each individual word match a score. BLAST allows for mismatches and ambiguity in these comparisons. BLAST then tries to join words and find the maximal segment of contiguous matching words. This is called a maximal segment pair (MSP) and represents a matching region containing no gaps. The scores of the individual words are added and an overall score for the MSP is computed. BLAST deals with each of these regions separately, i.e., BLAST does not allow gaps inside matches. As it finds matches, BLAST decides how to align them based on a statistical analysis of the sequences, discarding any match that it judges likely to be random or meaningless. This speeds up the analysis, but has a detrimental consequence: matches of short or frequently appearing sequences may be lost, so that potential off-target annealing sites within a selected oligonucleotide sequence may go undetected.
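The word search and gap-free extension described above can be illustrated with a minimal sketch. It is not the BLAST implementation: it seeds on exact word matches only (BLAST also accepts similar words), uses an assumed +1/-1 scoring scheme and a simple drop-off cutoff, and omits the statistical filtering step. The function names `index_words`, `extend_ungapped` and `best_msp` and all parameter values are hypothetical.

```python
from collections import defaultdict

def index_words(seq, w):
    """Map every length-w word to the positions where it starts in seq."""
    index = defaultdict(list)
    for i in range(len(seq) - w + 1):
        index[seq[i:i + w]].append(i)
    return index

def extend_ungapped(query, subject, qi, si, w, match=1, mismatch=-1, x_drop=3):
    """Extend a word hit in both directions without gaps.

    Returns (score, query_start, query_end, subject_start) for the best
    segment found; query_end is exclusive. Scoring values are assumptions.
    """
    best = score = w * match
    left, right = qi, qi + w
    # Extend to the right until the score falls x_drop below the best seen.
    q, s = qi + w, si + w
    while q < len(query) and s < len(subject) and best - score < x_drop:
        score += match if query[q] == subject[s] else mismatch
        q += 1
        s += 1
        if score > best:
            best, right = score, q
    # Extend to the left, restarting from the best score reached so far.
    score = best
    q, s = qi - 1, si - 1
    while q >= 0 and s >= 0 and best - score < x_drop:
        score += match if query[q] == subject[s] else mismatch
        if score > best:
            best, left = score, q
        q -= 1
        s -= 1
    return best, left, right, si - (qi - left)

def best_msp(query, subject, w=4):
    """Return the highest-scoring gap-free segment pair seeded by a shared word."""
    index = index_words(subject, w)
    best = None
    for qi in range(len(query) - w + 1):
        for si in index.get(query[qi:qi + w], []):
            hit = extend_ungapped(query, subject, qi, si, w)
            if best is None or hit[0] > best[0]:
                best = hit
    return best

print(best_msp("AAACGTACGTCC", "GGACGTACGTTT"))
```

In this toy example the shared region `ACGTACGT` is recovered as a single ungapped segment. A shared region shorter than the word size `w` would never be seeded at all, which is one way short matches are lost.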
FASTA works on a different set of assumptions from BLAST, and hence provides different results. FASTA speeds up the search by comparing several residues at once. It looks for exact matches of this small number of residues (the word) and does not consider ambiguity or approximate matches in the comparison. Once all word matches have been found, FASTA also tries to join them into regions. At the next stage, FASTA takes the 10 best matching regions for each analyzed oligonucleotide and tries to join them into a larger one even though they might be separated: FASTA selects the similarity region accommodating gaps, and computes an overall score for the match including the gaps. Finally, FASTA sorts sequences by the best similarity region found (after joining matches with gaps) and generates a better quality alignment using the Smith-Waterman algorithm to calculate a new and more accurate score. If this score exceeds a given length-dependent threshold, the sequence is considered an acceptable match. This means that, just like BLAST, FASTA may reject some matches that are possibly biologically significant but have low statistical scores. This may result in the loss of some significant matches in low-complexity or very short sequences, matches or motifs. Both BLAST and FASTA are more reliable when working with relatively large polynucleotides, and less adequate when working with short polynucleotides.
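The first FASTA stage described above, exact word matching followed by grouping of the matches into regions, can be sketched as follows. This is a simplified illustration, not the FASTA program itself: it only counts exact word hits per diagonal (the offset between subject and query positions) and reports the most populated diagonals, and it omits the rescoring, gap joining and final Smith-Waterman steps. The function name `ktuple_diagonals` and its parameters are hypothetical.

```python
from collections import Counter, defaultdict

def ktuple_diagonals(query, subject, ktup=6, top=10):
    """Count exact ktup-word matches on each diagonal (subject_pos - query_pos).

    Returns up to `top` (diagonal, hit_count) pairs, most populated first.
    """
    # Look-up table: word -> start positions in the subject sequence.
    positions = defaultdict(list)
    for j in range(len(subject) - ktup + 1):
        positions[subject[j:j + ktup]].append(j)

    # Only exact word matches are counted; no ambiguity is tolerated.
    diagonals = Counter()
    for i in range(len(query) - ktup + 1):
        for j in positions.get(query[i:i + ktup], ()):
            diagonals[j - i] += 1
    return diagonals.most_common(top)

print(ktuple_diagonals("GATTACAGGC", "CCGATTACAGTT", ktup=4))
```

Here five consecutive words of the query fall on the same diagonal, so that diagonal stands out as a candidate region. A single mismatch inside a short oligonucleotide can remove every word hit, which illustrates why short sequences are handled less reliably.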
The Smith-Waterman (SW) algorithm is more sensitive than either BLAST or FASTA. This is because BLAST and FASTA place additional restrictions on the alignments that they report in order to speed up their operation, whereas Smith-Waterman places no restriction on the alignment it reports other than that it must have a positive score in terms of the similarity table used to score the alignment. This makes Smith-Waterman much more rigorous and more sensitive. Since SW searching is exhaustive, it is also the slowest method. We offer an alternative to the Smith-Waterman (SW) algorithm that allows exhaustive cataloguing of perfect matches of meaningful length in any set of sequences of interest (e.g., the human transcriptome).
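For comparison, the exhaustive nature of Smith-Waterman can be seen in a minimal scoring sketch. It assumes a simple match/mismatch scheme and a linear gap penalty in place of a full similarity table, and it returns only the best local score rather than the alignment itself. The function name `smith_waterman_score` and the scoring values are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b by dynamic programming.

    Every cell of the len(a) x len(b) table is evaluated, which is why the
    method is exhaustive and comparatively slow.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Flooring at 0 means only positive-scoring local alignments survive.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

print(smith_waterman_score("ACGTTGCA", "ACGTGCA"))
```

The only filter applied is the floor at zero, which corresponds to the positive-score condition mentioned above. The running time grows with the product of the two sequence lengths.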