Research in many areas of biotechnology, such as drug discovery, disease analysis, and crop improvement, involves matching fragments of sequences generated from a sample against long sequences that represent biological components. For example, the human genome is a double helix made of two sequences of nucleotides, where one sequence is the complement of the other. DNA sequences use 4 nucleotides; RNA sequences likewise use 4 nucleotides (with u in place of t), while protein sequences are built from 20 amino acids rather than nucleotides. Each nucleotide in a DNA sequence is represented by a character from the alphabet {a,c,t,g}. Each strand of the double helix that constitutes the human genome is a sequence of more than 3 billion characters. The 2 strands are not independent: for each character position in one strand, there is a complementary character in the corresponding position of the second strand. The characters a and t are complements of one another, as are the characters c and g. Each character is also called a base; hence a genomic sequence is often said to contain more than 3 billion base pairs (a base paired with its complement).
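The complement relationship between the two strands can be illustrated with a minimal Python sketch. The function name `complement` and the sample strand are illustrative assumptions, not part of any particular tool; the sketch only applies the a/t and c/g pairing position by position.

```python
# Complement table for the DNA alphabet {a, c, t, g}:
# a <-> t and c <-> g.
COMPLEMENT = str.maketrans("actg", "tgac")


def complement(strand: str) -> str:
    """Return the complementary character for each position of `strand`."""
    return strand.translate(COMPLEMENT)


first_strand = "gattaca"                   # illustrative sample, not real data
second_strand = complement(first_strand)
print(second_strand)                       # ctaatgt
```

Because the second strand is fully determined by the first, only one strand needs to be stored; the other can be derived on demand.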
A typical laboratory test of a DNA sample takes hours to days and generates over 100 million test fragments, each of them short, usually between 36 and 500 letters long. Although each test fragment is a single strand, it must be matched against both strands because the strand from which it originates is unknown. Typically, the test fragments are matched individually against selected chromosomes, a process that may itself take many hours or days.
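Matching a fragment against both strands can be sketched as follows, using exact matching only. The sketch assumes the usual convention that the two strands are read in opposite directions, so a fragment originating from the second strand appears in the stored reference as its reverse complement. The names `reverse_complement` and `find_on_both_strands`, and the "+"/"-" strand labels, are hypothetical choices for illustration.

```python
_COMPLEMENT = str.maketrans("actg", "tgac")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the result."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_on_both_strands(reference: str, fragment: str):
    """Return (position, strand) pairs for every exact occurrence.

    "+" means the fragment matches the stored strand directly;
    "-" means its reverse complement matches, i.e. the fragment
    is assumed to come from the other strand.
    """
    hits = []
    for strand, query in (("+", fragment), ("-", reverse_complement(fragment))):
        start = reference.find(query)
        while start != -1:
            hits.append((start, strand))
            start = reference.find(query, start + 1)
    return hits


reference = "aaacgtgggg"                          # toy reference, not real data
print(find_on_both_strands(reference, "acgtg"))   # [(2, '+')]
print(find_on_both_strands(reference, "ccac"))    # [(4, '-')]
```

Searching for the reverse complement of the fragment avoids storing the second strand at all, at the cost of two searches per fragment.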
Matching between the fragments and the chromosomes typically allows for 1 or 2 errors per match, such as mismatched letters or surplus characters in the fragment, to account for experimental error. Some algorithms that perform fragment matching while allowing for errors are known as BLAST (Basic Local Alignment Search Tool) algorithms. BLAST algorithms rely on in-memory analysis and are computationally intensive. Thus, they are generally inadequate for the large numbers of fragments that are generated for matching.
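To make the idea of error-tolerant matching concrete, the following is a minimal sketch of a naive sliding-window scan that accepts a bounded number of mismatched letters. It is not BLAST and does not handle surplus characters (insertions); the function name `find_with_mismatches` and the parameter `max_mismatches` are assumptions introduced for illustration.

```python
def find_with_mismatches(reference: str, fragment: str, max_mismatches: int = 2):
    """Return (position, mismatch_count) for every window of `reference`
    that differs from `fragment` in at most `max_mismatches` letters.

    Only substitutions are counted; surplus or missing characters
    are not handled by this sketch.
    """
    hits = []
    m = len(fragment)
    for start in range(len(reference) - m + 1):
        mismatches = 0
        for ref_char, frag_char in zip(reference[start:start + m], fragment):
            if ref_char != frag_char:
                mismatches += 1
                if mismatches > max_mismatches:
                    break          # too many errors, abandon this window
        else:
            # Loop finished without exceeding the error budget.
            hits.append((start, mismatches))
    return hits


reference = "acgtacgtac"           # toy reference, not real data
print(find_with_mismatches(reference, "acga", max_mismatches=1))
# [(0, 1), (4, 1)]
```

This scan compares the fragment against every window of the reference, so its cost grows with the product of the reference length and the fragment length. Repeating it for over 100 million fragments against sequences of billions of characters illustrates why error-tolerant matching at this scale is computationally demanding.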