Proteins and nucleic acids are biological macromolecules that are abundant in living organisms. Nucleic acids, which include DNA (deoxyribonucleic acid) and RNA (ribonucleic acid), function in encoding, transmitting, and expressing genetic information. For instance, DNA encodes the information used to build proteins.
It is often desirable to perform local sequence alignment, in which similar regions between two nucleotide sequences or two protein sequences are identified. Nucleotide or protein sequence alignment can involve solving an approximate string alignment problem for a given cost matrix. Given a database sequence, a query sequence, and a cost function that models biological similarity between sequences, sequence alignment can be performed to find a substring of the database sequence that best matches the query sequence under that cost function.
The Smith-Waterman algorithm is a dynamic programming algorithm for performing local sequence alignment. The algorithm finds an optimal match under the given cost function, but it is inherently sequential. Its runtime cost is proportional to the product of the database sequence length and the query sequence length. This cost often makes the Smith-Waterman algorithm impractical as the database sequence length increases (e.g., for large genomes). Accordingly, various heuristic-based approaches that attempt to find approximate matches have been developed. These conventional heuristic-based approaches, however, are commonly less accurate (e.g., they can miss matches).
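To make the quadratic cost concrete, the following is a minimal sketch of a Smith-Waterman style local alignment in Python. The function name `smith_waterman`, the scoring values (`match=2`, `mismatch=-1`) and the linear gap penalty (`gap=-2`) are illustrative assumptions and are not taken from the text above; a practical implementation would typically use a substitution matrix and may use a different gap model.

```python
def smith_waterman(database, query, match=2, mismatch=-1, gap=-2):
    """Return (score, database_substring, query_substring) of a best local alignment.

    Scoring parameters are hypothetical defaults chosen for illustration.
    """
    rows, cols = len(query) + 1, len(database) + 1
    # score[i][j] is the best score of a local alignment ending at
    # query[i-1] and database[j-1]; row 0 and column 0 stay at zero.
    score = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the table: one cell per (query, database) position pair,
    # hence work proportional to len(query) * len(database).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if query[i - 1] == database[j - 1] else mismatch
            cell = max(
                0,                            # start a new local alignment
                score[i - 1][j - 1] + sub,    # align the two characters
                score[i - 1][j] + gap,        # gap in the database sequence
                score[i][j - 1] + gap,        # gap in the query sequence
            )
            score[i][j] = cell
            if cell > best:
                best, best_pos = cell, (i, j)

    # Trace back from the best cell until a zero cell is reached.
    i, j = best_pos
    end_i, end_j = i, j
    while i > 0 and j > 0 and score[i][j] > 0:
        sub = match if query[i - 1] == database[j - 1] else mismatch
        if score[i][j] == score[i - 1][j - 1] + sub:
            i, j = i - 1, j - 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    return best, database[j:end_j], query[i:end_i]


if __name__ == "__main__":
    print(smith_waterman("TTTACGTACGTTT", "ACGTACG"))
```

Each cell depends on its upper, left, and upper-left neighbors, which is the data dependency behind the sequential character of the algorithm, and the nested loops visit every pair of positions, which is the source of the cost that grows with the product of the two sequence lengths.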