The rapidly increasing amounts of genetic sequence information available represent a constant challenge to developers of hardware and software for database searching and handling. The size of the GenBank/EMBL/DDBJ nucleotide database is now doubling at least every 15 months (Benson et al., 2000). The rapid expansion of the genetic sequence information is probably exceeding the growth in computing power available at a constant cost, in spite of the fact that computing resources have also been increasing exponentially for many years. If this trend continues, increasingly longer times or increasingly more expensive computers will be needed to search the entire database.
Searching databases for sequences similar to a given sequence is one of the most fundamental and important tools for predicting structural and functional properties of uncharacterised proteins. The availability of good tools for performing these searches is hence important. When looking for sequences in a database similar to a given query sequence, the search programs compute an alignment score for every sequence in the database. This score represents the degree of similarity between the query and the database sequence. The score is calculated from the alignment of the two sequences, and is based on a substitution score matrix and a gap penalty function. A dynamic programming algorithm for computing the optimal local alignment score was first described by Smith and Waterman (1981), improved by Gotoh (1982) for affine gap penalty functions, and optimised by Green (1993).
Database searches using the optimal algorithm are unfortunately quite slow on ordinary computers, so many heuristic alternatives have been developed, such as FASTA (Pearson and Lipman, 1988) and BLAST (Altschul et al., 1990; Altschul et al., 1997). These methods have reduced the running time by a factor of up to 40 compared to the best-known Smith-Waterman implementation on non-parallel general-purpose computers, but at the expense of sensitivity. Because of this loss of sensitivity, some distantly related sequences might not be detected in a search using the heuristic algorithms.
Due to the demand for both fast and sensitive searches, much effort has been made to produce fast implementations of the Smith-Waterman method. Several special-purpose hardware solutions have been developed with parallel processing capabilities (Hughey, 1996), such as Paracel's GeneMatcher, Compugen's Bioccelerator and TimeLogic's DeCypher. These machines are able to process more than 2 000 million matrix cells per second, and can be expanded to reach much higher speeds. However, such machines are very expensive and cannot readily be exploited by ordinary users. Some hardware implementations of the Smith-Waterman algorithm are described in patent publications, for instance U.S. Pat. Nos. 5,553,272, 5,632,041, 5,706,498, 5,964,860 and 6,112,288.
A more general form of parallel processing capability is available using Single-Instruction Multiple-Data (SIMD) technology. A SIMD computer is able to perform the same operation (logical, arithmetic or other) on several independent data sources in parallel. It is possible to exploit this on an ordinary processor by dividing wide registers into smaller units, a form of micro parallelism also known as SIMD within a register (SWAR). Modern microprocessors have, however, added special registers and instructions to make the SIMD technology easier to use. With the introduction of the Pentium MMX (MultiMedia eXtensions) microprocessor in 1997, Intel made computing with SIMD technology available in a general-purpose microprocessor in the most widely used computer architecture, the industry standard PC. The technology is also available in the Pentium II and has been extended in the Pentium III under the name of SSE (Streaming SIMD Extensions) (Intel, 1999). Further extension of this technology has been announced for the Pentium 4 processor (also known as Willamette) under the name SSE2 (Streaming SIMD Extensions 2) (Intel, 2000). The MMX/SSE/SSE2 instruction sets include arithmetic (add, subtract, multiply, min, max, average, compare), logical (and, or, xor, not) and other instructions (shift, pack, unpack) that may operate on integer or floating-point numbers. This technology is primarily designed for speeding up digital signal processing applications like sound, images and video, but seems suitable also for genetic sequence comparisons. Several other microprocessors with SIMD technology are or will be made available in the near future, as shown in Table 1 (Dubey, 1998).
TABLE 1. Examples of microprocessors with SIMD technology

Manufacturer          Microprocessor    Name of technology
AMD                   K6/K6-2/K6-III    MMX/3DNow!
AMD                   Athlon/Duron      Extended MMX/3DNow!
Compaq (Digital)      Alpha             MVI (Motion Video Instruction)
Hewlett Packard (HP)  PA-RISC           MAX(−2) (Multimedia Acceleration eXtensions)
HP/Intel              Itanium (Merced)  SSE (Streaming SIMD Extensions)?
Intel                 Pentium MMX/II    MMX (MultiMedia eXtensions)
Intel                 Pentium III       SSE (Streaming SIMD Extensions)
Intel                 Pentium 4         SSE2 (Streaming SIMD Extensions 2)
Motorola              PowerPC G4        Velocity Engine (AltiVec)
SGI                   MIPS              MDMX (MIPS Digital Media eXtensions)
Sun                   SPARC             VIS (Visual Instruction Set)
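The idea of SIMD within a register can be illustrated without any special instructions. The following is a minimal sketch, assuming four unsigned 16-bit lanes packed into one 64-bit word; the names `pack`, `unpack` and `swar_add` are hypothetical and chosen only for this illustration. It adds all four lanes with a single word-wide addition while preventing a carry from one lane spilling into its neighbour.

```python
LANE_BITS = 16
LANES = 4
LANE_MASK = (1 << LANE_BITS) - 1
HIGH_BITS = 0x8000800080008000  # top bit of every 16-bit lane
LOW_BITS = 0x7FFF7FFF7FFF7FFF   # remaining 15 bits of every lane


def pack(values):
    """Pack four unsigned 16-bit values into one 64-bit word (lane 0 lowest)."""
    word = 0
    for i, v in enumerate(values):
        word |= (v & LANE_MASK) << (LANE_BITS * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its four 16-bit lanes."""
    return [(word >> (LANE_BITS * i)) & LANE_MASK for i in range(LANES)]


def swar_add(x, y):
    """Lane-wise addition modulo 2**16 using one word-wide add.

    The low 15 bits of each lane are added first, so any carry stops at
    the top bit of its own lane. The top bits are then combined with XOR,
    which discards the carry out of each lane.
    """
    low_sum = (x & LOW_BITS) + (y & LOW_BITS)
    return low_sum ^ ((x ^ y) & HIGH_BITS)


a = pack([1, 2, 3, 4])
b = pack([10, 20, 30, 40])
print(unpack(swar_add(a, b)))
```

Dedicated SIMD instruction sets such as MMX provide packed additions of this kind directly in hardware, so the masking shown here is only needed when the lanes are emulated inside an ordinary register.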
Several investigators have used SIMD technology to speed up the Smith-Waterman algorithm, but the increase in speed relative to the best non-parallel implementations has been limited.
The general dynamic programming algorithm for optimal local alignment score computation was initially described by Smith and Waterman (1981).
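As an illustration of the general dynamic programming formulation, the following is a minimal sketch that computes the optimal local alignment score for an arbitrary gap penalty function. The function names, the simple match/mismatch substitution scores and the example gap function are assumptions made for this illustration and are not taken from the cited work. Because every cell examines all possible gap lengths, the running time grows faster than the product of the two sequence lengths.

```python
def match_mismatch(x, y, match=2, mismatch=-1):
    """Toy substitution score (assumed here in place of a real score matrix)."""
    return match if x == y else mismatch


def sw_score_general(a, b, subst, gap):
    """Optimal local alignment score with an arbitrary gap penalty gap(k)."""
    m, n = len(a), len(b)
    # H[i][j] is the best score of an alignment ending at a[i-1], b[j-1].
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Either start afresh (0) or extend along the diagonal.
            h = max(0, H[i - 1][j - 1] + subst(a[i - 1], b[j - 1]))
            # Try every gap length k ending in this cell, in both directions.
            for k in range(1, i + 1):
                h = max(h, H[i - k][j] - gap(k))
            for k in range(1, j + 1):
                h = max(h, H[i][j - k] - gap(k))
            H[i][j] = h
            best = max(best, h)
    return best


# Example gap function of the form q + r*k with q = 3 and r = 1.
print(sw_score_general("ACGT", "AGT", match_mismatch, lambda k: 3 + k))
```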
Gotoh (1982) described an implementation of this algorithm with affine gap penalties, where the gap penalty for a gap of size k is equal to q+rk, where q is the gap open penalty and r is the gap extension penalty. Under these restrictions the running time of the algorithm was reduced to be proportional to the product of the lengths of the two sequences.
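The effect of restricting the gap penalty to the form q+rk can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of any published implementation: the function names and the toy match/mismatch scores are hypothetical. Two auxiliary values per cell track the best score of an alignment ending in a gap, so each cell is computed in constant time and the total work is proportional to the product of the sequence lengths.

```python
NEG_INF = float("-inf")


def match_mismatch(x, y, match=2, mismatch=-1):
    """Toy substitution score (assumed here in place of a real score matrix)."""
    return match if x == y else mismatch


def sw_score_affine(a, b, subst, q, r):
    """Optimal local alignment score where a gap of size k costs q + r*k."""
    n = len(b)
    h_prev = [0] * (n + 1)   # scores of the previous row
    f = [NEG_INF] * (n + 1)  # best score ending in a gap that runs down column j
    best = 0
    for x in a:
        h_cur = [0] * (n + 1)
        e = NEG_INF          # best score ending in a gap that runs along this row
        for j in range(1, n + 1):
            # Extend an existing gap (cost r) or open a new one (cost q + r).
            e = max(e - r, h_cur[j - 1] - q - r)
            f[j] = max(f[j] - r, h_prev[j] - q - r)
            h = max(0, h_prev[j - 1] + subst(x, b[j - 1]), e, f[j])
            h_cur[j] = h
            best = max(best, h)
        h_prev = h_cur
    return best


print(sw_score_affine("AAAACCCC", "AAAATCCCC", match_mismatch, q=1, r=1))
```

Only one row of scores is kept at a time, which is sufficient when the score alone, and not the alignment itself, is required, as in a database search.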
Green (1993) wrote the SWAT program and applied some optimisations to the algorithm of Gotoh to achieve a speed-up of a factor of about two relative to a straightforward implementation. The SWAT-optimisations have also been incorporated into the SSEARCH program of Pearson (1991).
The Smith-Waterman algorithm has been implemented for several different SIMD computers. Sturrock and Collins (1993) implemented the Smith-Waterman algorithm for the MasPar family of parallel computers, in a program called MPsrch. This solution achieved a speed of up to 130 million matrix cells per second on a MasPar MP-1 computer with 4096 CPUs and up to 1 500 million matrix cells per second on a MasPar MP-2 with 16384 CPUs. Brutlag et al. (1993) also implemented the Smith-Waterman algorithm on the MasPar computers in a program called BLAZE.
Alpern et al. (1995) presented several ways to speed up the Smith-Waterman algorithm including a parallel implementation utilising micro parallelism by dividing the 64-bit wide Z-buffer registers of the Intel Paragon i860 processors into 4 parts. With this approach they could compare the query sequence with four different database sequences simultaneously. They achieved more than a fivefold speedup over a conventional implementation.
Wozniak (1997) presented a way to implement the Smith-Waterman algorithm using the VIS (Visual Instruction Set) technology of Sun UltraSPARC microprocessors. This implementation reached a speed of over 18 million matrix cells per second on a 167 MHz UltraSPARC microprocessor. According to Wozniak (1997), this represents a speedup of a factor of about 2 relative to the same algorithm implemented with integer instructions on the same machine.
Taylor (1998 and 1999) applied the MMX technology to the Smith-Waterman algorithm and achieved a speed of 6.6 million cell updates per second on an Intel Pentium III 500 MHz microprocessor.
Sturrock and Collins (2000) have implemented the Smith-Waterman algorithm using SIMD on Alpha microprocessors. However, no details of their method have been published. They have achieved a speed of about 53 million cell updates per second using affine gap penalties. It is unknown exactly what computer this system is running on.
Recently, Barton et al. (2000) employed MMX technology to speed up their SCANPS implementation of the Smith-Waterman algorithm. They claim a speed of 71 million cell updates per second on an Intel Pentium III 650 MHz microprocessor. Only a poster abstract without any details of their implementation is currently available.
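Since the speeds quoted above were measured at different clock frequencies, one rough way to compare them is to divide the clock frequency by the reported speed, giving an approximate number of clock cycles spent per matrix cell. The following sketch applies this arithmetic to the three reports for which both figures are stated above; the helper name `cycles_per_cell` is hypothetical, and the result is only an indicative normalisation, since it ignores differences in processor architecture, gap model and test data.

```python
def cycles_per_cell(clock_mhz, million_cells_per_second):
    """Approximate clock cycles per matrix cell: MHz divided by Mcells/s."""
    return clock_mhz / million_cells_per_second


# (clock frequency in MHz, reported speed in million cells per second),
# taken from the figures quoted in the text above.
reported = {
    "Wozniak (1997), UltraSPARC": (167, 18),
    "Taylor (1998 and 1999), Pentium III": (500, 6.6),
    "Barton et al. (2000), Pentium III": (650, 71),
}

for name, (mhz, mcells) in reported.items():
    print(f"{name}: about {cycles_per_cell(mhz, mcells):.1f} cycles per cell")
```

Because Wozniak's speed is given as "over" 18 million cells per second, the corresponding value is an upper bound on the cycles per cell.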