Next-generation sequencing is maturing into a reliable diagnostic tool for widespread use. Sequencing technologies are becoming faster and cheaper, and the workflows have become better defined. Within the next few years, thousands to millions of human genomes will be completely sequenced, and there will be an urgent need to analyze them.
On the laboratory side, this development is made possible by the dramatically improving throughput of sequencing machines, which produce sequencing data at ever higher rates and ever lower cost: in recent years, the cost per sequenced base pair has kept halving in periods of less than 6 months.
This progress is much faster than Moore's law for the cost of computing power, which posits halving intervals of around 18 months. Over three years, for example, the sequencing cost per base pair falls by a factor of at least 64 (six halvings), while the cost of computing falls only by a factor of about 4 (two halvings). This has brought into focus that computation could become a severe bottleneck. The computationally most expensive and data-intensive part of sequencing is aligning short, imperfect reads (pieces of the genome of length ≈100 base pairs) to a reference genome (≈3×10^9 base pairs), i.e., given a read, finding where it best fits the reference genome and how it can be aligned by performing a small number of edits that account for reading errors and mutations (finding and scoring gaps in the alignment). This is a challenge because no a priori information about the correct position is available. Moreover, the computations cannot fully profit from Moore's law unless they exploit parallel processing and the memory hierarchy. In particular, sophisticated index data structures such as suffix arrays and suffix trees are difficult to construct in parallel, and querying them incurs many cache faults. Originally, it was suggested to apply experience with parallel and memory-hierarchy-aware implementations of such data structures to this particular problem. A closer analysis of the problem showed, however, that much simpler techniques can also yield much faster alignment.
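To make the alignment problem itself concrete, the following is a minimal sketch of a brute-force baseline, not the technique this text goes on to propose. It assumes unit costs for substitutions and gaps, and the function name `best_alignment` and the toy sequences are purely illustrative. It fills a dynamic-programming table in which the read may start at any reference position for free, then reports the smallest number of edits and where the best alignment ends.

```python
from typing import Tuple


def best_alignment(read: str, reference: str) -> Tuple[int, int]:
    """Return (edits, end) for the best placement of `read` in `reference`.

    `edits` is the minimum number of substitutions and single-character gaps;
    `end` is the 0-based reference index one past the last aligned character.
    Unit edit costs are an assumption made for illustration only.
    """
    # prev[j]: fewest edits aligning the read prefix seen so far to some
    # reference substring ending at position j. The empty read prefix
    # matches anywhere at cost 0, so the read may start at any position.
    prev = [0] * (len(reference) + 1)
    for i, r in enumerate(read, start=1):
        # Aligning i read characters against nothing costs i gaps.
        curr = [i] + [0] * len(reference)
        for j, g in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (r != g),  # match (0) or substitution (1)
                prev[j] + 1,             # read character left unmatched
                curr[j - 1] + 1,         # reference character skipped
            )
        prev = curr
    # Best end position: the leftmost minimum of the final row.
    end = min(range(len(prev)), key=prev.__getitem__)
    return prev[end], end


if __name__ == "__main__":
    reference = "GATTACAGGCTTAACG"
    read = "CAGGGTT"  # "CAGGCTT" from the reference with one substitution
    print(best_alignment(read, reference))  # -> (1, 12)
```

The running time is proportional to the read length times the reference length. For a 100-base-pair read and a reference of about 3×10^9 base pairs that is on the order of 3×10^11 table cells per read, which is why practical aligners rely on an index or a filtering step rather than a full scan.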