In the treatment of cancers, lifestyle-related diseases, genetic diseases, and the like, so-called individualized medicine requires examining the genetic background of each patient in order to select a suitable treatment or predict the prognosis. For this purpose, DNA (deoxyribonucleic acid) sequences, such as the genome or transcriptome, are analyzed. A DNA sequencing device used for such analysis can obtain only short, fragmented DNA sequences. It is therefore necessary to perform data processing that determines from which part of the genome each obtained fragment derives by comparing it with a very long reference genome sequence, and that further determines the presence or absence of variations contained in the fragment, such as SNPs (single nucleotide polymorphisms), insertions, or deletions. Such data processing is typically called a mapping process.
When a massively parallel DNA sequencer, called a next-generation DNA sequencer, is used, hundreds of millions of fragmented sequences (read sequences), each having a relatively short length of about 100 bases, are obtained in a single measurement. When the test subject is a human, the reference genome sequence is as long as about 3 gigabases (3 billion bases). In a mapping process, the read sequences are compared with the reference genome sequence one by one to identify the corresponding positions and the variations contained therein. As such a process incurs a huge computational cost, dedicated, efficient algorithms have been developed and used. A representative method includes creating a database of the reference genome sequence using the BWT (Burrows-Wheeler Transform) (see Non Patent Literature 1), performing a search using a short base sequence in a read sequence as a search key, and performing alignment in the region around (i.e., preceding and following) the matched region while taking into consideration possible sequencing errors and variations (see Non Patent Literature 2).
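The search step of the representative method described above can be illustrated with the following minimal sketch. It is a toy example under stated assumptions and not the implementation of the cited literature: the function names build_fm_index and backward_search are hypothetical, the suffix array is built by naive sorting (practical only for very short strings), and the subsequent alignment around the matched region is omitted.

```python
from collections import Counter


def build_fm_index(reference):
    """Build a toy BWT-based index (hypothetical helper, for illustration only)."""
    text = reference + "$"  # "$" is a sentinel that sorts before A, C, G, T
    # Naive suffix array: acceptable for a toy string, far too slow for a genome.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix (wraps to "$" for i == 0).
    bwt = "".join(text[i - 1] for i in sa)
    # c_table[ch] = number of characters in text strictly smaller than ch.
    counts = Counter(text)
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i] = number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (1 if ch == b else 0)
    return sa, c_table, occ


def backward_search(seed, sa, c_table, occ):
    """Return all 0-based reference positions where seed matches exactly."""
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(seed):  # extend the match one base at a time, right to left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


if __name__ == "__main__":
    reference = "ACGTACGTTAGC"  # made-up toy reference, not real genome data
    index = build_fm_index(reference)
    print(backward_search("ACG", *index))
    print(backward_search("TAG", *index))
```

In this sketch the cost of a search depends on the length of the search key and not on the length of the reference, which is the property that makes an index of this kind attractive for a reference of about 3 gigabases.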
Typically, a next-generation DNA sequencer has a read error rate of about 1%. Further, a very long genome contains a number of similar sequences dispersed throughout it. Therefore, a result of a mapping process performed on a per-read-sequence basis may contain errors. For example, there are cases where a given read sequence has no completely matching region in the reference genome sequence, but has a plurality of matching genome regions if a small number of sequencing errors is assumed. In such a case, which region is selected is arbitrary, and the determination depends on the heuristics of the mapping process. Thus, in order to analyze variations accurately, a re-mapping process, which includes comparing the mapping results of a number of read sequences and adopting the majority result, is performed in the subsequent, that is, downstream, process (Non Patent Literature 3). For this reason, when the whole genome is analyzed, an amount of sequence sufficient to cover the whole genome several tens of times over (that is, tens of gigabases or more) is sequenced. In addition, when a mapping destination is selected arbitrarily, there is a concern that a bias that depends on the mapping process may occur. Thus, results obtained by a plurality of mapping tools are compared with one another to check for the presence of such a bias. Patent Literature 1 is given as an example of patent literature related to such techniques.
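The ambiguity described above, in which a read sequence has no exact match but has several candidate regions once a small number of sequencing errors is allowed, can be illustrated with the following minimal sketch. The function names, the toy reference, and the toy read are all assumptions introduced for illustration; a brute-force scan with substitution-only (Hamming) comparison is used here in place of the index-based search and alignment of an actual mapping tool.

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))


def candidate_positions(read, reference, max_mismatches):
    """List (position, mismatches) for every window within the mismatch budget."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        d = hamming(read, reference[pos:pos + len(read)])
        if d <= max_mismatches:
            hits.append((pos, d))
    return hits


if __name__ == "__main__":
    # Toy reference containing two similar regions (positions 2 and 14).
    reference = "GGACGTTGCACCTTACGTAGCAGG"
    read = "ACGTCGCA"  # differs by one base from each of the two regions

    print(candidate_positions(read, reference, 0))  # no exact match
    print(candidate_positions(read, reference, 1))  # two equally good candidates
```

With no mismatches allowed the read cannot be placed at all, while with one mismatch allowed it fits two regions equally well. A per-read mapping process must then choose between them by some heuristic, which is why the downstream comparison of many overlapping reads is needed.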