In recent years, a massively parallel DNA sequencer based on a principle completely different from that of the conventional capillary DNA sequencer has appeared (Non Patent Literature 1). The massively parallel DNA sequencer can read a large number of sequences, on the order of tens of millions, at one time. However, it has the disadvantage that the read sequence length is relatively short, at tens of nucleotides. To compensate for this disadvantage, a paired-end method is employed (Non Patent Literature 3). The paired-end method sequences tens of nucleotides from both ends of a large number of genome sequence fragments whose lengths are controlled to approximately a certain value (from hundreds to thousands of nucleotides). This yields pairs of nucleotide sequences (paired-end sequences) that lie approximately a certain interval apart from each other on the genome.
A genome mapping computation is performed, with respect to a reference genome sequence, on paired-end sequence data acquired by a sequencer analyzing a genome DNA sample. That is, a computation determines at which positions on the reference genome sequence the sequences acquired by sequencing appear, and it is checked whether the distance between the mapping positions of the paired sequences on the reference genome matches the expected separation. This allows structural variations between the sample genome and the reference genome, such as insertions and deletions, to be detected. Specifically, if the distance between the mapping positions is larger than the expected separation, it is considered that a deletion has occurred between the pair of sequences on the sample genome side. Conversely, if the distance between the mapping positions is smaller than the expected separation, it is considered that an insertion has occurred between the pair of sequences on the sample genome side (Non Patent Literature 3).
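The comparison of the observed separation with the expected separation can be illustrated by the following minimal sketch. The function name classify_pair and the tolerance parameter are assumptions introduced here for illustration only; the tolerance stands in for the fact that fragment lengths are only approximately controlled.

```python
def classify_pair(pos_a, pos_b, expected, tolerance):
    """Classify a mapped pair by comparing its separation with the expected one.

    pos_a, pos_b: mapping positions of the two reads on the reference genome.
    expected:     expected separation between the paired sequences.
    tolerance:    allowed deviation (hypothetical parameter, not from the source).
    """
    observed = abs(pos_b - pos_a)
    if observed > expected + tolerance:
        # The reference holds extra sequence between the reads,
        # so the sample genome is considered to carry a deletion.
        return "deletion"
    if observed < expected - tolerance:
        # The reference holds less sequence between the reads,
        # so the sample genome is considered to carry an insertion.
        return "insertion"
    return "concordant"
```

A pair mapped 1,500 nucleotides apart with an expected separation of 500 would be reported as a deletion candidate, whereas a pair mapped 200 nucleotides apart would be reported as an insertion candidate.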
If the sequence length is short, the mapping positions may not be uniquely determined in the genome mapping computation, and many positions may be listed as candidates. Even in such a case, using the separation between the mapping positions of the paired sequences as a constraint, that is, requiring that the two mapping positions be in close proximity to each other, can be expected to determine the mapping positions of the pair uniquely or to narrow the candidates down to a small number. For paired-end sequence data acquired by a sequencer analyzing a transcription product sample, an intron may intervene between the mapping positions of the paired sequences, so that the separation between the mapping positions becomes longer by the intron length. Even in this case, the constraint that the separation between the mapping positions does not exceed the length of the gene region on the genome can be used in the mapping computation for the paired sequences.
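The narrowing effect of the proximity constraint can be illustrated as follows. This is a sketch under assumptions: the function name consistent_pairs and the single max_separation bound are introduced here for illustration, and strand orientation is ignored. For a transcription product sample, max_separation would be set to the length of the gene region rather than to the fragment length.

```python
from itertools import product


def consistent_pairs(cands_a, cands_b, max_separation):
    """Keep only candidate position pairs that lie close to each other.

    cands_a, cands_b: candidate mapping positions of the two paired reads.
    max_separation:   largest separation accepted for a valid pair
                      (hypothetical parameter name).
    """
    return [
        (a, b)
        for a, b in product(cands_a, cands_b)
        if abs(b - a) <= max_separation
    ]


# Each read alone is ambiguous (three and two candidates respectively),
# but only one combination satisfies the proximity constraint.
pairs = consistent_pairs([100, 50_000, 900_000], [450, 700_000], 1_000)
```

Here neither read maps uniquely on its own, yet pairs contains the single combination (100, 450).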
When genome mapping computation of massive sequence data (query sequence data) is performed on a large-scale reference genome sequence, the reference genome sequence data is typically indexed in advance. Use of the index speeds up retrieval of the query sequences. A suffix array (Non Patent Literature 2) can be employed as an indexing method.
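How a suffix array speeds up retrieval can be sketched as follows. The construction shown is a naive one chosen for brevity and is not the construction method of Non Patent Literature 2; the function names are assumptions introduced for illustration. Once the suffixes are sorted, all occurrences of a query occupy one contiguous range of the array, which two binary searches locate.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text in sorted order.

    Naive construction for illustration only; practical indexes of a
    large genome use far more efficient algorithms.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, query):
    """Return every position at which query occurs in text, in ascending order."""
    m = len(query)

    # Binary search for the first suffix whose m-character prefix is >= query.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < query:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Binary search for the first suffix whose m-character prefix is > query.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= query:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


reference = "ACGTACGTTACG"
index = build_suffix_array(reference)
hits = find_occurrences(reference, index, "ACG")
```

The index is built once per reference sequence and then reused for every query, so each lookup costs a number of comparisons logarithmic in the reference length rather than a full scan.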
MAQ is known as software capable of mapping, at high speed and with high accuracy, massive paired sequence data obtained from the massively parallel DNA sequencer by the paired-end method onto the reference genome sequence (Non Patent Literature 6). MAQ creates index information of the massive paired sequence data in advance, and retrieves candidate mapping positions while scanning the reference genome sequence and referring to the index information. To take into account the condition that paired sequences are mapped in close proximity to each other while satisfying a distance constraint, up to the two most recent candidate positions mapped on the plus strand of the genome sequence are held, per query sequence, in a storage region of the computer during the scanning process. At the same time, candidate positions mapped on the minus strand are retrieved, and for each minus-strand position found it is determined whether a pair satisfying the distance constraint can be formed with one of the held plus-strand positions. In this way, combinations of candidate mapping positions of the paired sequences are evaluated efficiently.
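The scanning scheme described above can be sketched as follows. This is a simplified illustration of the idea only and is not MAQ's actual implementation: the function name, the tuple layout of a hit, and the use of one buffer per pair identifier are all assumptions introduced here. Hits are assumed to arrive in order of increasing reference position, as they would during a scan of the reference genome sequence.

```python
from collections import defaultdict, deque


def pair_hits_in_scan(hits, max_distance):
    """Pair minus-strand hits with recently held plus-strand hits.

    hits:         iterable of (position, pair_id, strand) tuples sorted by
                  position, where strand is "+" or "-" (hypothetical layout).
    max_distance: largest plus-to-minus separation accepted for a pair.
    """
    # Hold at most the last two plus-strand candidates per read pair;
    # a deque with maxlen=2 discards the oldest entry automatically.
    recent_plus = defaultdict(lambda: deque(maxlen=2))
    pairs = []
    for position, pair_id, strand in hits:
        if strand == "+":
            recent_plus[pair_id].append(position)
        else:
            # Check the minus-strand hit against each held plus-strand hit.
            for plus_pos in recent_plus[pair_id]:
                if 0 <= position - plus_pos <= max_distance:
                    pairs.append((pair_id, plus_pos, position))
    return pairs


scan = [
    (100, "p1", "+"),
    (5000, "p1", "+"),
    (5300, "p1", "-"),
    (9000, "p2", "-"),
    (9100, "p2", "+"),
]
found = pair_hits_in_scan(scan, 1000)
```

Because only two positions are held per pair, the memory needed during the scan stays bounded regardless of how many candidate positions a read has, at the cost of possibly missing a valid pair whose plus-strand hit has already been displaced from the buffer.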