In the 1990s, genome projects were launched to understand the principles of organisms, to study their diseases, and to study their origins and evolution by analyzing the whole genetic information contained in the DNA and RNA of an organism, namely, the whole genomic base sequence.
A genomic base sequence to be analyzed contains a vast quantity of data per sample. In recent years, a sequence decoding device called a next generation sequencer, which is capable of decoding a genomic base sequence at very high speed and at low cost, has been developed and put into practical use.
The next generation sequencer reads DNA or RNA at high speed by fragmenting the DNA or RNA to be analyzed into very short fragments (called short reads), reading these fragments in parallel, and analyzing each read fragment to determine its base sequence. The determined base sequence information of each fragment is then output as sequence data called a read sequence, for example, data in the FASTQ format. Alternatively, data obtained by aligning (mapping) the read sequences to a known genomic base sequence (hereinafter also referred to as a “reference sequence”), for example, data in the SAM format or the BAM format, is output (see Patent Literature 1, for example).
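As an illustration of what a read sequence in the FASTQ format carries, the following minimal sketch parses FASTQ text into records. It assumes the common four-line record layout (header, bases, separator, quality string) and Phred+33 quality encoding; the names `FastqRecord` and `parse_fastq` are hypothetical and are not taken from any particular sequencer or library.

```python
import io
from typing import Iterator, List, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    """One read: identifier, base sequence, and per-base Phred quality scores."""
    read_id: str
    sequence: str
    qualities: List[int]


def parse_fastq(handle: TextIO, phred_offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from four-line FASTQ text.

    Assumptions (not stated in the source text): each record occupies exactly
    four lines, and qualities are encoded as ASCII characters with offset 33.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of input
        header = header.rstrip("\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        scores = [ord(ch) - phred_offset for ch in quality]
        yield FastqRecord(header[1:], sequence, scores)


if __name__ == "__main__":
    # Two made-up reads, for illustration only.
    sample = "@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!#5\n"
    for record in parse_fastq(io.StringIO(sample)):
        print(record.read_id, record.sequence, record.qualities)
```

Aligned output in the SAM or BAM format additionally records, for each read, where it maps on the reference sequence; that is outside the scope of this sketch.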
Patent Literature 1 discloses a technique that enables high quality alignment through a step of specifying a plurality of high quality read sequences from a plurality of read sequences, a step of extracting a plurality of unique read sequences from the plurality of high quality read sequences, and a step of comparing the plurality of unique read sequences with a reference sequence corresponding to a reference sample.
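The three steps described above can be sketched roughly as follows. This is not the method of Patent Literature 1, whose actual criteria are not given here. The sketch assumes that "high quality" means a mean Phred score at or above a threshold, that "unique" means a distinct base sequence, and that "comparing" can be stood in for by exact substring search, which is far simpler than a real aligner. All function names and the threshold value are hypothetical.

```python
from statistics import mean
from typing import Dict, Iterable, List, Tuple

Read = Tuple[str, List[int]]  # (base sequence, per-base Phred scores)


def select_high_quality(reads: Iterable[Read],
                        min_mean_quality: float = 30.0) -> List[Read]:
    """Step 1 (assumed criterion): keep reads whose mean quality meets the threshold."""
    return [(seq, quals) for seq, quals in reads
            if quals and mean(quals) >= min_mean_quality]


def extract_unique(sequences: Iterable[str]) -> List[str]:
    """Step 2 (assumed criterion): drop duplicate sequences, keeping first-seen order."""
    return list(dict.fromkeys(sequences))


def compare_with_reference(sequences: Iterable[str],
                           reference: str) -> Dict[str, List[int]]:
    """Step 3 (stand-in for alignment): list every exact-match start position."""
    positions: Dict[str, List[int]] = {}
    for seq in sequences:
        hits: List[int] = []
        start = reference.find(seq)
        while start != -1:
            hits.append(start)
            start = reference.find(seq, start + 1)  # allow overlapping matches
        positions[seq] = hits
    return positions


if __name__ == "__main__":
    # Made-up reference and reads, for illustration only.
    reference = "ACGTACGTTTGA"
    reads = [
        ("ACGT", [40, 40, 40, 40]),
        ("ACGT", [38, 38, 38, 38]),  # duplicate sequence
        ("TTGA", [35, 35, 35, 35]),
        ("GGGG", [40, 40, 40, 40]),  # absent from the reference
        ("CGTA", [5, 5, 5, 5]),      # low quality, filtered out
    ]
    good = select_high_quality(reads)
    unique = extract_unique(seq for seq, _ in good)
    print(compare_with_reference(unique, reference))
```

Filtering and deduplicating before the comparison reduces the number of sequences that must be matched against the reference, which is the part of the pipeline that dominates cost in practice.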
The data in the FASTQ format, SAM format, BAM format, or the like of a chromosome sample (hereinafter also referred to collectively as “genome data”) output by the next generation sequencer is used for various analyses, including ChIP-Seq (chromatin immunoprecipitation sequencing) and RNA-Seq.
Meanwhile, visualization techniques that enable the analytical results of ChIP-Seq, RNA-Seq, and the like, as well as genomic base sequences, to be grasped visually have also been developed. Examples of such visualization techniques include viewers such as Integrative Genomics Viewer (Broad Institute, U.S.), Integrated Genome Browser (Affymetrix, U.S.), UCSC Genome Browser (University of California, Santa Cruz, U.S.), and GBrowse.
According to these visualization techniques, it is possible to visually compare commonalities, differences, and the like between the reference sequence and the genomic base sequence reconstructed by assembling a large number of read sequences.