Trends in the IT market have shifted in the order of Google, Facebook, Amazon, cloud computing and ubiquitous computing, and at the same time, the biomedical, bioinformatics and genomics fields have shifted in the order of bio-Google, systems biology, personalized medicine and precision medicine. In particular, in the post-Human Genome Project era, next-generation sequencing technology has developed rapidly, and active efforts have been made to realize individualized (personalized) medicine.
Currently, next-generation sequencing technology is known to take about one week to sequence (decode) and analyze the entire genome of a person (at 30x coverage). In addition, it has been reported that about 100,000 next-generation sequencers have been supplied worldwide, and that a significant amount of money has been invested in major companies that have developed third-generation sequencers (Ion Torrent: 2.5 generation; Pacific Biosciences: third generation).
In addition, this field is the fastest-advancing field among all industries in the world. As this trend progresses, the cost of sequencing and analyzing the entire genome of a person is expected to fall below approximately $1,000 within the next two to three years. The most useful and immediately practicable technologies based on the above next-generation technologies are clinical genomics, pharmacogenomics and translational medicine. In addition, clinical genomics has recently been applied to medical genomics, and medical genomics, along with patient stratification technologies, has created a new discipline and a new term, Precision Medicine, which was mentioned by U.S. President Obama.
As described above, information on genetic variation is increasing every year, and according to the present invention, the scope of accurate analysis will continue to expand as the body of verified data grows.
Meanwhile, the applicant has continued to develop technology in order to improve the technical requirements of the above-mentioned genetic analysis field.
As a result of these efforts, the applicant has developed methods for precision medicine, clinical information, and proteome and genome information related to bio-big data, as well as for the construction of analysis systems that increase the analysis speed thereof. In particular, the applicant developed a GPU (graphics processing unit)-based analysis system for improving analysis speed (Korean Patent No. 10-0996443), and developed information searching methods based on characteristic files of an RVR (records virtual rack) analysis tool, which is a technique for increasing data comparison speed (Korean Patent Nos. 10-0880531, 10-1035959 and 10-1117603).
In addition, the applicant applied RVR and GPU (graphics processing unit) technology to proteomes (Korean Patent No. 10-1400717), and developed allele depth-based ADISCAN analysis tools for efficient variant calling and for determining the level of rare variation between a control and an individual genome (Korean Patent Nos. 10-1460520, 10-1542529 and 10-2014-0020738).
In addition, the applicant developed methods for construction of an integrated genome DB for efficiently managing genome information, identification of mutations for disease causes, and genotype calculation for patient stratification (Korean Patent Nos. 10-2015-0187554, 10-2015-0187556 and 10-2015-0187559), and a method for computing human haplotyping from genome information (Korean Patent Application No. 10-2016-0096996).
In addition, using middleware specialized for the storage of big data such as an integrated genetic DB, MAHA supercomputing systems were developed, which enable bulk data from thousands of genomes to be analyzed simultaneously in a parallel distributed environment developed by the Electronics and Telecommunications Research Institute (ETRI) (Korean Patent Nos. 10-1460520, 10-1010219, 10-0956637, 10-093623, 10-2013-0005685, 10-2012-0146892 and 10-2013-0004519).
Using the MAHA system provided by the Electronics and Telecommunications Research Institute, the applicant has developed the first domestic supercomputing system that has an environment optimized for using bio-big data in clinical applications and that is integrated with an integrated genome analysis system for implementing precision medicine.
In particular, although MAHA-Fs (a storage system for ultrahigh speed I/O for bulk data such as genome) was tailored to a common cloud computing environment, the applicant has developed MAHA-FsDx, which can be used for diagnosis in a clinical environment, that is, a hospital, by clearly defining reproducibility, precision and system limitations. In addition, Prior Art Documents (001) to (019) summarize the technical elements for a personal genome map-based personalized medical analysis platform.
SNPs (single nucleotide polymorphisms), which account for more than 0.1% of the human genome sequence, have been studied for their correlation with human phenotypic variations. Accordingly, various platforms for performing haplotyping in an accurate and rapid manner have been studied.
Here, the haplotyping may also be performed on the entire human genome, but currently, it is generally performed on specific SNP regions for promptness and accuracy of typing.
This is because the accuracy of haplotyping results increases as more human genome references are secured, but so far, enough references to ensure reliability have been secured only for specific SNP regions.
The haplotyping may be performed on various SNP regions, but in recent years, it has been most actively used in the field of HLA typing for human leukocyte antigen genes.
Meanwhile, a general haplotyping process is schematically shown in FIG. 1. As shown therein, a BAM file is generated from a DNA sample to be analyzed, and a specific region to be analyzed is extracted therefrom, thereby generating a Fastq format file.
Next, the Fastq format file is compared with haplotype allele references stored in a database, and the genotype of the DNA to be analyzed is read.
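The comparison step described above can be illustrated with a minimal Python sketch. It is not the method of any cited document. The allele names and sequences, the helper names (parse_fastq, tile_reads, call_genotype), exact substring matching with error-free reads, and the rule of preferring a homozygous call on ties are all assumptions made for illustration. The extraction of a region from a BAM file is not shown; the sketch starts from Fastq text.

```python
from itertools import combinations_with_replacement

# Hypothetical toy allele references (not real HLA sequences).
ALLELE_REFS = {
    "A*01": "ACGTACGTTAGCCGATAGGCTA",
    "A*02": "ACGTACCTTAGCCGATAGGCTA",
    "A*03": "ACGTACGTTAGCCGTTAGGCTA",
}


def parse_fastq(text):
    """Return the sequence line of every 4-line Fastq record."""
    lines = text.strip().splitlines()
    return [lines[i].strip() for i in range(1, len(lines), 4)]


def tile_reads(seq, read_len):
    """Simulate error-free reads: every window of read_len over seq."""
    return [seq[i:i + read_len] for i in range(len(seq) - read_len + 1)]


def call_genotype(reads, refs):
    """Pick the allele pair that explains the most reads.

    A read is 'explained' if it is an exact substring of either allele.
    Ties are broken in favor of a homozygous pair (an assumed heuristic).
    """
    def key(pair):
        a, b = pair
        explained = sum(1 for r in reads if r in refs[a] or r in refs[b])
        return (explained, a == b)

    return max(combinations_with_replacement(sorted(refs), 2), key=key)


if __name__ == "__main__":
    # Simulated sample carrying the toy alleles A*01 and A*02.
    sample = tile_reads(ALLELE_REFS["A*01"], 12) + tile_reads(ALLELE_REFS["A*02"], 12)
    fastq = "".join(f"@r{i}\n{r}\n+\n{'I' * len(r)}\n" for i, r in enumerate(sample))
    print(call_genotype(parse_fastq(fastq), ALLELE_REFS))
```

A practical typing tool would additionally have to tolerate sequencing errors, use base qualities, and search a far larger allele database; the sketch only shows the shape of the "compare reads with references, then read the genotype" step.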
This haplotyping technology is also applied to HLA typing, in which the specific region is limited to the HLA genes.
Methods and technologies that have recently been studied for HLA typing are disclosed below.
However, the prior art as described above has the following problems.
That is, haplotyping according to the prior art has the problem that accurate test results are difficult to obtain because of the high polymorphism, linkage disequilibrium and sequence similarity of human genes.
In order to overcome this problem occurring when the prior art is used, the length of sequence reads should be increased. In this case, however, there are problems in that the analysis time increases and the analysis process becomes complicated, resulting in a decrease in analysis efficiency.
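The relationship between read length and typing ambiguity can be illustrated with a small Python sketch. The four allele sequences below are hypothetical toy sequences that differ only at two sites (index 6 and index 14); the function names and the assumption of error-free reads covering every position are likewise illustrative and are not taken from any cited document.

```python
from itertools import combinations_with_replacement

# Hypothetical toy alleles: every combination of two variant sites.
ALLELES = {
    "A*01": "ACGTACGTTAGCCGATAGGCTA",  # G at index 6, A at index 14
    "A*02": "ACGTACCTTAGCCGATAGGCTA",  # C at index 6, A at index 14
    "A*03": "ACGTACGTTAGCCGTTAGGCTA",  # G at index 6, T at index 14
    "A*04": "ACGTACCTTAGCCGTTAGGCTA",  # C at index 6, T at index 14
}


def read_set(pair, read_len):
    """All distinct error-free reads of read_len obtainable from an allele pair."""
    return {
        ALLELES[name][i:i + read_len]
        for name in pair
        for i in range(len(ALLELES[name]) - read_len + 1)
    }


def indistinguishable_pairs(true_pair, read_len):
    """Allele pairs that produce exactly the same read set as true_pair."""
    target = read_set(true_pair, read_len)
    return [
        pair
        for pair in combinations_with_replacement(sorted(ALLELES), 2)
        if read_set(pair, read_len) == target
    ]


if __name__ == "__main__":
    for read_len in (8, 9):
        print(read_len, indistinguishable_pairs(("A*01", "A*04"), read_len))
```

With these toy sequences, a read of length 8 can never span both variant sites, so the pair A*01/A*04 yields the same reads as the pair A*02/A*03 and the two genotypes cannot be told apart. At length 9, one read spans both sites and only the true pair remains. In real data the distance between variant sites is far larger, which is why resolving such ambiguity by read length alone increases analysis time and complexity.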