1. Field of the Invention
The present invention relates to bioinformatics technology, and more particularly, to a system and method for efficiently managing a vast amount of read data and genetic information obtained from the read data.
2. Description of the Related Art
Recently, research and development in the genome technology of the bio-industry has been increasing. Major global companies such as “Genome Quest,” “Knome,” and “Complete Genomics” have commercialized DNA sequencing technology called “next-generation sequencing (NGS)” and are providing NGS services. In Korea, a company called “Tera-gen” has recently launched a similar service. The NGS technologies have many potential uses in various fields including a genome field for displaying data extracted through NGS, a bio-industry field for providing genome analysis services, a genome research field for providing data, and a medical field that utilizes genome data in diagnosis and treatment.
The amount of data obtained using next-generation DNA sequencing technology is vast. For example, approximately 3.5 billion pieces of data are obtained from one human sample. For efficient analysis, retrieval, and display of the obtained data, it is very important to develop database establishment and data processing technology (that is, genetic information management technology).
Conventional genetic information management technologies include sequence alignment/map (SAM) tools, generic genome browser (GBrowse), and integrative genomics viewer (IGV).
SAM tools, described in the academic journal “Bioinformatics” in 2009, provide a method of effectively storing read data obtained through NGS. They introduce the SAM and binary alignment/map (BAM) file formats, which reduce total data size and allow data to be extracted within a short period of time.
In the SAM file format, header lines begin with the character ‘@’, and each alignment record is a tab-delimited line containing eleven essential columns, as shown in Table 1.
TABLE 1
#    Name    Description
1    QNAME   Query NAME of the read or the read pair
2    FLAG    bitwise FLAG (pairing, strand, mate strand, etc.)
3    RNAME   Reference sequence NAME
4    POS     1-based leftmost POSition of clipped alignment
5    MAPQ    MAPping Quality (Phred-scaled)
6    CIGAR   extended CIGAR string (operations: MIDNSHP)
7    MRNM    Mate Reference NaMe (‘=’ if same as RNAME)
8    MPOS    1-based leftmost Mate POSition
9    ISIZE   inferred Insert SIZE
10   SEQ     query SEQuence on the same strand as the reference
11   QUAL    query QUALity (ASCII-33 = Phred based quality)
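As an illustration of the record layout in Table 1, the following minimal Python sketch splits one tab-delimited SAM line into its eleven columns and decodes the QUAL string into Phred scores. The names SamRecord, parse_sam_line, phred_scores, and is_reverse_strand, as well as the sample read, are hypothetical and chosen only for this example. They are not part of SAM tools, and the sketch ignores any optional fields that may follow the eleventh column.

```python
from typing import List, NamedTuple, Optional


class SamRecord(NamedTuple):
    """The eleven essential SAM columns of Table 1 (hypothetical container)."""
    qname: str
    flag: int
    rname: str
    pos: int      # 1-based leftmost position
    mapq: int
    cigar: str
    mrnm: str     # '=' if same as rname
    mpos: int     # 1-based leftmost mate position
    isize: int
    seq: str
    qual: str


def parse_sam_line(line: str) -> Optional[SamRecord]:
    """Parse one SAM line; return None for header lines starting with '@'."""
    if line.startswith("@"):
        return None
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-delimited columns")
    (qname, flag, rname, pos, mapq, cigar,
     mrnm, mpos, isize, seq, qual) = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq), cigar,
                     mrnm, int(mpos), int(isize), seq, qual)


def phred_scores(qual: str) -> List[int]:
    """Decode QUAL: each character's ASCII code minus 33 is a Phred score."""
    return [ord(ch) - 33 for ch in qual]


def is_reverse_strand(flag: int) -> bool:
    """Test FLAG bit 0x10, which marks a read mapped to the reverse strand."""
    return bool(flag & 0x10)


# Made-up sample record, for illustration only.
sample = "read_001\t99\tchr1\t100\t60\t8M\t=\t150\t58\tACGTACGT\tIIIIIIII"
record = parse_sam_line(sample)
```

Because every record is an independent line of text, locating or changing a single read in this format generally means scanning or rewriting the file, which is relevant to the problems discussed later in this section.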
Data in the SAM format can be converted into the BAM format, which is a binary format. This conversion enables rapid extraction of information from the data and reduces the required storage space. To actually access the data, a specially designed program called “samtools” must be used.
GBrowse is a genetic information browser utilized by many research institutes worldwide. Although GBrowse is based on a database called MySQL, it can also manage file-based data. However, it cannot store or manage read data in the database. In addition, since the volume of NGS data is far larger than that of general genetic information, GBrowse cannot be applied to NGS data as it is. Attempts were therefore made in 2009 to display NGS data, and as a result, GBrowse has been modified to be able to display read data.
Lastly, IGV is a browser developed to display genetic information on a local computer. IGV is designed to handle not only NGS data but also experimental data such as microarray data. In the case of NGS data, data in the SAM or BAM file format may be received as input. A user may install this tool on his or her computer, obtain a necessary file, and import the file into the tool.
The above conventional technologies have the following problems.
First, it is almost impossible to modify only a desired part of the data used in the conventional technologies. Because the data must be generated in the SAM format and then converted into a BAM file, which is a binary format, any modification, even of a small part of the data, requires the entire file to be generated again.
Second, it is difficult with the conventional technologies to determine whether a piece of data is redundant when data is generated, added, or deleted. When a piece of data is modified, the entire data set must be checked to find duplicates of that piece of data. Likewise, when necessary data is extracted, the entire data set must be checked to find duplicates of the extracted data.
Third, the conventional technologies are not intended for multiple users. Therefore, when multiple users simultaneously access the same data, the accesses cannot be controlled, nor can necessary rules be applied. Hence, a separate program must be devised to handle simultaneous accesses to the same data by multiple users.
Fourth, data integrity processing is difficult with the conventional technologies. Data integrity concerns preventing a user from modifying or deleting data without authorization. To ensure data integrity, each piece of data must be associated with a system account, or a special tool for managing the data must be developed. The security of genetic information of living things is very important. In particular, human genetic information must be protected with a higher level of security than that applied to resident registration numbers of individuals. The difficulty of integrity processing is therefore a clear problem.
Lastly, the conventional technologies have no function for recovering data that has been damaged. Therefore, a data recovery function must be implemented separately, or a separate data recovery program must be operated. From an industrial standpoint, data stability is as crucial an issue as data integrity.