Field of the Disclosure
The present disclosure relates to DNA read mapping. More specifically, the present disclosure relates to DNA read mapping utilizing cloud computing, for example, a commercial cloud computing service.
Description of the Related Art
The rapid advance of human genomics technologies has not only revolutionized life science but also profoundly impacted the development of computing technologies. At the core of this scientific revolution is the emergence of high-throughput sequencing technologies (often collectively referred to as Next Generation Sequencing (NGS)). Today, a single sequencer can generate millions of short DNA sequences (called reads), each read comprising a 30 to 120 base-pair long sequence of a genome having over a billion nucleotides. To interpret read sequences, the reads are aligned with publicly available human DNA sequences (called reference genomes). This step, known as read mapping, identifies the positions of the reads within a reference genome and other features (e.g., whether a sequence originates from a human or from microbes associated with a human).
Read mapping is, in general, a prerequisite for most DNA sequence analyses and is an important step in sequencing human DNA. The analysis generally involves intensive computation given the huge size of the reference genome (6 billion nucleotides) and the complexity of the mapping operation. Read mapping includes calculating edit distances between reads and all the substrings of the reference genome. As such, read mapping is time and labor intensive and often expensive.
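By way of illustration only, the edit-distance calculation referred to above can be sketched as follows. This is a minimal sketch and not the method of the present disclosure: the function names edit_distance and map_read, the max_edits threshold, and the toy sequences are assumptions introduced for the example. The second function uses a standard dynamic-programming variant in which the first row of the table is zero, so that a match may begin at any position of the reference. Its running time is proportional to the read length multiplied by the reference length, which suggests why mapping millions of reads against a genome of billions of nucleotides is computationally intensive.

```python
from __future__ import annotations


def edit_distance(a: str, b: str) -> int:
    """Edit (Levenshtein) distance between two sequences, two-row DP."""
    prev = list(range(len(b) + 1))  # distance from empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # deleting i characters of a to reach empty b
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read: str, reference: str, max_edits: int) -> list[tuple[int, int]]:
    """Return (end_position, distance) pairs for reference substrings
    within max_edits of the read (hypothetical helper for illustration).

    end_position is the 1-based index of the last reference character of
    the best substring ending there. The first DP row is all zeros so the
    alignment may start anywhere in the reference.
    """
    prev = [0] * (len(reference) + 1)
    for i, cr in enumerate(read, start=1):
        curr = [i]
        for j, cg in enumerate(reference, start=1):
            cost = 0 if cr == cg else 1
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + cost))
        prev = curr
    return [(j, d) for j, d in enumerate(prev) if j > 0 and d <= max_edits]


if __name__ == "__main__":
    # Toy data, not real genomic sequences.
    print(edit_distance("ACGT", "ACCT"))
    print(map_read("ACGT", "TTACGTTT", max_edits=0))
```

A practical read mapper would not scan the whole reference for every read in this way; the sketch only makes the cost of the naive calculation concrete.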
With the fast-growing volume of sequence data produced by NGS, the demand for mapping such data is increasingly hard to meet with the computing power available within organizations. To meet this demand, outsourcing read mapping to low-cost commercial clouds, for example Amazon Elastic Compute Cloud (EC2), which can process terabytes of data at a low price (e.g., $0.10 per CPU hour), is one option previously considered for handling this large, data-sensitive task. However, commercial cloud outsourcing creates a serious privacy risk regarding the sequence information and identity information of the sequence donors, and exposure of that information may lead to denied access to health, life, or disability insurance and to educational or employment opportunities. Previously explored commercial computing techniques have not been able to scale read mapping while protecting identifying information from attacks. In recognition of the shortfall of current options, and in order to protect sequence donors, the National Institutes of Health (NIH) has thus far not allowed any datasets involving human DNA to be handed over to the public cloud.
Another previously explored avenue for addressing this problem is secure computation outsourcing (SCO). However, existing approaches have thus far been unable to provide secure read mapping on a commercial cloud. Traditional SCO techniques, such as homomorphic encryption, secret sharing, and secure multi-party computation (SMC), are too heavyweight to sustain a data-intensive computation involving terabytes of data; that is, the computational time needed to process each piece of data makes SCO impractical for most applications. For example, a previously proposed privacy-preserving protocol takes 3 minutes to calculate the edit distance between two 25-element sequences through homomorphic encryption and oblivious transfers. Other secret-sharing-based approaches all require an immense amount of data to be exchanged between different share holders during a computation and are therefore hard to scale. In addition, secret sharing techniques do not relieve the NIH of the above-mentioned legal burdens, which cloud providers are either unwilling to undertake or would have to significantly raise the prices of their services to undertake.