Bio-informatics sequence search is a common task in the drug discovery process. BLAST is one of the most widely used next-generation sequencing research tools. BLAST performs sequence similarity search and comparison using heuristic methods. Scaling up sequence search tools like BLAST is challenging: they must handle large amounts of genome data and a large number of concurrent requests while returning results in a reasonable amount of time and at reasonable cost.
Sequence search relies on several tools, such as BLAST and BLAT. These tools are similar in architecture but implement different search algorithms. Some existing solutions describe how to re-implement algorithms like BLAST using frameworks like MapReduce, but such re-implementations are difficult to write and to keep up to date as the original algorithm implementations advance. The current invention describes a solution for making sequence search tools faster, more secure, and more cost effective using cloud computing infrastructure and techniques. The current invention uses BLAST as an example to describe the techniques used, but they apply to any similar sequence search tool, such as BLAT.
The BLAST heuristic tries to create an alignment by finding regions of local similarity. The identification of such local alignments between two sequences was proposed by Smith and Waterman. The search heuristic indexes the query and target sequences into words of a chosen size, finds short matches between the two sequences, and creates alignments by extending the matched hot spots. In addition, it provides statistical data regarding the alignment, including the ‘expect’ value (the number of matches of similar score expected by chance), which serves as a measure of the false-positive rate. Both FASTA (Pearson and Lipman 1988) and NCBI BLAST rely on this word-based approach to provide fast and flexible alignments against huge databases.
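The word indexing and hot-spot extension described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the BLAST implementation: it uses exact word matches only, extends without gaps, and omits the statistical scoring. The function names, the scoring values (match=1, mismatch=-1) and the drop-off threshold are hypothetical choices made for illustration.

```python
from collections import defaultdict


def index_words(seq, k):
    """Map every length-k word in seq to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def extend_seed(query, target, q, t, k, match=1, mismatch=-1, drop=2):
    """Extend an exact k-word hit in both directions without gaps.

    Extension in a direction stops once the running score falls more than
    `drop` below the best score seen so far (illustrative threshold).
    Returns (best score, query start, query end), end exclusive.
    """
    score = best = k * match
    best_right = steps = 0
    # Extend to the right of the seed.
    while q + k + steps < len(query) and t + k + steps < len(target):
        same = query[q + k + steps] == target[t + k + steps]
        score += match if same else mismatch
        steps += 1
        if score > best:
            best, best_right = score, steps
        elif best - score > drop:
            break
    # Extend to the left, starting again from the best score so far.
    score = best
    best_left = steps = 0
    while q - steps - 1 >= 0 and t - steps - 1 >= 0:
        same = query[q - steps - 1] == target[t - steps - 1]
        score += match if same else mismatch
        steps += 1
        if score > best:
            best, best_left = score, steps
        elif best - score > drop:
            break
    return best, q - best_left, q + k + best_right


def find_alignments(query, target, k=3):
    """Seed on shared k-words, extend each seed, return unique hits by score.

    Each hit is (score, query start, query end, target start).
    """
    index = index_words(target, k)
    hits = set()
    for q in range(len(query) - k + 1):
        for t in index.get(query[q:q + k], []):
            score, q_start, q_end = extend_seed(query, target, q, t, k)
            hits.add((score, q_start, q_end, t - (q - q_start)))
    return sorted(hits, reverse=True)
```

For example, find_alignments("ACGTACGT", "TTACGTACGTTT") ranks first the hit that covers the whole query starting at target position 2. Seeds on the same diagonal extend to the same segment, which is why the hits are collected in a set.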
BLAST can be used in different ways, as a standalone application or via a web interface, to compare an input query against a database of sequences. BLAST is a computationally intensive technique, though the computation is embarrassingly parallel. To exploit this inherent parallelism, researchers have made several parallelization attempts in order to process massive data faster. For example, Soap-HT-BLAST, MPIBLAST, GridBLAST, WNDBLAST, Squid, ScalaBlast and GridWorm use an infrastructure model that focuses on low-level details such as MPI message-passing libraries or grid frameworks like Globus. However, their installation and maintenance are quite complicated. Y. Sun et al. have implemented an ad-hoc grid solution of BLAST in which the computation does not take place where the data resides. M. Gaggero et al. have used the core GSEA algorithm for a parallel implementation of BLAST on top of Hadoop. BlastReduce is a parallel read-mapping algorithm implemented in Java with Hadoop; it uses the Landau-Vishkin algorithm (a seed-and-extend alignment algorithm) to optimize the mapping of short reads. Twister BLAST is a parallel BLAST application based on the Twister MapReduce framework. Yet another implementation, Biodoop, uses three algorithms: BLAST, GSEA and GRAMMAR. CloudBlast is another popular implementation of BLAST that uses the Hadoop MapReduce framework to support BLAST on a cloud platform and has been reported to give better performance than MPIBLAST. Azure BLAST is similar to CloudBlast in computing style but is supported by the Azure cloud platform rather than MapReduce. BLAST has also been ported to EC2-taskFarmer, Franklin-taskFarmer, and EC2-Hadoop. BLAST has also been parallelized at the hardware level. The first hardware BLAST accelerator was reported by R. K. Singh. TimeLogic has commercialized an FPGA-based accelerator called the DeCypher BLAST hardware accelerator.
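The embarrassingly parallel structure mentioned above can be sketched as follows: the sequence database is partitioned into shards, each shard is searched independently, and the partial results are merged. This is an illustrative sketch only and does not reproduce any of the cited systems. The similarity score (a count of shared words) is a hypothetical stand-in for a real BLAST search, and a thread pool stands in for the separate processes or machines that an actual deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, n_shards):
    """Split (id, sequence) records round-robin into n_shards lists."""
    shards = [[] for _ in range(n_shards)]
    for i, record in enumerate(records):
        shards[i % n_shards].append(record)
    return shards


def shared_words(query, seq, k=3):
    """Stand-in similarity score: number of distinct k-words in common."""
    query_words = {query[i:i + k] for i in range(len(query) - k + 1)}
    seq_words = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return len(query_words & seq_words)


def search_shard(query, shard):
    """Score the query against every sequence in one shard."""
    return [(shared_words(query, seq), seq_id) for seq_id, seq in shard]


def parallel_search(query, records, n_shards=2, top=3):
    """Search all shards concurrently and merge the per-shard hits."""
    shards = partition(records, n_shards)
    with ThreadPoolExecutor(max_workers=n_shards) as pool:
        partial = list(pool.map(lambda shard: search_shard(query, shard), shards))
    hits = [hit for part in partial for hit in part]
    # Highest score first; ties broken by sequence id for a stable order.
    return sorted(hits, key=lambda hit: (-hit[0], hit[1]))[:top]
```

Because each shard is scored independently, the merged ranking does not depend on the number of shards. With a real BLAST search, statistics such as the ‘expect’ value depend on the database size, so a partitioned search would also need them computed against the size of the full database rather than the shard.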
Ensembl is a joint project between EMBL-EBI and the Sanger Centre. Ensembl produces genome databases for vertebrates and other eukaryotic species and provides a web-based solution for searching the genome sequences using the BLAST algorithm. Ensembl does not offer security for the search operations. Several pharmaceutical organizations are unable to use the sequence search services offered by Ensembl because they are concerned that competitors could eavesdrop on the sequence searches performed by their scientists, leading to the loss of proprietary and confidential information. Another challenge with Ensembl is that its performance is not predictable. As the number of concurrent requests increases, sequence search operations performed through the Ensembl web application take more time, costing scientists productive time and thus delaying the drug discovery process, with a consequential loss of revenue. The alternative is to host a mirror of Ensembl internally, but that is not cost effective.
The existing sequence search solutions are neither scalable nor cost effective, and they do not provide the security or the features, such as public-private data interlinking, needed for use in large pharmaceutical companies. The present technologies use a constant pool of infrastructure irrespective of the workload.
Thus, there is a need to overcome the problems of the existing technologies. Therefore, the present inventors have developed a computer-implemented method, system and computer readable medium for providing a scalable bio-informatics sequence search on the cloud, which addresses the problems of scalability, security, interlinking of public and private data sets, application of access controls, efficient partitioning of data, parallelization for faster sequence search processing, and cost efficiency in bio-informatics sequence search.