In many industries, information technology platforms for technical computing have become key differentiators and drivers of business growth. For companies in these industries, including the biotechnology, pharmaceutical, geophysical, automotive, and aerospace industries, among others, business-critical technical computing tasks have become complex, often involving multiple computational steps with large quantities of data passing from one step to the next. In addition, the individual computational steps themselves are often computationally intensive and very time consuming.
In contemporary genomics, for example, researchers often query dozens of different databases using a variety of search and analysis tools. To handle these complex tasks efficiently and facilitate distributed processing, data may be moved from one analysis step to the next in “pipelined” or “workflow” environments that chain a series of connected tasks together to achieve a desired output. Researchers can use workflow environments to design, share, and execute workflows visually, as opposed to manual execution or the generation of a series of shell scripts. Such workflow environments can be used to perform basic local sequence alignment and search, molecular biology data analyses, and genome-wide association studies, among others. These workflow environments, which are both computationally intensive and logistically complex, are becoming essential to handling the large amount of data coming out of modern, highly automated laboratories.
Workflows in workflow environments typically comprise a series of connected steps each having one or more inputs, one or more outputs, a set of arguments, and an application that performs the task associated with the step. The outputs are then provided as inputs to the next step in the workflow environment. For example, with reference to FIG. 1, a common alignment and variant calling workflow called “bowtie_gatk” includes a Burrows-Wheeler Transform (BWT) based aligner (Bowtie2) for human genome resequencing and the Genome Analysis Toolkit (GATK) for variant calling. Given a set of sequence reads and a reference genome, the bowtie_gatk workflow may: (1) trim adapter sequences from the sequence reads; (2) perform a reference alignment using the Bowtie2 aligner; (3) sort the resulting alignment; and (4) call variants using GATK.
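The step structure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Step` class, the `run_workflow` function, the file names, and the four placeholder applications are hypothetical, and the placeholders merely derive output file names rather than invoking the actual trimming, Bowtie2, sorting, or GATK tools.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Files = Dict[str, str]  # maps a logical name to a (hypothetical) file path


@dataclass
class Step:
    """One workflow step: an application plus its inputs, outputs, and arguments."""
    name: str
    app: Callable[[Files, Dict[str, str]], Files]
    inputs: List[str]
    outputs: List[str]
    args: Dict[str, str] = field(default_factory=dict)


def run_workflow(steps: List[Step], initial: Files) -> Files:
    """Run steps in order, making each step's outputs available to later steps."""
    available = dict(initial)
    for step in steps:
        step_inputs = {name: available[name] for name in step.inputs}
        produced = step.app(step_inputs, step.args)
        missing = set(step.outputs) - set(produced)
        if missing:
            raise ValueError(f"{step.name} did not produce {sorted(missing)}")
        available.update({name: produced[name] for name in step.outputs})
    return available


# Placeholder applications (assumptions for illustration): each one only
# derives an output file name and does not run a real tool.
def trim(inputs, args):
    return {"trimmed_reads": inputs["reads"] + ".trimmed"}


def align(inputs, args):
    return {"alignment": inputs["trimmed_reads"] + ".sam"}


def sort_alignment(inputs, args):
    return {"sorted_alignment": inputs["alignment"] + ".sorted"}


def call_variants(inputs, args):
    return {"variants": inputs["sorted_alignment"] + ".vcf"}


bowtie_gatk = [
    Step("trim_adapters", trim, ["reads"], ["trimmed_reads"]),
    Step("align", align, ["trimmed_reads", "reference"], ["alignment"]),
    Step("sort", sort_alignment, ["alignment"], ["sorted_alignment"]),
    Step("call_variants", call_variants, ["sorted_alignment", "reference"], ["variants"]),
]

result = run_workflow(bowtie_gatk, {"reads": "sample.fastq", "reference": "ref.fa"})
print(result["variants"])
```

Each step reads only the named inputs it declares, so the chain of dependencies from the sequence reads to the called variants is explicit in the step definitions.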
While Bowtie2 advantageously provides a fast and memory-efficient tool for aligning sequencing reads to long reference sequences and GATK offers a wide variety of tools for separating data-access patterns from analysis algorithms, several kinds of problems may affect efforts to combine these tools in large-scale workflow environments. First, failures may occur at points along the workflow. These can be caused by an incorrect configuration (e.g., an end user specifying the wrong tool or parameter), or may be attributed to the workflow environment itself. If a failure occurs at a later stage of the workflow, it may be necessary to restart the entire workflow from the beginning, thus wasting time and computational resources. Second, workflows often have many shared steps, which can lead to processing redundancies. For example, a researcher may decide to vary the above workflow to use a different alignment tool and a different variant caller in order to compare results. Even though both workflows use the same initial adapter trimming step, each workflow is executed independently. Similarly, several users operating on the same original data set may generate workflows that are nearly identical but separately executed. In workflow environments leveraging cloud computing resources, these redundancies may lead to significant costs, both monetary and in terms of resource utilization.
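The redundancy described above can be made concrete with a short Python sketch that counts how many step executions repeat work already done elsewhere. It assumes, purely for illustration, that two steps are equivalent when they apply the same tool to the same chain of upstream steps and the same source data; the names `other_aligner`, `other_caller`, and `reads.fastq` are hypothetical, and the resulting counts apply only to these three example workflows.

```python
def step_keys(workflow, source):
    """Return one key per step: the source data plus every tool up to that step."""
    keys = []
    upstream = (source,)
    for tool in workflow:
        upstream = upstream + (tool,)
        keys.append(upstream)
    return keys


workflows = [
    # The original workflow.
    ["trim_adapters", "bowtie2", "sort", "gatk"],
    # A variation with a different aligner and a different variant caller.
    ["trim_adapters", "other_aligner", "sort", "other_caller"],
    # A second user's workflow, identical to the original.
    ["trim_adapters", "bowtie2", "sort", "gatk"],
]

all_keys = [key for wf in workflows for key in step_keys(wf, "reads.fastq")]

total = len(all_keys)        # steps executed when each workflow runs independently
unique = len(set(all_keys))  # distinct steps under the equivalence assumption
redundant = total - unique   # executions that repeat an equivalent step

print(total, unique, redundant)
```

Under these assumptions, the shared adapter trimming step and every step of the duplicated workflow count as redundant executions, which is the kind of waste that independent execution incurs.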
Users aware of these issues may redesign their workflows to improve efficiency. For example, a savvy user may decide to combine four similar workflows together into a single workflow. An initial adapter trimming step may provide an output file to separate alignment steps, which in turn provide output files to separate variant calling steps. This reduces the number of operations from 16 (4×4 workflows) to only 7. However, this requires extra planning and time on the part of the user, and that effort does not transfer to other users. For example, the records associating files with the steps in which they originate or are used can be obfuscated, and in any case it is challenging and cumbersome to maintain explicit records relating each of the different possible combinations of workflow steps to their associated files.
Accordingly, there is a need for workflow environments that process data in a more efficient manner without requiring significant planning from the user.