The completion of the first human reference genome enabled the discovery of the whole catalogue of human genes, ushering in a new era of genomics research to discover the molecular basis of disease. More recently, so-called next-generation sequencing (NGS) technologies have made it routine to sequence entire genomes within days and at low cost. The number of fully sequenced genomes continues to grow, and with it our understanding of human genetic variation. For example, the 1000 Genomes Project is an international collaboration that seeks to provide a comprehensive description of common human genetic variation by performing whole-genome sequencing of a diverse set of individuals from multiple populations. To that end, the 1000 Genomes Project has sequenced the genomes of over 2,500 unidentified people from about 25 populations around the world. See “A global reference for human genetic variation”, Nature 526, 68-74 (2015). This has led to new insights regarding the history and demography of ancestral populations, the sharing of genetic variants among populations, and the role of genetic variation in disease. Further, the sheer number of genomes has greatly increased the resolution of genome-wide association studies, which seek to link various genetic traits and diseases with specific genetic variants.
The path from sequencer output to scientifically and clinically significant information can be difficult even for a skilled geneticist or an academic researcher. Sequencer output is typically in the form of data files for individual sequence reads. Depending on the project goals, these reads may need to be quality checked, assembled, aligned, compared to the literature or to databases, segregated from one another by allele, evaluated for non-Mendelian heterozygosity, searched for known or novel variants, or subjected to any of many other analyses. Such analytical processes are often modelled as computational workflows, in which the output of one step (e.g., software that performs quality checking) is provided as an input to another (e.g., software that performs sequence alignment).
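By way of illustration only, the chaining of steps described above can be sketched in a few lines of Python. The function names (quality_check, align, run_workflow), the string representation of reads, and the exact-match "alignment" are all assumptions made for this sketch; they stand in for real quality-control and alignment software and do not correspond to any particular tool or workflow system.

```python
from typing import List, Tuple

# Hypothetical representation: a read is a plain string of bases.
Read = str


def quality_check(reads: List[Read]) -> List[Read]:
    """Toy quality filter: keep non-empty reads made only of A, C, G, T."""
    return [r for r in reads if r and set(r) <= set("ACGT")]


def align(reads: List[Read], reference: str) -> List[Tuple[Read, int]]:
    """Toy aligner: first exact-match offset of each read, or -1 if absent."""
    return [(r, reference.find(r)) for r in reads]


def run_workflow(reads: List[Read], reference: str) -> List[Tuple[Read, int]]:
    """Two-step workflow: the output of quality_check is the input to align."""
    passed = quality_check(reads)
    return align(passed, reference)


if __name__ == "__main__":
    for read, offset in run_workflow(["ACGT", "ACNT", "GTAC"], "TTACGTACGG"):
        print(read, offset)
```

In this sketch the read containing an ambiguous base (N) is dropped by the first step and never reaches the second, which is the essential property of a workflow: each step consumes only what the preceding step produced.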
Today, computational workflows are commonly used in many bioinformatics and genomics analyses. Computational workflows may consist of dozens of tools with hundreds of parameters to handle a variety of use cases and data types. Various computational workflow systems exist, including Taverna and the Graphical Pipeline for Computational Genomics (GPCG). See Wolstencroft et al., “The Taverna workflow suite: designing and executing workflows of Web Services on the desktop, web or in the cloud,” Nucleic Acids Research, 41(W1): W557-W561 (2013); Torri et al., “Next generation sequence analysis and computational genomics using graphical pipeline workflows,” Genes (Basel), 3(3): 545-575 (2012).
As the complexity of an individual workflow increases to handle a variety of use cases or criteria, it becomes more challenging to execute the workflow optimally. For example, analyses may incorporate nested workflows, business logic, memoization, parallelization, or the ability to restart failed workflows, or may require parsing of metadata, all of which compound the challenges in optimizing workflow execution. Further, increases in complexity make it challenging to port computational workflows to different environments or systems, which can lead to a lack of reproducibility. As a result of the increasing volume of biomedical data, analytical complexity, and the scale of collaborative initiatives focused on data analysis, reliable and reproducible analysis of biomedical data has become a significant concern. Accordingly, there is a need for improvements in computational workflow execution.