Many scientific and engineering experiments explore different computer applications and settings. These applications are chained, as the data produced by one application is consumed by subsequent ones. Large volumes of data of different types can be explored throughout this chain of applications. This process demands a certain level of control in order to guarantee the reliability and reproducibility of the experiment. The experimental process is often supported by scientific workflows on High Performance Computing (HPC) environments, and such workflows are therefore typically referred to as HPC workflows. Using HPC workflow management systems, scientists and engineers can manage the execution of their applications, their data flow and their provenance data.
Within these workflows, various tasks are executed and combined. As there are usually various alternative applications for each task, a single workflow can have many different execution plans. There are many variables involved in the design and execution of workflows. Thus, there are also many opportunities for optimization, targeting different goals, such as execution time, resource utilization and accuracy, among others.
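The growth in the number of execution plans can be illustrated with a minimal sketch. The task names and application names below are hypothetical and serve only to show that, when one application is chosen independently for each task, the number of plans is the product of the number of alternatives per task.

```python
from itertools import product

# Hypothetical workflow: each task maps to the alternative
# applications that could perform it (names are illustrative only).
alternatives = {
    "preprocess": ["filter_a", "filter_b"],
    "simulate": ["solver_x", "solver_y", "solver_z"],
    "analyze": ["stats_1", "stats_2"],
}

def execution_plans(alternatives):
    """Return every plan as a dict that picks one application per task."""
    tasks = list(alternatives)
    choices = product(*(alternatives[task] for task in tasks))
    return [dict(zip(tasks, choice)) for choice in choices]

plans = execution_plans(alternatives)
print(len(plans))  # 2 * 3 * 2 = 12 plans
```

Even in this small case, three tasks yield twelve candidate plans, and each additional task or alternative multiplies that count. Exhaustive enumeration of this kind is assumed here only for illustration and quickly becomes impractical for realistic workflows.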
In order to provide reliability and reproducibility, it is necessary to save workflow provenance data. Provenance data can provide a rich source of information about the behavior of the workflow under different circumstances. In addition, provenance data can be instrumental in optimizing the workflow for different scenarios and goals.
Over time, HPC workflows are executed many times and their provenance databases grow very quickly in size. The velocity of data ingestion is very high because many workflows can be executed simultaneously and every task of a workflow typically stores provenance data continuously. The provenance data must be analyzed efficiently, in particular when there is a need to optimize the workflow at run-time. The variety of data is also an issue because several types of data are usually accessed within a scientific domain.
A need exists for improved techniques for analytical processing of provenance data for optimization of HPC workflow execution. A further need exists for improved techniques for capturing large amounts of provenance data efficiently and quickly in a distributed environment, without compromising the overall performance of the workflow execution.