Recently, the volume of information generated has been growing at an exponential rate. Since 2003, new information generated annually exceeds the amount of information created in all previous years. Digital information now makes up more than 90% of all information produced, vastly exceeding data generated on paper and film. One of the greatest scientific and engineering challenges of the 21st century is to effectively understand and leverage this growing wealth of data. Computational processes are widely used to analyze, understand, integrate, transform, and generate data. For example, to understand trends in multi-dimensional data in a data warehouse, analysts generally go through an often time-consuming process of iteratively drilling down and rolling up through the different axes to find interesting ‘nuggets’ in the data. To mine data, various third-party applications may be used to process and analyze the data and to present results using a graphical user interface. There are also applications that are used to generate data, e.g., movies and games. Due to their exploratory nature, these tasks sometimes involve large numbers of trial-and-error steps.
Ad-hoc approaches to data analysis, generation, exploration, integration, and transformation are currently used, but these approaches have serious limitations. In particular, users (e.g., scientists and engineers) need to expend substantial effort managing data and recording provenance information so that basic questions can be answered about who created and/or modified a data product and when, what process was used to create the data product, and whether or not two data products are derived from the same raw data. Provenance information (also referred to as audit trail, lineage, and pedigree) captures information about the steps used to generate a given data product. As a result, provenance information provides important documentation that is key to preserving the data, to determining the data's quality and authorship, to reproducing the data, and to validating the results. Recording this information manually is time-consuming and error-prone. The absence of systematic mechanisms that capture provenance information makes it difficult (and sometimes impossible) to reproduce and share results, to solve problems collaboratively, to validate results with different input data, to understand the process used to solve a particular problem, and to re-use the knowledge involved in creating or following a process. Additionally, the longevity of the data products may be limited without precise and adequate information related to how the data product was generated.
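As an illustration only, the basic provenance questions listed above (who created a data product and when, what process was used, and whether two data products are derived from the same raw data) can be sketched with a small record structure. The names ProvenanceRecord, raw_sources, and share_raw_data, as well as the file names and authors, are hypothetical and are not taken from any existing system. The sketch assumes that any product without a record is raw data and that the derivation graph contains no cycles.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    """Hypothetical record answering: who, when, by what process, from what."""
    product: str          # name of the derived data product
    author: str           # who created or modified it
    created_at: datetime  # when it was created
    steps: tuple          # ordered names of the processing steps applied
    inputs: tuple         # names of the products or raw data consumed


def raw_sources(product, records):
    """Return the raw data names a product ultimately derives from.

    Assumption: a name with no record is raw data, and the graph is acyclic.
    """
    record = records.get(product)
    if record is None:
        return {product}
    sources = set()
    for parent in record.inputs:
        sources |= raw_sources(parent, records)
    return sources


def share_raw_data(a, b, records):
    """True if products a and b derive from at least one common raw input."""
    return bool(raw_sources(a, records) & raw_sources(b, records))


# Illustrative data only; every name and date below is made up.
when = datetime(2007, 1, 5, tzinfo=timezone.utc)
records = {
    "smoothed.vtk": ProvenanceRecord(
        "smoothed.vtk", "alice", when, ("load", "smooth"), ("scan_01.raw",)),
    "isosurface.png": ProvenanceRecord(
        "isosurface.png", "bob", when, ("contour", "render"), ("smoothed.vtk",)),
    "histogram.png": ProvenanceRecord(
        "histogram.png", "bob", when, ("bin", "plot"), ("scan_02.raw",)),
}

same = share_raw_data("smoothed.vtk", "isosurface.png", records)
different = share_raw_data("isosurface.png", "histogram.png", records)
```

Here same is True because both products trace back to scan_01.raw, and different is False because histogram.png traces back only to scan_02.raw. Maintaining such records by hand for every data product is the effort the paragraph above describes.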
Although for simple exploratory tasks manual approaches to provenance management may be feasible, that is not the case for complex computational tasks that involve large volumes of data and/or involve a large number of users. The problem of managing provenance data is compounded by the fact that large-scale projects often require that groups with different expertise, and often in different geographic locations, collaborate to solve a problem. Consider, for example, exploratory computational tasks where users may need to select different algorithms and visualization techniques for processing and analyzing the data. The task specification is adjusted in an iterative process as the user generates, explores, and evaluates hypotheses associated with the information under study. To successfully analyze and validate various hypotheses, it is necessary to pose queries, correlate disparate data, and create insightful data products of both the simulated processes and observed phenomena.
Visualization is a key enabling technology in the comprehension of the vast amounts of data being produced because it helps people explore and explain data. A basic premise supporting the use of visualization is that visual information can be processed by a user at a much higher rate than raw numbers and text. However, constructing insightful visualizations is itself an exploratory process that requires users to go through several steps. Before users can view and analyze results, they need to assemble and execute complex pipelines (workflows) by selecting data sets, specifying a series of operations to be performed on the data, and creating an appropriate visual representation.
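The three pipeline activities named above (selecting a data set, specifying a series of operations, and creating a visual representation) can be sketched in a few lines. This is a toy illustration under stated assumptions, not the pipeline model of any particular visualization system: run_pipeline, drop_negative, normalize, and render_bars are hypothetical names, and the "visual representation" is reduced to rows of text bars.

```python
from functools import reduce


def run_pipeline(data, operations):
    """Apply each operation in order, feeding each result to the next one."""
    return reduce(lambda value, op: op(value), operations, data)


def drop_negative(values):
    """Filtering step: discard negative samples."""
    return [v for v in values if v >= 0]


def normalize(values):
    """Scaling step: divide by the peak (assumes a non-empty, positive list)."""
    peak = max(values)
    return [v / peak for v in values]


def render_bars(values, width=8):
    """Stand-in for a visual representation: one text bar per value."""
    return ["#" * int(v * width) for v in values]


# 1. Select a data set (made-up numbers).
dataset = [4.0, -1.0, 2.0, 8.0]
# 2. Specify the series of operations, ending with 3. the visual representation.
image = run_pipeline(dataset, [drop_negative, normalize, render_bars])
```

Changing the data set, the order of operations, or the width parameter produces a different image, which is why the exact configuration behind each result has to be remembered or recorded.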
Often, insight comes from comparing the results of multiple visualizations created during the exploration process. For example, insight can be gained by applying a given visualization process to multiple datasets generated in different simulations; by varying the values of certain visualization parameters; or by applying different variations of a given process (e.g., variations that use different visualization algorithms) to a dataset. Unfortunately, this exploratory process contains many manual, error-prone, and time-consuming tasks. For example, modifications to parameters or to the definition of a workflow are generally destructive, which places the burden on the user to first construct the visualization and then to remember the input data sets, parameter values, and the exact workflow configuration that led to a particular image. This problem is compounded when multiple people need to collaboratively explore data.
Workflows are emerging as a paradigm for representing and managing complex computations. Workflows can capture complex analysis processes and the creation of digital objects at various levels of detail, along with the provenance information necessary for reproducibility, result publication, and result sharing among collaborators. Because of the formalism they provide and the automation they support, workflows have the potential to accelerate and to transform the information analysis process. Workflows are rapidly replacing primitive shell scripts, as evidenced by the release of Automator by Apple®, Data Analysis Foundation by Microsoft®, and Scientific Data Analysis Solution by SGI®.
Scientific workflow systems have recently started to support capture of data provenance. However, different systems capture different kinds of data and use different models to represent these data, making it hard to combine the provenance they derive and to share/re-use tools for querying the stored data. Another important limitation of current scientific workflow systems is that they fail to provide the necessary provenance infrastructure for exploratory tasks. Although these systems are effective for automating repetitive tasks, they are not suitable for applications that are exploratory in nature, where change is the norm. Obtaining insights involves the ability to store temporary results, to make inferences from stored knowledge, to follow chains of reasoning backward and forward, and to compare several different results. Thus, during an exploratory computational task, as hypotheses are created and tested, a large number of different, albeit related, workflows are created. By focusing only on the provenance of derived data products, existing workflow systems fail to capture data about the evolution of the workflow (or workflow ensembles) created by users to solve a given problem. The evolution of workflows used in exploratory tasks, such as data analysis, contains useful knowledge that can be shared and re-used, and the underlying information can be leveraged to simplify exploratory activities. There are also applications for assembling computational tasks and deriving digital objects that are not represented as explicit workflows, but that share similar limitations when it comes to provenance capture.
Currently, there are no general provenance management systems or tools that can be used in conjunction with pre-existing applications, including word processors, web browsers, and generally any GUI-based, event-driven application. For these applications, users who do not have the resources or expertise to build the needed provenance infrastructure resort to creating and maintaining this information manually, which greatly hinders their ability to do large-scale and/or complex data exploration and processing. Even when the resources are available, application-dependent solutions are not general and can be hard to re-use in different settings and applications, causing problems with interoperability due to differences in the provenance models used across systems. Thus, what is needed is a method and a system for providing provenance infrastructure and design systems that are flexible and adaptable to the wide range of requirements of various pre-existing software applications.