Computers are very powerful tools for processing data. A computerized data pipeline is a useful mechanism for processing large amounts of data. A typical data pipeline is an ad-hoc collection of computer software scripts and programs for processing data extracted from “data sources” and for providing the processed data to “data sinks”. As an example, a data pipeline for a large insurance company that has recently acquired a number of smaller insurance companies may extract policy and claim data from the individual database systems of the smaller insurance companies, transform and validate the insurance data in some way, and provide validated and transformed data to various analytical platforms for assessing risk management, compliance with regulations, fraud, etc.
Between the data sources and the data sinks, a data pipeline system is typically provided as a software platform to automate the movement and transformation of data from the data sources to the data sinks. In essence, the data pipeline system shields the data sinks from having to interface with the data sources or even being configured to process data in the particular formats provided by the data sources. Typically, data from the data sources received by the data sinks is processed by the data pipeline system in some way. For example, a data sink may receive data from the data pipeline system that is a combination (e.g., a join) of data of from multiple data sources, all without the data sink being configured to process the individual constituent data formats.
One purpose of a data pipeline system is to execute data transformation steps on data obtained from data sources to provide the data in format expected by the data sinks. A data transformation step may be defined as a set of computer commands or instructions which, when executed by the data pipeline system, transforms one or more input datasets to produce one or more output or “target” datasets. Data that passes through the data pipeline system may undergo multiple data transformation steps. Such a step can have dependencies on the step or steps that precede it. One example of a computer system for carrying out data transformation steps in a data pipeline is the well-known MapReduce system. See, e.g., Dean, Jeffrey, et al., “MapReduce: Simplified Data Processing on Large Clusters”, Google, Inc., 2004.
Often, data pipeline systems are maintained “by hand”. That is, a software engineer or system administrator is responsible for configuring the system so that data transformation steps are executed in the proper order and on the correct datasets. If a data transformation step needs to be added, removed, or changed, the engineer or administrator typically must reconfigure the system by manually editing control scripts or other software programs. Similar editing tasks may be needed before the pipeline can process new datasets. Overall, current approaches for maintaining existing data pipeline systems may require significant human resources.
Another problem with existing data pipeline systems is the lack of dataset versioning. In these systems, when a dataset needs to be updated with new data, the data transformation step typically overwrites the old version of the dataset with the new version. This can be problematic if it is suspected or discovered thereafter that the old version of the dataset contained incorrect data that the new version does not contain. For example, the old version of the dataset may have been imported into an analytical software program which generated anomalous results based on the incorrect data. In this case, since the old version is lost when the new version is generated, it can be difficult to track down the source of the incorrect data.
Given the increasing amount of data collected by businesses and other organizations, processing data of all sorts through data pipeline systems can only be expected to increase. This trend is coupled with a need for a more automated way to maintain such systems and for the ability to trace and track data, including old versions of the data, as it moves through the data pipeline from data sources to data sinks.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.