Computers are very powerful tools for processing data. A computerized data pipeline is a useful mechanism for processing large amounts of data. A typical data pipeline is an ad-hoc collection of computer software scripts and programs for processing data extracted from “data sources” and for providing the processed data to “data sinks”. As an example, a data pipeline for a large insurance company that has recently acquired a number of smaller insurance companies may extract policy and claim data from the individual database systems of the smaller insurance companies, transform and validate the insurance data in some way, and provide validated and transformed data to various analytical platforms for assessing risk management, compliance with regulations, fraud, etc.
Between the data sources and the data sinks, a data pipeline system is typically provided as a software platform to automate the movement and transformation of data from the data sources to the data sinks. In essence, the data pipeline system shields the data sinks from having to interface with the data sources or even being configured to process data in the particular formats provided by the data sources. Typically, data that the data sinks receive from the data sources has been processed by the data pipeline system in some way. For example, a data sink may receive data from the data pipeline system that is a combination (e.g., a join) of data from multiple data sources, all without the data sink being configured to process the individual constituent data formats.
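As an illustration of the join described above, the following is a minimal sketch, not a definitive implementation. It assumes two hypothetical data sources, one exporting policies as CSV and one exporting claims as JSON; the field names (`policy_id`, `holder`, `amount`) and the function names are invented for this example. The data sink receives only the joined records and never parses either source format.

```python
import csv
import io
import json

# Hypothetical source exports: one source emits CSV, the other emits JSON.
POLICIES_CSV = "policy_id,holder\nP1,Ada\nP2,Grace\n"
CLAIMS_JSON = (
    '[{"policy_id": "P1", "amount": 250.0},'
    ' {"policy_id": "P1", "amount": 100.0},'
    ' {"policy_id": "P2", "amount": 75.5}]'
)


def read_policies(text):
    """Parse the CSV source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def read_claims(text):
    """Parse the JSON source into a list of dictionaries."""
    return json.loads(text)


def join_claims_to_policies(policies, claims):
    """Inner join of claims to policies on policy_id."""
    holders = {p["policy_id"]: p["holder"] for p in policies}
    return [
        {
            "policy_id": c["policy_id"],
            "holder": holders[c["policy_id"]],
            "amount": c["amount"],
        }
        for c in claims
        if c["policy_id"] in holders
    ]


# The sink consumes `joined` in a single uniform format.
joined = join_claims_to_policies(
    read_policies(POLICIES_CSV), read_claims(CLAIMS_JSON)
)
```

In this sketch the format-specific parsing lives entirely in the two reader functions, so adding a source in a new format would not require any change on the sink side.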
One purpose of a data pipeline system is to execute data transformation steps on data obtained from data sources to provide the data in a format expected by the data sinks. A data transformation step may be defined as a set of computer commands or instructions which, when executed by the data pipeline system, transforms one or more input datasets to produce one or more output or “target” datasets. Data that passes through the data pipeline system may undergo multiple data transformation steps. Such a step can have dependencies on the step or steps that precede it. One example of a computer system for carrying out data transformation steps in a data pipeline is the well-known MapReduce system. See, e.g., Dean, Jeffrey, et al., “MapReduce: Simplified Data Processing on Large Clusters”, Google, Inc., 2004.
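The dependency relationship between data transformation steps can be sketched as follows. This is one possible arrangement under stated assumptions, not the method of any particular data pipeline system: the step names, the `STEPS` table, and `run_pipeline` are hypothetical, and the ordering relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later). Each step names the datasets it depends on and is executed only after those datasets have been produced.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: target dataset name -> (names of input datasets, function).
# Each function receives a dict of its input datasets and returns the target dataset.
STEPS = {
    "raw": ((), lambda inputs: [3, -1, 4, -1, 5]),
    "cleaned": (("raw",), lambda inputs: [x for x in inputs["raw"] if x >= 0]),
    "total": (("cleaned",), lambda inputs: sum(inputs["cleaned"])),
    "count": (("cleaned",), lambda inputs: len(inputs["cleaned"])),
    "mean": (("total", "count"), lambda inputs: inputs["total"] / inputs["count"]),
}


def run_pipeline(steps):
    """Execute every step after the steps it depends on."""
    # Map each target dataset to the set of datasets that must exist first.
    graph = {name: set(deps) for name, (deps, _) in steps.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, func = steps[name]
        results[name] = func({dep: results[dep] for dep in deps})
    return results


results = run_pipeline(STEPS)
```

With the table above, `results["cleaned"]` is `[3, 4, 5]` and `results["mean"]` is `4.0`. A cycle among the declared dependencies would cause `TopologicalSorter` to raise `graphlib.CycleError` rather than execute steps in an invalid order.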
Often, data pipeline systems are maintained “by hand”. That is, a software engineer or system administrator is responsible for configuring the system so that data transformation steps are executed in the proper order and on the correct datasets. If a data transformation step needs to be added, removed, or changed, the engineer or administrator typically must reconfigure the system by manually editing control scripts or other software programs. Similarly, the engineer or administrator also “hand crafts” a variety of tests to validate the transformed datasets and ensure that no fault has occurred within the data pipeline system. For example, a validation may involve determining that the transformed dataset adheres to a proper format/schema and that data has not been lost in the process. Since the validation needs for a particular data pipeline system are often unique to a particular business client and/or pipeline, it can be very difficult to reuse code implementing fault detection tests across multiple software deployments. Furthermore, in many cases, the engineer maintaining the data pipeline system is employed by a third-party business that employs many engineers who manage many different pipelines for many different clients. As a result, the inability to share fault detection tests between software deployments consumes a significant portion of human resource time that could be better spent optimizing the data pipeline system or working on new data pipeline systems.
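The two validations mentioned above (schema adherence and no loss of data) might be hand-crafted along the following lines. This is a hedged sketch only: the schema, the field names, the `transform` step, and both check functions are hypothetical, and "data has not been lost" is approximated here by a simple row-count comparison, which is an assumption of the example rather than a definition taken from the text.

```python
# Hypothetical expected schema for the transformed dataset: field name -> type.
SCHEMA = {"policy_id": str, "amount": float}


def check_schema(rows, schema):
    """Return fault messages for rows that do not match the schema."""
    faults = []
    for i, row in enumerate(rows):
        if set(row) != set(schema):
            faults.append(
                f"row {i}: expected fields {sorted(schema)}, got {sorted(row)}"
            )
            continue
        for field, expected_type in schema.items():
            if not isinstance(row[field], expected_type):
                faults.append(
                    f"row {i}: field {field!r} is not {expected_type.__name__}"
                )
    return faults


def check_no_rows_lost(input_rows, output_rows):
    """Return a fault message if the output has fewer rows than the input."""
    lost = len(input_rows) - len(output_rows)
    return [f"lost {lost} row(s)"] if lost > 0 else []


def transform(rows):
    """Hypothetical step: parse amounts, silently dropping rows that fail."""
    out = []
    for row in rows:
        try:
            out.append({"policy_id": row["policy_id"], "amount": float(row["amount"])})
        except ValueError:
            pass  # The silent drop is the fault the row-count check should catch.
    return out


source = [
    {"policy_id": "P1", "amount": "250.0"},
    {"policy_id": "P2", "amount": "oops"},
]
transformed = transform(source)
faults = check_schema(transformed, SCHEMA) + check_no_rows_lost(source, transformed)
```

Here the transformed dataset passes the schema check, but the row-count check reports the dropped record. Because both checks are tied to this specific schema and this notion of data loss, they illustrate why such tests tend to be difficult to reuse across deployments.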
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.