The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.
A data pipeline system can include a plurality of datasets that are dependent on one another. Raw datasets represent data drawn from a data source, such as a file system, database, or other data source. A complex dataset may be built by a data processing job from one or more input datasets on which the complex dataset depends. A complex dataset may therefore be built from a combination of raw datasets and other complex datasets, and the overall data pipeline system may include a graph of dependencies among raw datasets and complex datasets. Traditional techniques for rebuilding complex datasets include waiting for all raw datasets to be updated before building, or setting a cutoff time for rebuilding all complex datasets. However, such techniques can be time-intensive and resource-intensive. What is needed is a technique for dynamically building complex datasets as soon as possible, to improve system resource usage and efficiency.
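For illustration only, the graph of dependencies described above can be sketched as a mapping from each dataset to the input datasets it is built from. The dataset names and the helper functions below are hypothetical and are not drawn from any particular pipeline system; the sketch uses the Python standard library's topological sorter to derive one order in which every dataset is built after its inputs.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each key is a dataset, and each value is the
# set of input datasets it is built from. Raw datasets have no inputs.
DEPENDENCIES = {
    "raw_orders": set(),
    "raw_customers": set(),
    "orders_by_customer": {"raw_orders", "raw_customers"},
    "daily_report": {"orders_by_customer"},
}


def is_raw(dataset, dependencies):
    """A dataset with no input datasets is treated as raw in this sketch."""
    return not dependencies[dataset]


def build_order(dependencies):
    """Return one valid order in which every dataset follows all of its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = build_order(DEPENDENCIES)
complex_datasets = [d for d in order if not is_raw(d, DEPENDENCIES)]
```

In this sketch, `order` lists both raw datasets before `orders_by_customer`, which in turn precedes `daily_report`. A complex dataset that depends on another complex dataset is handled the same way as one that depends only on raw datasets.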
More specifically, distributed data processing systems are now available that implement data pipelines capable of executing serial or serial-parallel transformations on data tables. In an example pipeline, one or more raw datasets are used to build one or more derived datasets, according to one or more transformations. Source code development languages are available for expressing table schemas, transformations, and other functional operations on rows or columns in the form of natural language code that can be transformed and committed in an executable form, such as a SQL query.
Usually, a sizable data pipeline requires rebuilding the derived datasets at least once per day, to ensure that the derived datasets accurately reflect updates to the raw datasets and any changes in the transformations. When the number and size of the datasets are large, an unreasonable amount of time may be required to complete a total build operation for all the derived datasets, using computer systems of average processing power. Moreover, updated copies of the raw datasets may arrive asynchronously, at various times during the day. Some raw datasets could arrive just before a scheduled cutoff time at which applications, client processes, and the like need to access the derived datasets. The scale of a particular pipeline may not allow executing a complete build operation, to create the derived datasets, within a short time. For example, if the latest-arriving raw dataset is received just one hour before the cutoff time for client access to derived datasets, then there may be insufficient time to perform a full build of the derived datasets from the raw datasets.
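The timing problem in the example above reduces to simple arithmetic, sketched below. The cutoff time, the arrival time, and in particular the three-hour full-build duration are hypothetical values chosen only to illustrate the case in which a full build started after the last raw dataset arrives cannot finish before the cutoff.

```python
from datetime import datetime, timedelta


def full_build_finishes_in_time(last_arrival, full_build_duration, cutoff):
    """Return True if a full build started at last_arrival ends by the cutoff."""
    return last_arrival + full_build_duration <= cutoff


# Hypothetical values: clients need the derived datasets at 06:00, and the
# latest raw dataset arrives one hour earlier.
cutoff = datetime(2024, 1, 1, 6, 0)
last_arrival = cutoff - timedelta(hours=1)

# Assumed duration of a full build of all derived datasets.
full_build_duration = timedelta(hours=3)

feasible = full_build_finishes_in_time(last_arrival, full_build_duration, cutoff)
shortfall = (last_arrival + full_build_duration) - cutoff
```

Under these assumed values, `feasible` is `False` and `shortfall` is two hours, which is the amount by which a full build that waits for the last raw dataset would overrun the cutoff.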
What is needed is an improved way to build all the derived datasets, so that all build operations needed to create all derived datasets are assured to complete before the cutoff time.