Large-scale data processing may include parallel processing, which generally involves performing some operation over each element of a large data set simultaneously. The various operations may be chained together in a data-parallel pipeline to create an efficient mechanism for processing a data set. Production of the data set may involve “batch jobs” that are run periodically over a set of large, evolving inputs. As the inputs are updated, the previous output becomes more and more stale, so the pipeline is re-run on a regular basis.