Processing large amounts of data is a common practice for small and large enterprises alike. There are multiple ways to process “big data,” including writing a single monolithic program that takes input from one or more sources, performs a series of operations on the input and on intermediate results, and generates a final result. However, such an approach may take a long time to implement, especially if the operations are many and complex.
Another approach involves implementing a workflow that comprises multiple computer jobs. A computer job (or simply “job”) is a unit of work. A component of a job is called a task or step. Each job corresponds to a different set of one or more operations that is required to finally generate a valid output at the end of the workflow.
An advantage of this approach is that each job may be implemented by a different person or different group of people. Each job needs to be able to read input data and write output data in a format that a consumer of the output data (e.g., another job in the workflow) is expecting.
One approach for implementing a workflow is for a job scheduler to start each job in the workflow only when the input for that job is assured of being ready to read and process. The job scheduler detects when a job completes (for example, the job notifies the job scheduler that it has finished) and, in response, initiates the next job in the workflow. This ensures that each job reads complete and consistent data. Otherwise, a subsequent job in the workflow is likely to crash or fail altogether or, worse, produce incomplete or inconsistent output. Such output is difficult to trace when debugging the workflow, and by the time a problem in the data is identified (if it is identified at all), much damage may already have been done.
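The sequential scheduling described above can be illustrated with a minimal sketch. It assumes that each job is an ordinary function operating on a shared in-memory store and that a function returning stands in for the "finished" notification. The names run_workflow, job_a, job_b, job_c, and store are hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: a job is modeled as a function that reads its input from,
# and writes its output to, a shared store.
Job = Callable[[Dict[str, object]], None]


def run_workflow(jobs: List[Tuple[str, Job]], store: Dict[str, object]) -> List[str]:
    """Start each job only after the previous job has completed."""
    completed: List[str] = []
    for name, job in jobs:
        job(store)              # blocks until the job finishes
        completed.append(name)  # the return acts as the completion notification
    return completed


def job_a(store: Dict[str, object]) -> None:
    store["a_out"] = [1, 2, 3]


def job_b(store: Dict[str, object]) -> None:
    # Safe to read a_out: job A is guaranteed to have finished writing it.
    store["b_out"] = [x * 2 for x in store["a_out"]]


def job_c(store: Dict[str, object]) -> None:
    store["c_out"] = sum(store["b_out"])


store: Dict[str, object] = {}
order = run_workflow([("A", job_a), ("B", job_b), ("C", job_c)], store)
print(order, store["c_out"])
```

Because no job starts before its predecessor returns, job B always sees the complete output of job A, at the cost of running one job at a time.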
Thus, current approaches to implementing workflows rely on a sequential paradigm in which the last job in a prior execution of a workflow must complete before the first job in the current execution begins. Another approach is to run multiple executions of a workflow in parallel. However, this has disadvantages that are not easily handled. For example, if job B in a first execution of a workflow takes a relatively long time to execute, then job A (which precedes job B in the workflow) in a second execution of the workflow might write partial data that job B is not expecting but ends up reading, which may cause job B to crash or (in potentially worse scenarios) produce invalid output for job C.
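The hazard of overlapping executions can be reproduced deterministically with a small simulation. The sketch below assumes that both executions write job A's output to the same shared location and that job A writes its records one at a time. A generator stands in for job A so that the interleaving can be controlled by hand; the names shared, job_a_steps, and job_b_read are hypothetical.

```python
from typing import Dict, Iterator, List

# Assumption: both executions of the workflow share one output location for job A.
shared: Dict[str, List[int]] = {}


def job_a_steps(run_id: int) -> Iterator[None]:
    """Job A as a generator; each yield marks a point where another job may run."""
    shared["a_out"] = []                     # begin overwriting the shared location
    yield
    shared["a_out"].append(run_id * 10 + 1)  # first record
    yield
    shared["a_out"].append(run_id * 10 + 2)  # second (final) record


def job_b_read() -> List[int]:
    """Job B takes a snapshot of whatever job A's output currently contains."""
    return list(shared["a_out"])


# Execution 1: job A runs to completion and leaves [11, 12].
for _ in job_a_steps(1):
    pass

# Execution 1's job B is slow and has not read its input yet.
# Execution 2: job A starts and gets only partway through its writes.
a2 = job_a_steps(2)
next(a2)  # clears the shared output
next(a2)  # writes only the first record of execution 2

# Execution 1's job B finally reads and sees partial data from execution 2.
seen_by_b1 = job_b_read()

# Execution 2's job A then finishes.
for _ in a2:
    pass

print(seen_by_b1, shared["a_out"])
```

In this interleaving, job B of the first execution reads neither the complete output of its own execution nor the complete output of the second one, which is the kind of partial input that can lead to a crash or to invalid output for job C.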
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.