Numerous computer software packages and techniques exist for executing tasks on computers. Throughout modern history, a computer scientist would translate a problem into machine-readable code (e.g., a programming language or a mathematical equation) and feed it to a single computer for execution. More recently, computers have been used in a distributed fashion. For example, a program may divide up a single task into several separate tasks and execute each task on a separate computer or “node.” This can be more efficient when the individual tasks are independent of one another, and determining the result of the overall job requires only a simple combination of the results of each task.
For example, if a job is to “calculate the fastest route from zip code 07046 to zip code 22204,” and the datasets for the computation include 30 days' worth of tracked trips (500,000 trips across 300,000 separate cars) between those two zip codes, the data may be easy to divide. One approach may be to divide the datasets into five parts, have each of five nodes compute the fastest route within its own part, and then compare the five resulting routes to determine the overall fastest route.
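The divide-and-combine approach described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the `Trip` record, the function names, and the idea that the "fastest route" is the route of the single quickest tracked trip are all hypothetical, since the text does not define how a route's speed is measured. Threads stand in for the separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, NamedTuple


class Trip(NamedTuple):
    """Hypothetical record for one tracked trip between the two zip codes."""
    route_id: str
    minutes: float


def split_into_parts(trips: List[Trip], parts: int) -> List[List[Trip]]:
    """Divide the dataset into `parts` slices (round-robin, order not important)."""
    return [trips[i::parts] for i in range(parts)]


def fastest_in_part(part: List[Trip]) -> Trip:
    """Work done by one 'node': find the quickest trip in its own slice."""
    return min(part, key=lambda t: t.minutes)


def fastest_route(trips: List[Trip], parts: int = 5) -> Trip:
    """Compute a per-part winner in parallel, then compare the winners."""
    if not trips:
        raise ValueError("no trips to evaluate")
    # Drop empty slices, which occur when there are fewer trips than parts.
    chunks = [c for c in split_into_parts(trips, parts) if c]
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        local_best = list(pool.map(fastest_in_part, chunks))
    # The simple combination step: the best of the per-part results.
    return min(local_best, key=lambda t: t.minutes)
```

The combination step is cheap because each part's result is independent of the others, which is the property that makes this kind of job easy to distribute.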
But when tasks are dependent upon one another, or dependent upon particular datasets, completing the job can become complicated and may lead to inefficiencies if processed in a straightforward manner. For example, problems may arise when the fastest route must be calculated on a daily basis. Imagine that a first user requests the execution of the “calculate the fastest route for the past 30 days” job on a first day, and a second user requests that the job be performed again five days later. Other than five days' worth of data, the data associated with the first run of the job will be the same as the data used in the second run of the job. In processing the second run, it would be extremely inefficient to recalculate the fastest route for each of the 25 days that are used in both runs of the job. Such redundant recalculation causes slowdowns and increased node utilization, which means that new jobs cannot be processed in a timely manner.
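The overlap between the two runs can be checked with a few lines of arithmetic. The sketch below assumes each run covers the 30 calendar days ending on the day of the request; the specific dates and the function names are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta
from typing import Set, Tuple


def window_days(end_day: date, days: int = 30) -> Set[date]:
    """Return the set of calendar days covered by a run ending on `end_day`."""
    return {end_day - timedelta(days=i) for i in range(days)}


def overlap(first_end: date, second_end: date, days: int = 30) -> Tuple[int, int]:
    """Return (days shared by both runs, days that appear only in the second run)."""
    first = window_days(first_end, days)
    second = window_days(second_end, days)
    return len(first & second), len(second - first)


# Hypothetical request dates, five days apart.
first_run = date(2024, 1, 30)
second_run = first_run + timedelta(days=5)

shared, new = overlap(first_run, second_run)
print(f"shared days: {shared}, new days: {new}")
print(f"fraction of the second run that repeats earlier work: {shared / 30:.0%}")
```

Under these assumptions, 25 of the 30 days in the second run were already processed in the first, so a straightforward re-execution repeats most of the earlier work.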
These problems are amplified as the amount of information being processed and stored increases. Indeed, with rapid developments in technology, the amount of available information has expanded at an explosive pace. At the same time, the demand for timely information derived from this massive amount of data has increased at a similar pace. Thus, as the ability to generate, collect, and store data continues to grow, it becomes increasingly important to improve processing efficiencies to better take advantage of the higher processing speeds brought on by the “Big Data” era.
The disclosed embodiments address these and other problems with the prior art.