1. Technical Field
The present invention relates to data distribution, and more particularly to a compiler-guided software accelerator for iterative HADOOP® jobs.
2. Description of the Related Art
HADOOP® is the most commonly used open-source framework for MapReduce and for processing large amounts of data. The data transfer and synchronization overheads of intermediate data in iterative HADOOP® applications are problematic. The problem arises because distributed file systems, such as the HADOOP® Distributed File System (HDFS), perform poorly for small, short-lived data files.
HADOOP® launches a new job in every iteration, and each job executes the same code and re-reads the same invariant input data. Launching and scheduling new jobs is expensive.
A job is a collection of map and reduce tasks. Not all reduce tasks in a job finish at the same time. In an iterative HADOOP® workflow, the next iteration is launched only upon completion of the current job, that is, after its slowest reduce task has finished. This prevents asynchrony and parallelism across iterations.
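The cost of this per-iteration barrier can be illustrated with a minimal sketch. It is a toy model, not HADOOP® code: the function name, the task durations, and the fixed launch overhead are all hypothetical, and the map phase is ignored. It assumes each iteration is a separate job that completes only when its slowest reduce task finishes.

```python
def simulate_barrier_iterations(iterations, launch_overhead):
    """Toy model of an iterative workflow with one job per iteration.

    iterations: list of lists; each inner list holds the (hypothetical)
        durations of the reduce tasks of one job.
    launch_overhead: assumed fixed cost of launching and scheduling a job.

    Returns (total_time, idle_time), where idle_time is the summed time
    that finished reduce tasks wait for the slowest task of their job.
    """
    total_time = 0
    idle_time = 0
    for durations in iterations:
        # A new job is launched in every iteration.
        total_time += launch_overhead
        # The job completes only when its slowest reduce task finishes.
        job_duration = max(durations)
        total_time += job_duration
        # Faster tasks cannot start the next iteration early.
        idle_time += sum(job_duration - d for d in durations)
    return total_time, idle_time


# Two iterations with three reduce tasks each (illustrative numbers only).
iterations = [[4, 6, 10], [5, 5, 8]]
total_time, idle_time = simulate_barrier_iterations(iterations, launch_overhead=2)
print(total_time, idle_time)
```

With these illustrative inputs, 4 of the 22 time units are spent launching jobs, and the faster reduce tasks accumulate 16 task-time units of waiting that could otherwise overlap with the next iteration.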
No existing solution addresses the aforementioned problems while working within the HADOOP® ecosystem without any changes to the software stack and without additional developer effort.