Technical Field
The present invention relates to information processing, and more particularly to fine-grain synchronization in data-parallel jobs.
Description of the Related Art
Big-data processing involves multiple parallel workers, or multiple workers and a single master. Existing worker synchronization is typically performed using a barrier primitive or via a file system (using stage-wise synchronization).
Barrier synchronization is an important and widely used operation for synchronizing parallel systems. Upon encountering a barrier operation, a process waits until all processes in the system have reached the barrier. The barrier operation is the most commonly used synchronization primitive in data-parallel systems.
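By way of illustration only, the barrier behavior described above can be sketched in Python using the standard library's threading.Barrier. This is a minimal sketch under stated assumptions: threads stand in for distributed workers, and the names run_workers, arrived, and seen_after_barrier are hypothetical and not part of any described system.

```python
import threading


def run_workers(num_workers):
    """Run num_workers threads that all meet at one barrier.

    Returns, for each worker, how many workers had arrived by the time
    that worker was released from the barrier.
    """
    barrier = threading.Barrier(num_workers)
    lock = threading.Lock()
    arrived = []             # ids of workers that have reached the barrier
    seen_after_barrier = []  # arrival counts observed after release

    def worker(worker_id):
        # Phase 1: local work (elided), then announce arrival.
        with lock:
            arrived.append(worker_id)
        # Block until every worker has called wait().
        barrier.wait()
        # Phase 2: by now all workers must have arrived.
        with lock:
            seen_after_barrier.append(len(arrived))

    threads = [
        threading.Thread(target=worker, args=(i,)) for i in range(num_workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen_after_barrier


if __name__ == "__main__":
    print(run_workers(4))
```

Because no worker is released until every worker has arrived, each worker observes the full arrival count after the barrier. The same property means the slowest worker determines when all others may proceed.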
However, this style of synchronization suffers from several problems. First, barrier primitives are slow, and removing such a primitive (i.e., running asynchronously) breaks correctness semantics. Second, most barrier implementations synchronize all processes and may be slow when only a subset of workers needs to synchronize. Third, using a barrier with a bulk-synchronous processing paradigm suffers from mixed-version issues. For example, in the absence of receiver-side synchronization, there may be torn reads and overwrites. This is because a barrier gives no indication of whether the recipient has seen or processed the gradient, so additional expensive synchronization may be required. Finally, using a barrier also causes spikes in network resource usage, since all workers send their intermediate values at the same time. Thus, there is a need for improved synchronization in data-parallel jobs.