The present invention relates to parallel learning.
Distributed systems aim to make many machines work together harmoniously as though they were one. As the volume of collected data has grown, the need to distribute computation efficiently has greatly increased, and distributed computation is now ubiquitous. For some problems, many existing distributed systems can scale out computation efficiently, but for many other problems significant roadblocks prevent efficient distribution.
Many platforms implement distributed machine learning. One approach, shown in FIG. 1, is asynchronous mini-batching, in which all models train asynchronously without any guarantees. In asynchronous mini-batching, parallel models exchange updates after processing a fixed number of iterations (a fixed communication batch size). Because different models process at different speeds, the models may diverge from one another. Furthermore, if the developer chooses a naïve batch size, the computation may be imbalanced between CPU and network usage. Another approach, bounded-staleness learning, is shown in FIG. 2. The bounded-staleness approach to mini-batching also trains asynchronously. In addition, each node sends its iteration count along with its updates. If a node notices that it is too far ahead of another node, it slows down and waits for the slower node to catch up. In this approach, as in mini-batching, the communication batch size is fixed.
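By way of illustration only, the bounded-staleness rule described above may be sketched as a small deterministic simulation. The names may_proceed, simulate, max_staleness, and periods are hypothetical and do not appear in any particular platform. The sketch assumes that each worker finishes one iteration every fixed number of ticks and that a worker stalls whenever advancing would put it more than max_staleness iterations ahead of the slowest peer.

```python
def may_proceed(my_iter, peer_iters, max_staleness):
    """Return True if advancing keeps this worker within the staleness bound.

    Assumption: the bound is measured against the slowest peer's
    reported iteration count.
    """
    return my_iter - min(peer_iters) < max_staleness


def simulate(periods, max_staleness, ticks):
    """Simulate workers of different speeds under a staleness bound.

    periods[w] is the (hypothetical) number of ticks worker w needs
    per iteration. Returns the final iteration count of each worker.
    """
    iters = [0] * len(periods)
    for tick in range(1, ticks + 1):
        # Each worker sees the counts reported at the start of the tick.
        snapshot = list(iters)
        for w, period in enumerate(periods):
            if tick % period != 0:
                continue  # worker w is still busy with its current iteration
            peers = snapshot[:w] + snapshot[w + 1:]
            if may_proceed(snapshot[w], peers, max_staleness):
                iters[w] += 1  # otherwise the worker waits for slower peers
    return iters


if __name__ == "__main__":
    # A fast worker (1 tick per iteration) and a slow one (3 ticks per iteration).
    print(simulate([1, 3], max_staleness=2, ticks=12))    # bounded
    print(simulate([1, 3], max_staleness=100, ticks=12))  # effectively unbounded
```

With a bound of 2, the fast worker in this sketch never runs more than two iterations ahead of the slow one. With a very large bound, the behavior reduces to plain asynchronous mini-batching and the iteration counts drift apart.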