The present invention relates to distributed lockless training.
A machine learning model or unit is a set of weights (or parameters) over features. Applying new input data to the machine learning model/unit gives a prediction output for a classification or regression problem. Distributed machine learning involves training a machine-learning unit in parallel. This consists of a cluster of machines, each training one or more unit replicas in parallel, with the data split across the replicas. Each replica trains on a subset of the data and incrementally updates the machine-learning unit every iteration. In order to ensure that each unit is created from all of the data, each replica communicates its unit parameter values to the others. The replicas merge the incoming unit and continue training over their local data.
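The train-and-merge cycle described above can be sketched as follows. This is a minimal illustration, not the invention's method: it assumes a linear unit held as a plain Python list, a simple gradient step, and parameter averaging as the merge rule.

```python
# Illustrative sketch of two replicas, each training on its own data
# subset and then merging the unit received from the other replica.
# The gradient step and averaging merge are assumptions for illustration.

def train_step(weights, example, label, lr=0.1):
    """One incremental update of a linear unit on a single example."""
    prediction = sum(w * x for w, x in zip(weights, example))
    error = prediction - label
    return [w - lr * error * x for w, x in zip(weights, example)]

def merge(local_weights, incoming_weights):
    """Merge a unit received from another replica by averaging parameters."""
    return [(a + b) / 2 for a, b in zip(local_weights, incoming_weights)]

# The data is split across two replicas.
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0),
        ([1.0, 1.0], 0.0), ([2.0, 0.0], 2.0)]
replica_a, replica_b = [0.0, 0.0], [0.0, 0.0]
for example, label in data[:2]:
    replica_a = train_step(replica_a, example, label)
for example, label in data[2:]:
    replica_b = train_step(replica_b, example, label)

# Each replica communicates its parameters and merges the incoming unit.
replica_a, replica_b = merge(replica_a, replica_b), merge(replica_b, replica_a)
```

After the exchange both replicas hold the same merged unit, built from all of the data even though each trained on only half of it.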
A number of platforms have implemented distributed machine learning. For example, the Map-Reduce/Hadoop platform communicates unit updates using the file system. Hadoop uses the map-reduce paradigm: in the map step, each replica creates a trained unit; in the reduce step, the parallel replicas pick up the units from the file system and apply them to their own unit. Since Hadoop communicates using the file system, training speed is limited by disk performance. Another platform is Spark, an in-memory data processing platform that stores data as distributed, immutable objects. Each worker trains on a set of data and updates the unit. Both Spark and Hadoop are based on the map-reduce paradigm and perform bulk-synchronous creation of a machine learning unit, because both are deterministic and have explicit training, update, and merge steps. Synchronous unit training is slow, and with a large number of workers it can be too slow to be practical.
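The bulk-synchronous pattern described above can be sketched as follows. This is an illustrative assumption of the map-reduce style, not the actual Hadoop or Spark implementation; real platforms exchange units through the file system (Hadoop) or in-memory distributed objects (Spark), and the averaging merge is likewise an assumption.

```python
# Illustrative sketch of bulk-synchronous unit training in the
# map-reduce style. Every round has explicit map (train) and reduce
# (merge) steps, with an implicit barrier between them.

def train(unit, partition, lr=0.1):
    """Train a copy of the unit on one data partition (the map task)."""
    for example, label in partition:
        error = sum(w * x for w, x in zip(unit, example)) - label
        unit = [w - lr * error * x for w, x in zip(unit, example)]
    return unit

def map_step(partitions, unit):
    # Every replica trains its own copy of the unit on its partition.
    return [train(list(unit), part) for part in partitions]

def reduce_step(trained_units):
    # The merged unit exists only after every map task has finished;
    # this barrier is what makes each round synchronous, and why a
    # straggling worker slows down the whole cluster.
    n = len(trained_units)
    return [sum(ws) / n for ws in zip(*trained_units)]

partitions = [[([1.0, 0.0], 1.0)], [([0.0, 1.0], -1.0)]]
unit = [0.0, 0.0]
for _ in range(3):  # three synchronous rounds
    unit = reduce_step(map_step(partitions, unit))
```

Each pass through the loop is one bulk-synchronous round: no replica can begin the next round until the slowest replica has finished the current one.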
In a third paradigm, a dedicated parameter server collects all unit updates and sends the updated unit out to all network nodes. In these systems, all parallel replicas send their unit updates to a single server and receive the updated unit in return. Hence, the parameter server receives the units, merges them to create a new unit, and sends it to all replicas. While such a system can train in an asynchronous fashion, it is not fully asynchronous, since it requires the workers to wait for an updated unit to arrive from the parameter server.
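The parameter-server pattern, and the wait it imposes on workers, can be sketched as follows. The class and method names here are assumptions for illustration only, and the additive merge stands in for whatever update rule a real parameter server applies.

```python
# Illustrative sketch of the parameter-server pattern: concurrent
# workers push updates to one central server and block until the
# merged unit comes back. Names are illustrative assumptions.
import threading

class ParameterServer:
    def __init__(self, unit):
        self.unit = unit
        self.lock = threading.Lock()

    def push_and_pull(self, update):
        # Apply a worker's update and return the current merged unit.
        # The caller waits on this round trip, so training is not
        # fully asynchronous even though workers run concurrently.
        with self.lock:
            self.unit = [w + u for w, u in zip(self.unit, update)]
            return list(self.unit)

def worker(server, updates):
    for update in updates:
        unit = server.push_and_pull(update)  # blocks until server replies

server = ParameterServer([0.0, 0.0])
threads = [threading.Thread(target=worker, args=(server, [[0.1, 0.0]])),
           threading.Thread(target=worker, args=(server, [[0.0, 0.2]]))]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The single lock inside `push_and_pull` is the bottleneck this paradigm introduces: every update from every worker serializes through one server.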