Although distributed machine learning (ML) algorithms have been extensively studied, scaling to a large number of machines can still be challenging. Most fast-converging single-machine algorithms update model parameters at a very high rate, which makes them hard to distribute without compromises. As one example, the single-machine stochastic gradient descent (SGD) technique updates model parameters after processing each training example. As another example, the coordinate descent (CD) technique updates the model parameters after processing a single feature.
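To make the update rates concrete, the following is a minimal sketch of one pass of each technique, assuming a plain least-squares objective on dense Python lists. The function names `sgd_epoch` and `cd_epoch`, the learning rate `lr`, and the choice of objective are illustrative assumptions, not taken from any particular system.

```python
def dot(w, x):
    """Inner product of two equal-length lists."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x))


def sgd_epoch(X, y, w, lr):
    """One SGD pass (least squares, assumed): w changes after every example."""
    w = list(w)
    updates = 0
    for x_i, y_i in zip(X, y):
        err = dot(w, x_i) - y_i
        # Gradient step on this single example only.
        w = [w_j - lr * err * x_ij for w_j, x_ij in zip(w, x_i)]
        updates += 1
    return w, updates


def cd_epoch(X, y, w):
    """One CD pass (least squares, assumed): w[j] changes after every feature."""
    w = list(w)
    n = len(X)
    # Residuals r_i = y_i - <w, x_i>, kept in sync after each coordinate step.
    r = [y_i - dot(w, x_i) for x_i, y_i in zip(X, y)]
    updates = 0
    for j in range(len(w)):
        col = [X[i][j] for i in range(n)]
        denom = sum(c * c for c in col)
        if denom == 0:
            continue  # feature never occurs; nothing to update
        # Exact minimizer of the squared error along coordinate j.
        delta = dot(col, r) / denom
        w[j] += delta
        r = [r_i - delta * c for r_i, c in zip(r, col)]
        updates += 1
    return w, updates
```

In this sketch, a pass over n examples with p features performs n sequential parameter updates for SGD and up to p for CD, and each update reads the parameters written by the previous one. That sequential dependency is what makes both methods hard to distribute unchanged.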
Common approaches to distributing SGD or CD break the basic flow of the single-machine algorithm by letting updates occur with some delay or by batching. However, this changes the convergence behavior of the algorithm, making it sensitive to the number of machines as well as to the computing environment. As a result, scaling can become non-linear and the benefit from adding more machines can tail off early.
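The effect of batching can be illustrated with a small single-process simulation. This is only a sketch under stated assumptions: a least-squares objective, and a hypothetical setup in which each of `num_machines` workers computes a gradient for one example against the same shared parameters, after which the averaged gradient is applied in one step. The function name and parameters are illustrative.

```python
def batched_sgd_epoch(X, y, w, lr, num_machines):
    """One pass of simulated synchronous batched SGD (least squares, assumed)."""
    w = list(w)
    updates = 0
    for start in range(0, len(X), num_machines):
        xs = X[start:start + num_machines]
        ys = y[start:start + num_machines]
        grad = [0.0] * len(w)
        # Every simulated worker sees the same w, so no worker benefits
        # from what the others learn within this batch.
        for x_i, y_i in zip(xs, ys):
            err = sum(w_j * x_ij for w_j, x_ij in zip(w, x_i)) - y_i
            for j, x_ij in enumerate(x_i):
                grad[j] += err * x_ij / len(xs)
        w = [w_j - lr * g_j for w_j, g_j in zip(w, grad)]
        updates += 1
    return w, updates
```

With `num_machines=1` this reduces to per-example SGD. Larger values yield fewer updates per pass and a different parameter trajectory on the same data, so the result of a pass depends on the machine count.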
Because of these scaling problems, some authors have argued that it is better to scale ML algorithms using just a few ‘fat’ servers with lots of memory, networking cards, and GPUs. While this may be an appealing approach for some problems, it has obvious scaling limitations in terms of I/O bandwidth. Generally speaking, it is also more expensive than scaling out using low-cost commodity servers. GPUs in particular are not always a cost-effective solution for sparse datasets.