Machine learning combines techniques from statistics and artificial intelligence to create algorithms that can learn from empirical data and generalize to solve problems in various domains such as natural language processing, financial fraud detection, terrorism threat level detection, human health diagnosis and the like. In recent years, more and more raw data that can potentially be utilized for machine learning models is being collected from a large variety of sources, such as sensors of various kinds, web server logs, social media services, financial transaction records, security cameras, and the like.
At least for some types of problems, the process of developing a predictive machine learning model often includes a training phase, during which a set of collected observation records called a training data set is analyzed to identify relationships between some set of input variables and one or more output variables for which predictions are to be made using the model. The training data set may comprise millions or even billions of records, and may take up terabytes or even petabytes of storage in some cases, e.g., for “deep learning” problems. In some training techniques such as those involving the use of stochastic gradient descent (SGD) or similar optimization algorithms, the training phase may often involve several passes through the training data set, e.g., until the algorithm converges on an optimization goal such as an acceptably low value of a cost function or an error function.
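The multi-pass training process described above can be illustrated with a minimal sketch of stochastic gradient descent. The linear model, learning rate, and convergence threshold below are illustrative assumptions chosen for the example, not details taken from the text; real training data sets would of course be far larger.

```python
import numpy as np

# A minimal sketch of SGD training, illustrating repeated passes
# ("epochs") over a training data set until the cost function stops
# improving. The model (y = w*x + b), learning rate, and convergence
# threshold are illustrative assumptions.

rng = np.random.default_rng(0)

# Synthetic training data set: y = 3*x + 2 plus noise.
X = rng.uniform(-1, 1, size=(1000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=1000)

w, b = 0.0, 0.0   # model parameters (input variable -> output variable)
lr = 0.05         # learning rate
tol = 1e-4        # convergence threshold on the cost function

def cost(w, b):
    # Mean squared error over the full training data set.
    return np.mean((w * X[:, 0] + b - y) ** 2)

prev = cost(w, b)
for epoch in range(100):              # each iteration is one full pass
    for i in rng.permutation(len(X)):
        err = w * X[i, 0] + b - y[i]
        w -= lr * err * X[i, 0]       # per-record gradient update
        b -= lr * err
    cur = cost(w, b)
    if abs(prev - cur) < tol:         # stop when the cost stops improving
        break
    prev = cur
```

Note that the number of passes needed before convergence depends on the data, the learning rate, and the convergence threshold, which is one reason the training phase duration is hard to predict in advance.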
Analyzing extremely large training data sets on a single machine may lead to unacceptably long training phase durations. For some training techniques, it may be possible to partition the training data set among several machines. Such parallelization approaches may require model parameter updates to be synchronized among the participating machines, however. Depending on how much data has to be transferred among the set of machines, in some scenarios the benefits of analyzing the training data in parallel may be offset by the introduction of bottlenecks in the network used for the synchronization-related data transfers. Determining the optimum number of machines to use in parallel for training various types of models for various sizes of training data sets may thus present non-trivial challenges even for experienced machine learning experts.
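The parallelization approach described above can be sketched in simplified form, assuming synchronous gradient averaging among workers. In a real deployment the gradient exchange in the synchronization step would travel over the network between machines and could become the bottleneck mentioned above; here the "workers" are simply loops over partitions of an in-memory data set, and all names are illustrative.

```python
import numpy as np

# A simplified sketch of data-parallel training with synchronized
# parameter updates. Each worker computes a gradient on its partition
# of the training data set; the gradients are then averaged (the
# synchronization step) before the shared parameters are updated.

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(4000, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.1, size=4000)

n_workers = 4
# Partition the training data set among the workers.
parts = np.array_split(np.arange(len(X)), n_workers)

theta = np.zeros(2)   # shared model parameters [w, b]
lr = 0.5

def worker_gradient(theta, idx):
    # Gradient of mean squared error over one worker's partition.
    w, b = theta
    err = w * X[idx, 0] + b - y[idx]
    return np.array([np.mean(err * X[idx, 0]), np.mean(err)])

for step in range(200):
    # Synchronization point: in a distributed setting, this averaging
    # requires transferring each worker's gradient over the network.
    grads = [worker_gradient(theta, idx) for idx in parts]
    theta -= lr * np.mean(grads, axis=0)
```

The size of the per-step transfer grows with the number of model parameters, while the per-worker compute shrinks as workers are added, which is why adding machines beyond some point can yield diminishing or even negative returns.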
While embodiments are described herein by way of example for several embodiments and illustrative drawings, those skilled in the art will recognize that embodiments are not limited to the embodiments or drawings described. It should be understood that the drawings and detailed description thereto are not intended to limit embodiments to the particular form disclosed, but on the contrary, the intention is to cover all modifications, equivalents and alternatives falling within the spirit and scope as defined by the appended claims. The headings used herein are for organizational purposes only and are not meant to be used to limit the scope of the description or the claims. As used throughout this application, the word “may” is used in a permissive sense (i.e., meaning having the potential to), rather than the mandatory sense (i.e., meaning must). Similarly, the words “include,” “including,” and “includes” mean including, but not limited to.