Machine learning and data mining are both methods of finding rules or patterns in data and are used in various applications such as information recommendation, face authentication, voice recognition, and document classification. Various methods have been proposed for such machine learning and data mining. Many of the proposed methods design a model that describes the data, define a function (e.g., the log-likelihood) representing how well the model describes the data, and optimize (maximize, in a case where the function used is the log-likelihood) the model parameters of that function, whereby learning is performed.
For example, the steepest descent method, the stochastic gradient descent method, the EM (Expectation Maximization) algorithm, and the like are used for the above maximization. The greater the number of pieces of data to be learned, the longer the time required for the optimization, so that parallel and distributed processing is desirably applied to large-scale data learning.
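As a minimal concrete example of this optimization step, consider a 1-D Gaussian model with unit variance and maximize its log-likelihood in the mean parameter by steepest ascent. This is an illustrative sketch, not a method from the literature; the model, step size, and iteration count are assumptions:

```python
import numpy as np

# Toy steepest-ascent sketch (assumed model): maximize the Gaussian
# log-likelihood sum_i log N(x_i; mu, 1) in the mean parameter mu.
# Its gradient with respect to mu is sum_i (x_i - mu), so ascending the
# gradient drives mu toward the sample mean, the maximum-likelihood
# estimate.

def fit_mean(data, lr=0.1, n_iter=200):
    mu = 0.0
    for _ in range(n_iter):
        grad = np.mean(data - mu)    # (scaled) gradient of the log-likelihood
        mu += lr * grad              # maximization: ascend the gradient
    return mu

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=500)
mu_hat = fit_mean(data)
print(mu_hat)                        # approaches data.mean()
```

With more data, each gradient evaluation becomes more expensive, which is the motivation for the parallel and distributed variants discussed next.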
As methods that perform the optimization in a parallel and distributed fashion, the DGD (Distributed Gradient Descent) method and the IPM (Iterative Parameter Mixtures) method have been proposed (see Non Patent Literatures 1 and 2 listed below). The DGD method performs the steepest descent method in a parallel and distributed manner. It partitions the data into N pieces, computes the gradient of each sub data set in parallel, sums the computed gradients to obtain an overall gradient, and updates the model based on that gradient. The DGD method is an iterative algorithm and, therefore, the above processing is repeated until convergence is reached. The IPM method partitions the data into N pieces and applies the stochastic gradient descent method to each partition in parallel. As a result, N different models are calculated and then averaged. The IPM method is also an iterative algorithm and, therefore, the above processing is repeated until convergence is reached.
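The DGD update described above can be sketched on a toy problem. Assume a linear least-squares model on synthetic data; the partition count, learning rate, and iteration budget are illustrative choices, and the per-partition terms would run in parallel in a real deployment:

```python
import numpy as np

# Toy DGD sketch (assumed setting): linear least-squares model y ~ X w.
# Each of the N partitions computes its local gradient; the local
# gradients are summed into one overall gradient, and a single shared
# model is updated. The loop repeats until convergence (here, a fixed
# iteration budget stands in for a convergence test).

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
X = rng.normal(size=(400, 2))
y = X @ w_true + 0.1 * rng.normal(size=400)

N = 4                                        # number of partitions
partitions = [(X[i::N], y[i::N]) for i in range(N)]

w = np.zeros(2)
lr = 0.1
for _ in range(300):
    # each term of this sum would be computed in parallel on its partition
    grad = sum(Xp.T @ (Xp @ w - yp) / len(yp) for Xp, yp in partitions)
    w -= lr * grad / N                       # update from the summed gradient

print(w)                                     # approaches w_true
```

Note that summing per-partition gradients reproduces the gradient over the full data set, so DGD traces the same trajectory as a serial steepest descent run, just with the gradient computation distributed.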
Non Patent Literature 1 listed below reports experimental results showing that the IPM method provides high-speed optimization for the structured perceptron and the maximum entropy method. Further, when the DGD method or the IPM method is implemented, MapReduce (see Non Patent Literature 3), a distributed processing framework, can be used. Thus, the DGD method and the IPM method are advantageous in that they can be easily implemented even by users not familiar with distributed programming.
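As a rough illustration of why MapReduce fits these methods, one IPM iteration can be phrased as a map step (train a local model on each partition by stochastic gradient descent) followed by a reduce step (sum and average the local models). This is a hypothetical sketch in which Python's built-in `map` and `functools.reduce` stand in for the actual framework; the model and hyperparameters are illustrative assumptions:

```python
import numpy as np
from functools import reduce

# Hypothetical sketch: one IPM iteration in MapReduce form, on a toy
# linear least-squares problem. map phase = local stochastic gradient
# descent on each partition; reduce phase = sum the N local models,
# which are then averaged (parameter mixing).

def local_sgd(partition, w_init, lr=0.05, epochs=3):
    Xp, yp = partition
    w = w_init.copy()
    for _ in range(epochs):
        for i in range(len(yp)):             # one stochastic update per sample
            w -= lr * Xp[i] * (Xp[i] @ w - yp[i])
    return w

def ipm_iteration(partitions, w):
    local = map(lambda p: local_sgd(p, w), partitions)    # map phase
    total = reduce(lambda a, b: a + b, local)             # reduce phase
    return total / len(partitions)                        # average the models

rng = np.random.default_rng(1)
w_true = np.array([1.5, -0.5])
X = rng.normal(size=(400, 2))
y = X @ w_true + 0.1 * rng.normal(size=400)
partitions = [(X[i::4], y[i::4]) for i in range(4)]

w = np.zeros(2)
for _ in range(10):                          # IPM is iterated to convergence
    w = ipm_iteration(partitions, w)
print(w)                                     # approaches w_true
```

Because each phase is expressed as a map or a reduce over partitions, the framework can handle data distribution and fault tolerance, which is the ease-of-implementation advantage noted above.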