1. Technical Field
The present teaching relates to methods, systems, and programming for distributed computing. Particularly, the present teaching is directed to methods, systems, and programming for distributed machine learning on a cluster.
2. Discussion of Technical Background
Distributed computing is a field of computer science that studies distributed systems, which include multiple autonomous computers or parallel virtual machines that communicate through a computer network, such as a computer cluster having multiple nodes. The machines in a distributed system interact with each other in order to achieve a common goal. In distributed computing, a problem is divided into many tasks, each of which is solved by one or more computers. Distributed systems and applications may be applied under various paradigms, including grid computing, utility computing, edge computing, and cloud computing, by which users may access the server resources using a computer, netbook, tablet, smart phone, game console, set-top box, or other device through the Internet. A computer program that runs in the distributed system is called a distributed application. For instance, APACHE HADOOP is a software framework that supports data-intensive distributed applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Rather than relying on hardware to deliver high availability, HADOOP is designed to detect and handle failures at the application layer, thereby delivering a highly available service.
Distributed machine learning is one of the distributed applications where much work focuses on the problem in the form
$$\min_{w \in \mathbb{R}^{d}} \; \sum_{i=1}^{n} l(w^{T} x_{i};\, y_{i}) + \lambda R(w), \qquad (1)$$
where x_i is the feature vector of the i-th training sample, y_i is the label, w is the linear predictor (parameters), l is a loss function, and R is a regularizer. Much of this work exploits the natural decomposability over training data (x_i, y_i) in Equation (1), partitioning the training data over different nodes of a cluster. One of the simplest learning strategies when the number n of training samples is very large is to subsample a smaller set of examples on which learning is tractable. However, this solution only works if the problem is simple enough or the number of parameters in w is very small.
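The decomposability of Equation (1) over training data can be illustrated with a minimal sketch. The logistic loss, the squared-norm regularizer R(w) = ||w||^2 / 2, the toy samples, and all function names below are assumptions chosen for illustration; the text above does not fix any of them. The sketch only shows that summing per-shard losses gives the same objective value as a single pass over all samples.

```python
import math

def dot(w, x):
    # Inner product w^T x for plain Python lists
    return sum(wj * xj for wj, xj in zip(w, x))

def logistic_loss(margin, y):
    # Assumed loss l(w^T x; y) for labels y in {-1, +1}
    return math.log1p(math.exp(-y * margin))

def partial_loss(w, shard):
    # Sum of per-sample losses over one shard of (x_i, y_i) pairs,
    # i.e. the part of Equation (1) that one node could compute locally
    return sum(logistic_loss(dot(w, x), y) for x, y in shard)

def objective(w, shards, lam):
    # Equation (1): data term summed over shards plus lam * R(w),
    # with R(w) = ||w||^2 / 2 as an assumed regularizer
    data_term = sum(partial_loss(w, shard) for shard in shards)
    reg = 0.5 * sum(wj * wj for wj in w)
    return data_term + lam * reg

# Toy data, purely illustrative
samples = [([1.0, 0.0], 1), ([0.0, 1.0], -1), ([1.0, 1.0], 1), ([-1.0, 0.5], -1)]
w = [0.5, -0.25]
lam = 0.1

single = objective(w, [samples], lam)                       # one "node" holds everything
sharded = objective(w, [samples[:2], samples[2:]], lam)     # two hypothetical nodes
```

Because the data term is a plain sum over samples, `single` and `sharded` agree up to floating-point rounding; only the per-shard sums would need to be exchanged between nodes.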
Other known solutions include, for example, online learning with averaging, gossip-style message passing algorithms, delayed versions of distributed online learning, mini-batch versions of online algorithms with delay-based updates, applying the alternating direction method of multipliers (ADMM) for distributed learning, and applying the message passing interface (MPI) to parallelize a bundle method for optimization. However, the known solutions fall short empirically when deployed on large clusters. In particular, their throughput, measured as the input size divided by the wall clock running time, is smaller than the I/O bandwidth of a single machine for almost all parallel learning algorithms. The I/O bandwidth is an upper bound on the speed of the fastest sequential algorithm, since every sequential algorithm is limited by the network interface when acquiring data. In addition, because of their incompatibility with HADOOP clusters, the MPI-based solutions cannot take advantage of features of HADOOP clusters, such as data locality and robustness.
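The throughput measure described above can be written out as a short sketch. The function name and every number below are hypothetical and chosen only to show the comparison; they are not measurements of any system.

```python
def throughput(input_bytes, wall_clock_seconds):
    """Input size divided by wall clock running time, in bytes per second."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return input_bytes / wall_clock_seconds

# Illustrative values only (assumptions, not reported results)
input_bytes = 10**12               # a 1 TB dataset
wall_clock_seconds = 20_000        # hypothetical run time of a parallel learner
single_machine_io = 125 * 10**6    # hypothetical 1 Gb/s interface, about 125 MB/s

rate = throughput(input_bytes, wall_clock_seconds)
slower_than_one_machine = rate < single_machine_io
```

With these assumed values the parallel run processes 50 MB per second, which is below the assumed single-machine interface rate; this is the kind of shortfall the comparison is meant to expose.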
Therefore, there is a need to provide an improved solution for distributed machine learning on very large datasets, e.g., a terascale dataset, using a cluster to solve the above-mentioned problems.