The present disclosure is related to systems and methods for optimization in mathematical modeling, and more specifically to systems and methods for parallelizing aspects of such problems to reduce time-to-solution and improve modeling performance.
Model fitting is a technique for developing a function (the objective function) that generalizes observed relationships between dependent and independent variables, such as between a system's input and a system's output, response of a physical process, etc. As an example, one may create a table associating the numbers of years individuals in a test group have played golf and their golf handicaps. Given that set of known years of play and corresponding handicaps, a mathematical model may be developed to estimate or predict handicaps for years of play for which there is no actual data. That is, an objective function may be developed which approximates the actual observed data, and which can be used to estimate responses in cases where actual data does not exist.
An example of such an objective function is a regression expression, such as $h(x) = \theta_0 + \theta_1 x + \theta_2 x^2 + \dots + \theta_n x^n$, where $\theta_i$ are parameters. It will be appreciated that other objective functions have similar properties but different functional forms.
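By way of illustration only, a polynomial regression expression of the kind described above may be evaluated as sketched below. The function name `hypothesis` and the parameter values are hypothetical and chosen solely for the example; they are not taken from any observed data.

```python
def hypothesis(theta, x):
    """Evaluate h(x) = theta[0] + theta[1]*x + ... + theta[n]*x**n."""
    # Horner's rule: fold in the coefficients from highest order to lowest.
    result = 0.0
    for coeff in reversed(theta):
        result = result * x + coeff
    return result


# Hypothetical parameters for h(x) = 30 - 2x + 0.05x^2 (illustrative only).
theta = [30.0, -2.0, 0.05]
predicted = hypothesis(theta, 10.0)
```

Once the parameters are fixed, the same function may be evaluated at inputs for which no observed data exists.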
Cases in which a set of labeled (known) input and corresponding output data is provided for the purpose of developing an objective function(s) are generally referred to as supervised learning problems, and the data set is generally referred to as a labeled training set. Cases in which the data set is not labeled (e.g., there is no indication of the nature of the training data) are generally referred to as unsupervised learning, and the data set is generally referred to as an unlabeled training set. The present disclosure applies to both supervised and unsupervised (as well as hybrid) learning techniques. However, specific techniques for supervised and unsupervised learning are beyond the scope of this disclosure.
An objective function may be a classifier or a regression function. If outputs of the function are discrete values, the function is often referred to as a classifier. If outputs of the function are continuous values, the function is often referred to as a regression function.
In the process of determining appropriate parameters for an objective function, a starting set of parameters is often provided, and the parameters are refined to fit labeled or unlabeled training data. Once an acceptable set of parameters is determined, the objective function may be evaluated for input values not present in the training set (i.e., the objective function may be used to make predictions). Model fitting is a crucial and often very time-consuming component of machine-learning and forecasting algorithms.
Many examples of applications of model fitting exist today. One example application is image classification, in which the model is fitted to label a set of pictures based on an already labeled subset of the images. In this case, the application may learn to detect features and use the detected features to identify whether a picture belongs to a class. In general, this has several practical applications, such as handwriting recognition, automatic labeling for search, filtering unwanted results, etc.
Another example application of model fitting may include natural language processing. In this example, classifying sound samples may be used to recognize words or phrases, determine speaker language, translate spoken words, and transcribe spoken words. Sound classification may also be used to control hardware and/or software, and serve as a form of human-computer interface.
A further example application of model fitting may include text analysis and recognition. In this example, handwriting or typography may be recognized and converted to a digital format, evaluated for content, authenticity, and so on. Applications include optical character recognition, text filtering (e.g., spam filtering in email), and hardware and/or software control, such as serving as a form of human-computer interface.
Other example applications of model fitting may include forecasting and predicting, such as for traffic patterns (e.g., physical or data traffic), human behavior (e.g., consumer decisions), financial patterns (e.g., housing prices), propagation (e.g., disease spreading), diagnoses (e.g., likelihood of malignancy), and so on. Such forecasts can be used for informed decision making, better resource allocation, and so on.
When developing the objective function h(x), referred to as a hypothesis, the “closeness” of the hypothesis (and hence the accuracy of the parameters) to the actual input/output relationship is examined. One example of a measure of this closeness is referred to as a “cost function”, such as given by the relationship:
$$J(\vec{\theta}) = \frac{1}{2m} \sum_{i=1}^{m} \left[ h_{\vec{\theta}}\left(x^{(i)}\right) - y^{(i)} \right]^2$$
where $\vec{\theta}$ is a vector of parameters $[\theta_1, \theta_2, \dots, \theta_n]$, $x^{(i)}$ is the $i$th input variable, $y^{(i)}$ is the $i$th output variable, and $m$ is the number of training examples. The values of $\vec{\theta}$ are determined such that $J(\vec{\theta})$ is minimized and the hypothesis, h(x), most closely models the actual relationship represented in the training set (and hence the system from which the training set is obtained).
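By way of illustration only, the cost function above may be computed as sketched below. The polynomial form of `hypothesis`, the function name `cost`, and the small training set are assumptions made for the example.

```python
def hypothesis(theta, x):
    """Assumed polynomial hypothesis: sum of theta[k] * x**k."""
    return sum(t * x ** k for k, t in enumerate(theta))


def cost(theta, xs, ys):
    """Half the mean squared error over the m training examples."""
    m = len(xs)
    squared_errors = ((hypothesis(theta, x) - y) ** 2 for x, y in zip(xs, ys))
    return sum(squared_errors) / (2 * m)


# Hypothetical training set that lies exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect_fit = cost([0.0, 2.0], xs, ys)  # parameters that reproduce the data
poorer_fit = cost([0.0, 1.0], xs, ys)   # parameters that underestimate every output
```

A parameter vector that reproduces the training data yields a cost of zero, and a poorer fit yields a larger cost, which is why minimizing the cost is used to select the parameters.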
A wide variety of methods are in use today for determining the vector $\vec{\theta}$ of parameters that most closely represents the observed data. Certain of these methods rely on attributes of the objective function represented by a first derivative or first partial derivative (such as the gradient of the objective function), and accordingly are referred to as “first-order” methods. Other methods rely on attributes of the objective function represented by higher order derivatives (such as second partial derivatives), and are accordingly referred to as methods of “order greater than one,” or equivalently “higher-order” methods. Higher-order methods present a number of advantages over first-order methods. One advantage is that higher-order methods are generally more autonomous, and converge more reliably without significant user intervention. First-order methods, however, converge with fewer associated computations, meaning they provide lower computational cost when compared to higher-order methods.
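By way of illustration only, the difference between the two families may be sketched on a one-dimensional cost, here assumed to be J(θ) = (θ − 3)². The first-order update requires a user-chosen learning rate, whereas the second-order (Newton) update scales its step by the curvature. The cost, the function names, and the learning rate are all hypothetical.

```python
def gradient(theta):
    """First derivative of the assumed cost J(theta) = (theta - 3)**2."""
    return 2.0 * (theta - 3.0)


def curvature(theta):
    """Second derivative of the assumed cost (constant for a quadratic)."""
    return 2.0


def gradient_descent(theta, learning_rate, steps):
    """First-order method: uses only the gradient and a tuned step size."""
    for _ in range(steps):
        theta -= learning_rate * gradient(theta)
    return theta


def newton(theta, steps):
    """Higher-order method: divides the gradient by the local curvature."""
    for _ in range(steps):
        theta -= gradient(theta) / curvature(theta)
    return theta


first_order_result = gradient_descent(0.0, learning_rate=0.1, steps=50)
higher_order_result = newton(0.0, steps=1)
```

On this quadratic, the Newton update reaches the minimizer in one step without a learning rate, while each gradient-descent step is cheaper but many steps, and a suitable learning rate, are needed.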
A number of higher-order methods are known. One example is the Broyden-Fletcher-Goldfarb-Shanno (“BFGS”) method. BFGS is a member of the family of secant methods (quasi-Newton methods) used for finding a root of the first derivative of a target function. In general, these methods examine the curvature of the target function in order to provide convergence on a minimum of the function $J(\vec{\theta})$. Therefore, BFGS relies on evaluation of the second-order partial derivatives, which in square matrix form are referred to as the Hessian matrix ($\nabla^2 J(\vec{\theta})$) and which describe local curvature.
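By way of illustration only, the secant condition and the curvature update of BFGS, as commonly presented in optimization textbooks, may be written as follows. The symbols $B_k$ (the current Hessian estimate), $\delta_k$ (the change in parameters), and $\gamma_k$ (the change in gradient) are notation assumed for this sketch and do not appear elsewhere in this disclosure.

$$\delta_k = \vec{\theta}_{k+1} - \vec{\theta}_k, \qquad \gamma_k = \nabla J(\vec{\theta}_{k+1}) - \nabla J(\vec{\theta}_k), \qquad B_{k+1}\,\delta_k = \gamma_k$$

$$B_{k+1} = B_k - \frac{B_k\,\delta_k\,\delta_k^{T}\,B_k}{\delta_k^{T}\,B_k\,\delta_k} + \frac{\gamma_k\,\gamma_k^{T}}{\gamma_k^{T}\,\delta_k}$$

In this textbook formulation, the curvature information is accumulated from successive gradient differences, and the stored matrix $B_k$ has one row and one column per parameter.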
In many modern problems, such as image or speech recognition, there may be very large training sets, for example millions or more data pairs. However, evaluating the Hessian matrix for such relatively large data sets is computationally quite expensive and slow; in some cases the data set is sufficiently large that a problem cannot be reasonably computed on a single computer. Therefore, limited-memory methods, such as the limited-memory BFGS (L-BFGS) method, have been developed to reduce computing cost and improve scalability for large data sets. In L-BFGS the Hessian matrix is only approximated, and a relatively small history of prior estimates is sufficient for the algorithm to converge on a minimum. (See, e.g., Nocedal, Numerical Optimization (Springer, 2006), e.g., pp. 164-189, the entirety of which is incorporated herein by reference.) Nonetheless, it is generally accepted that even limited-memory methods such as L-BFGS do not scale well to very large data sets from storage and computation cost perspectives. While BFGS and L-BFGS are referred to above, similar reasoning applies to other known higher-order methods. Therefore, for reasons of storage and computation cost, and even though first-order methods are less autonomous than higher-order methods, there is a preference for first-order methods such as gradient descent when the training set is very large.
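By way of illustration only, the manner in which a small history can stand in for the full Hessian matrix may be sketched with the commonly published "two-loop recursion" for L-BFGS. The function names, argument names, and the one-dimensional example are hypothetical; the sketch omits the line search and history management that a practical implementation would require.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def lbfgs_direction(gradient, param_steps, gradient_diffs):
    """Approximate the Newton direction -H^{-1} * gradient from a short history.

    param_steps[k] holds a past change in the parameters and gradient_diffs[k]
    the matching change in the gradient, ordered oldest to newest.
    """
    q = list(gradient)
    saved = []
    # First loop: walk the history from newest to oldest.
    for s, d in zip(reversed(param_steps), reversed(gradient_diffs)):
        rho = 1.0 / dot(d, s)
        alpha = rho * dot(s, q)
        q = [qi - alpha * di for qi, di in zip(q, d)]
        saved.append((rho, alpha))
    # Scale by an initial inverse-curvature guess taken from the newest pair.
    if param_steps:
        s, d = param_steps[-1], gradient_diffs[-1]
        scale = dot(s, d) / dot(d, d)
    else:
        scale = 1.0
    r = [scale * qi for qi in q]
    # Second loop: walk the history from oldest to newest.
    history = zip(param_steps, gradient_diffs)
    for (s, d), (rho, alpha) in zip(history, reversed(saved)):
        beta = rho * dot(d, r)
        r = [ri + (alpha - beta) * si for ri, si in zip(r, s)]
    return [-ri for ri in r]


# Assumed cost J(theta) = theta**2, whose gradient is 2 * theta.
# One stored pair: a parameter change of 1.0 produced a gradient change of 2.0.
direction = lbfgs_direction([4.0], [[1.0]], [[2.0]])
```

The storage grows with the number of stored pairs multiplied by the number of parameters, rather than with the square of the number of parameters as for a full Hessian matrix.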
In order to provide the benefits of higher-order methods while reducing associated compute costs, distributed, parallel operation of methods such as L-BFGS has been explored. In one such distributed operation, the data set is broken up into groups referred to as shards. In one example, each shard is operated on by an independent (“worker”) processor, which calculates certain values, such as estimates of the appropriate derivatives for that shard of data. The results from each independent processor are provided to a “master” processor, which ultimately evaluates the overall cost function and updates the parameter vector.
According to such known methods for independent processing of data shards, a complete replica of the parameter vector is provided to each worker processor. The worker processor calculates derivatives or gradients for its particular shard, and may evaluate the cost function for that data as well. The worker processor provides the gradient calculations (and/or the cost function evaluation) to the master processor. The master processor then modifies a master set of parameters in an effort to minimize the cost function for all shards.
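By way of illustration only, the known sharded arrangement described above may be sketched as follows. For brevity the sketch runs the workers one after another in a single process, assumes a linear hypothesis h(x) = θ0 + θ1x, and substitutes a plain gradient-descent update at the master in place of L-BFGS. All names, the data, and the learning rate are hypothetical.

```python
def worker(theta, shard):
    """Return the summed gradient and summed squared error for one shard."""
    grad = [0.0, 0.0]
    squared_error = 0.0
    for x, y in shard:
        residual = theta[0] + theta[1] * x - y
        grad[0] += residual
        grad[1] += residual * x
        squared_error += residual ** 2
    return grad, squared_error


def master_step(theta, shards, learning_rate):
    """Send a full parameter replica to every worker, then combine the results."""
    m = sum(len(shard) for shard in shards)
    # Each worker receives its own complete copy of the parameter vector.
    results = [worker(list(theta), shard) for shard in shards]
    total_grad = [sum(g[j] for g, _ in results) / m for j in range(len(theta))]
    total_cost = sum(err for _, err in results) / (2 * m)
    # Simplified master update (gradient descent stands in for L-BFGS here).
    new_theta = [t - learning_rate * g for t, g in zip(theta, total_grad)]
    return new_theta, total_cost


# Hypothetical training set on y = 2x, split into two shards.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
shards = [data[:2], data[2:]]

theta = [0.0, 0.0]
costs = []
for _ in range(200):
    theta, step_cost = master_step(theta, shards, learning_rate=0.05)
    costs.append(step_cost)
```

Because the cost is a sum over training examples, the per-shard gradients and errors can simply be added at the master. The sketch also makes visible that every worker receives, and iterates over, the entire parameter vector on every step.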
While this distribution of the data and processing parallelizes the processing of the large training set, and thereby reduces compute cost, there is a desire to further reduce cost to provide effective parallel processing in higher-order methods such as L-BFGS and the like. In particular, there is a high computational cost for distribution of the entire parameter vector to each worker processor, and further cost for processing of the entire parameter vector by each worker processor.