The exemplary embodiment relates to optimization of integral functions and finds particular application in connection with a system and method which use an adapted weighted stochastic gradient descent approach to update a probability distribution function used for sampling data during the optimization.
Many problems involve optimizing a function (e.g., minimizing a cost function or maximizing a future reward) with respect to some parameters (such as the investment level on each of several assets). The function to optimize is often an integral (for example, an expectation, i.e., the average over multiple possible outcomes). When this integral is not tractable, i.e., when it is hard to compute, the optimization problem is itself difficult. In many real-world problems involving intractable integrals (such as averaging over combinatorial spaces, non-Gaussian integrals, etc.), the solution can be expressed as an integral of a function of a sampled value and a gradient of the function. Stochastic Gradient Descent (SGD) is an optimization technique that can optimize an integral without having to compute it. However, SGD is known to be slow to converge.
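By way of illustration only, the following minimal Python sketch shows how SGD can minimize an expectation using only sampled gradients, without ever evaluating the integral. The objective (one half of the expected squared distance between a parameter theta and a Gaussian sample with mean 3.0), the step size 1/t, and all function names are assumptions chosen for the example and are not taken from any particular embodiment.

```python
import random


def sgd_minimize_expectation(sample, grad, theta0, n_steps, seed=0):
    """Minimize E[f(theta, X)] using one sampled gradient per step."""
    rng = random.Random(seed)
    theta = theta0
    for t in range(1, n_steps + 1):
        x = sample(rng)                      # draw one outcome X
        theta -= (1.0 / t) * grad(theta, x)  # decreasing step size (assumed 1/t)
    return theta


def sample_x(rng):
    # Hypothetical sampling distribution: Gaussian with mean 3.0, std 1.0.
    return rng.gauss(3.0, 1.0)


def grad_f(theta, x):
    # Gradient with respect to theta of f(theta, x) = 0.5 * (theta - x) ** 2.
    return theta - x


theta_hat = sgd_minimize_expectation(sample_x, grad_f, theta0=0.0, n_steps=10000)
print(theta_hat)  # approaches the minimizer, i.e., the mean of X
```

In this sketch, each step uses a single noisy gradient, so the estimate approaches the minimizer only gradually as the number of steps grows, which is consistent with the slow convergence noted above.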
Stochastic approximation is a class of methods to solve intractable equations by using a sequence of approximate (and random) evaluations. See H. Robbins and S. Monro, “A stochastic approximation method,” Annals of Mathematical Statistics, 22(3):400-407, 1951. Stochastic Gradient Descent (SGD) is a special type of stochastic approximation method that optimizes an integral using approximate gradient steps (see Léon Bottou, “Online algorithms and stochastic approximations,” in Online Learning and Neural Networks (David Saad, Ed., CUP 1998)). It has been shown that this technique is very useful in large scale learning tasks because it can provide good generalization properties with a small number of passes through the data (see Léon Bottou and Olivier Bousquet, “The tradeoffs of large scale learning,” in Optimization for Machine Learning, pp. 351-368 (Suvrit Sra, Sebastian Nowozin, and Stephen J. Wright, Eds., MIT Press, 2011)).
The convergence properties of SGD algorithms are directly linked to the variance of the gradient estimate. A tradeoff between the variance of the gradient and the convergence speed can be obtained using batching (see, for example, M. Friedlander and M. Schmidt, “Hybrid deterministic-stochastic methods for data fitting,” UBC-CS technical report, TR-2011-01). However, with batching, the time required for every step increases with the size of the batch.
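The batching tradeoff can be illustrated with a short Python sketch, offered only as an example under stated assumptions. It empirically estimates the variance of a mini-batch gradient for several batch sizes, using the hypothetical objective 0.5 * (theta - x) ** 2 with Gaussian samples of mean 3.0 and unit variance. The function names and parameter values are illustrative and do not come from the cited report.

```python
import random
import statistics


def batch_gradient(theta, batch):
    """Average the per-sample gradients (theta - x) over one batch."""
    return sum(theta - x for x in batch) / len(batch)


def gradient_variance(theta, batch_size, n_trials=2000, seed=0):
    """Empirical variance of the batch gradient estimate at a fixed theta."""
    rng = random.Random(seed)
    grads = []
    for _ in range(n_trials):
        # Each trial draws batch_size samples, so cost per step grows with batch size.
        batch = [rng.gauss(3.0, 1.0) for _ in range(batch_size)]
        grads.append(batch_gradient(theta, batch))
    return statistics.variance(grads)


variances = {b: gradient_variance(0.0, b) for b in (1, 10, 100)}
for b, v in variances.items():
    print(b, v)  # variance shrinks roughly in proportion to 1 / batch size
```

For independent samples, the variance of the averaged gradient falls roughly as the inverse of the batch size, while the number of samples drawn and gradients evaluated per step grows linearly with it.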
There remains a need for an improvement to the standard SGD algorithms for solving optimization problems efficiently, without introducing errors.