Stochastic gradient descent (SGD) is the most widely used optimization method for training deep learning models, although it can be very sensitive to hyperparameter values and is not straightforward to parallelize. SGD variants, such as ADAM-SGD and Momentum-SGD, have been proposed to improve its performance. Although these variants can be more efficient and more robust, tuning their hyperparameters remains a daunting task. As a quasi-Newton method, the limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS) generally requires fewer iterations to converge, requires much less hyperparameter tuning, and is naturally parallelizable. Despite recent progress in using stochastic L-BFGS for machine learning, stochastic L-BFGS remains less efficient than SGD for deep learning overall because it may become trapped in local minima, may require a long training time, and may produce large errors.