Deep neural networks (DNNs) may be used to perform tasks such as speech recognition, image recognition, handwriting analysis, and object classification. DNNs may be trained to perform a particular task using techniques such as mini-batch stochastic gradient descent (SGD), asynchronous SGD, model averaging, or a combination of asynchronous SGD and model averaging. However, each of these techniques has drawbacks. For example, mini-batch SGD is a sequential training procedure: each parameter update depends on the parameters produced by the preceding mini-batch. Accordingly, training of DNNs using mini-batch SGD is difficult to parallelize across multiple computing devices. Further, although techniques such as asynchronous SGD or model averaging may enable the parallelization of training across multiple computing nodes, DNNs that are trained using such techniques generally produce results that are inferior to those produced using mini-batch SGD.
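The structural difference between the sequential and the parallelizable procedures can be illustrated with a minimal sketch. It assumes a toy one-parameter linear model in place of a DNN, and the names `make_data`, `batch_gradient`, `minibatch_sgd`, and `model_averaging`, as well as the learning rate, batch size, and shard layout, are hypothetical choices made for illustration only. The averaging shown is a simple one-shot variant in which each worker trains independently and the results are averaged once at the end.

```python
def make_data(n=40, true_w=3.0):
    # Hypothetical noise-free training set for the toy model y = w * x.
    xs = [i / 10 for i in range(1, n + 1)]
    return [(x, true_w * x) for x in xs]


def batch_gradient(w, batch):
    # Gradient of the mean squared error (with a 1/2 factor) over one mini-batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def minibatch_sgd(data, w=0.0, lr=0.05, batch_size=10, epochs=20):
    # Sequential by construction: the gradient for each mini-batch is evaluated
    # at the w produced by the previous mini-batch, so the inner loop cannot be
    # split across devices without changing the algorithm.
    for _ in range(epochs):
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            w = w - lr * batch_gradient(w, batch)
    return w


def model_averaging(data, num_workers=2, **sgd_kwargs):
    # Each worker sees only its own shard and never reads another worker's w,
    # so the per-shard runs are independent and could execute on separate nodes.
    # They are run one after another here only to keep the sketch self-contained.
    shards = [data[k::num_workers] for k in range(num_workers)]
    local_weights = [minibatch_sgd(shard, **sgd_kwargs) for shard in shards]
    return sum(local_weights) / len(local_weights)


data = make_data()
w_sequential = minibatch_sgd(data)
w_averaged = model_averaging(data)
```

On this convex, noise-free toy problem both procedures recover the same parameter, so the sketch does not reproduce the accuracy gap described above for DNNs; it is intended only to show where the data dependency that blocks parallelization sits.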