Speech recognition is a technology that converts speech signals into text and facilitates human-machine interaction. It is now widely used in fields such as the mobile Internet. Speech recognition is a sequence classification problem: it aims to transform a sequence of collected speech signals into a sequence of output text tokens. Fields related to speech recognition technology include signal processing, pattern recognition, probability theory, information theory, the mechanisms of sound production and hearing, and artificial intelligence.
A conventional speech recognition system is generally divided into three modules: an acoustic model, such as a model described by the Hidden Markov Model-Gaussian Mixture Model (HMM-GMM) framework; a language model, such as an N-gram model; and a decoder, configured to transform the acoustic signals into text by combining the resources of the acoustic model, the language model, the phoneme lexicon, and so on. As Deep Neural Networks (DNNs) have matured in recent years, many of the problems of training multi-layer networks have been solved, and DNN training can also make use of large amounts of unlabeled data. In the field of speech recognition, DNNs also exhibit powerful modeling capabilities and have proven effective in practice for both acoustic model training and language model training.
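As an illustration of how a decoder combines the acoustic model and the language model, the textbook decision rule can be written as follows. The symbols are assumptions introduced here, not taken from the text above: X denotes the sequence of acoustic features, W a candidate word sequence, p(X | W) the acoustic model score (evaluated through the phoneme lexicon), and P(W) the language model score.

```latex
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \, p(X \mid W) \, P(W)
```

The second equality follows from Bayes' rule, since p(X) does not depend on W and can be dropped from the maximization.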
In conventional DNN model training, the Stochastic Gradient Descent (SGD) algorithm can only estimate the model parameters serially: each update depends on the parameters produced by the previous one. Because of this dependence in time between the updates for different speech data, it is difficult to apply multi-machine parallelization schemes such as Map-Reduce, so DNN model training is not easy to speed up. To achieve better recognition accuracy, practical applications usually train DNN models on a tremendous amount of data. With the conventional SGD method, however, such training often takes thousands of hours, up to a couple of months. A training process this long can hardly meet the real-time requirements of applications.
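The serial nature of SGD can be seen in a minimal sketch. The example below is an assumption made for illustration only: it fits a two-parameter linear model rather than a DNN, and the function name `sgd_linear`, the learning rate, and the toy data are all hypothetical. What it shows is that every step reads the parameters written by the step before it, so the steps cannot simply be distributed across machines.

```python
def sgd_linear(samples, lr=0.05, epochs=2000):
    """Fit y = w*x + b with plain SGD, processing one sample at a time."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        for x, y in samples:
            # Prediction error under the *current* parameters.
            err = (w * x + b) - y
            # This update reads w and b as left by the previous sample,
            # so the updates form a strictly serial chain.
            w -= lr * err * x
            b -= lr * err
    return w, b


# Toy data generated from y = 2x + 1 (hypothetical, noise-free).
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = sgd_linear(samples)
```

Because the toy data is noise-free, the fitted `w` and `b` approach 2 and 1. With real speech data the same serial chain runs over a vastly larger number of samples, which is where the long training times come from.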
In the research field, the Quasi-Newton method was first introduced to estimate DNN models. This is a second-order optimization method: it first approximates the inverse of the Hessian matrix (the matrix of second-order derivatives), then uses this approximate inverse to update the model parameters. It is a batch training mode rather than an online training mode. In other words, the model is updated only once per pass over all the data, and there is no time dependence between data.
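The two steps described above can be written out as a sketch. The text does not name a specific Quasi-Newton variant, so the widely used BFGS update is assumed here purely for illustration. The notation is also an assumption: \theta_k denotes the model parameters after the k-th batch update, L the training loss computed over all the data, H_k the current approximation of the inverse Hessian, and \alpha_k a step size.

```latex
\begin{aligned}
\theta_{k+1} &= \theta_k - \alpha_k \, H_k \, \nabla L(\theta_k), \\
s_k &= \theta_{k+1} - \theta_k, \qquad
y_k = \nabla L(\theta_{k+1}) - \nabla L(\theta_k), \qquad
\rho_k = \frac{1}{y_k^{\top} s_k}, \\
H_{k+1} &= \left(I - \rho_k \, s_k y_k^{\top}\right) H_k \left(I - \rho_k \, y_k s_k^{\top}\right) + \rho_k \, s_k s_k^{\top}.
\end{aligned}
```

Each gradient \nabla L(\theta_k) is a sum over the whole training set, so its per-sample terms can be computed independently of one another. That is the sense in which the batch mode has no time dependence between data.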
In theory, second-order optimization can reach a solution equivalent to that of first-order optimization while converging in fewer iterations than the traditional SGD method. However, on big data, second-order optimization usually requires a great deal of detailed fine-tuning. In the absence of a priori knowledge, it is often less robust than first-order optimization. Specifically, for DNN modeling in speech recognition, this algorithm cannot match the usual performance of the SGD algorithm.