Technical Field
The present invention relates to training a Deep Neural Network (DNN), and more specifically, to an improvement in training a DNN for acoustic modeling in speech recognition.
Description of the Related Art
Recently, DNNs have been widely used as feature extractors for Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) systems and as Acoustic Models (AMs) for DNN-HMM systems in automatic speech recognition (ASR). A DNN for ASR typically comprises an input layer that accepts several concatenated frames of multi-dimensional acoustic features, hidden layers, and an output layer that predicts the HMM state of the center frame in the input layer. During training, the parameters of the DNN, such as the weights and biases connecting the input layer, the hidden layers and the output layer, are estimated automatically based on a training criterion such as cross entropy, so that the output layer predicts the HMM state of the center frame in the input layer.
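For illustration only, the structure described above can be sketched in a few lines of NumPy. The layer sizes, the sigmoid hidden activation, the random initialization and the function names `forward` and `cross_entropy` are assumptions made for this sketch and are not taken from the cited systems.

```python
import numpy as np

def forward(x, weights, biases):
    """Return log-posteriors over HMM states, shape (batch, num_states)."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        # Assumption: sigmoid hidden units; other activations are possible.
        h = 1.0 / (1.0 + np.exp(-(h @ W + b)))
    logits = h @ weights[-1] + biases[-1]
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    return logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))

def cross_entropy(log_probs, state_labels):
    """Mean negative log-posterior of the labeled HMM state of each center frame."""
    return -log_probs[np.arange(len(state_labels)), state_labels].mean()

# Toy sizes (assumptions): 11 concatenated frames of 40-dim features,
# two hidden layers of 64 units, 100 HMM states.
rng = np.random.default_rng(0)
sizes = [11 * 40, 64, 64, 100]
weights = [rng.normal(0.0, 0.1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

x = rng.normal(size=(8, sizes[0]))            # 8 concatenated input vectors
labels = rng.integers(0, sizes[-1], size=8)   # HMM state of each center frame
log_probs = forward(x, weights, biases)
loss = cross_entropy(log_probs, labels)
```

Training would adjust `weights` and `biases` to reduce `loss`; the gradient computation is omitted here for brevity.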
In speech recognition using a DNN, it is common practice to concatenate several consecutive frames of acoustic features as the input of the DNN. As an example of concatenating consecutive frames, Non Patent Literature (D. Povey, B. Kingsbury, L. Mangu, G. Saon, H. Soltau, and G. Zweig, “fMPE: Discriminatively trained features for speech recognition,” in Proc. ICASSP, 2005, pp. 961-964) discloses acoustic context expansion. In acoustic context expansion, a vector of posteriors is formed on each frame and is further expanded with left and right acoustic context.
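As a minimal sketch of such frame concatenation, the following NumPy function joins each frame with its left and right neighbors. The function name `splice_frames`, the context widths and the edge handling (repeating the first and last frame) are assumptions chosen for illustration; the cited literature may handle utterance boundaries differently.

```python
import numpy as np

def splice_frames(features, left=5, right=5):
    """Concatenate each frame with its left and right context frames.

    features: array of shape (num_frames, feat_dim).
    Returns an array of shape (num_frames, (left + 1 + right) * feat_dim).
    """
    features = np.asarray(features, dtype=float)
    num_frames, _ = features.shape
    # Assumption: pad utterance edges by repeating the first and last frame.
    padded = np.pad(features, ((left, right), (0, 0)), mode="edge")
    # Slice k holds, for every frame t, the frame at relative offset k - left.
    windows = [padded[k:k + num_frames] for k in range(left + 1 + right)]
    return np.concatenate(windows, axis=1)

# Toy data: 6 frames of 2-dimensional features, one frame of context per side.
feats = np.arange(12, dtype=float).reshape(6, 2)
spliced = splice_frames(feats, left=1, right=1)
```

In this toy case each row of `spliced` holds the previous, current and next frame side by side, so the center frame occupies the middle `feat_dim` columns.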
The several concatenated frames accepted by the input layer include central frames, which consist of the center frame and a few frames preceding or succeeding the center frame, and side frames, which precede or succeed the central frames. In conventional DNN training, however, the acoustic features of the side frames in the input layer are related to the HMM state of the center frame even though the side frames may contain irrelevant information. As a result, the DNN may rely too heavily on the side frames and over-fit to the DNN training data, especially considering the frame-based processing of ASR.