1. Field
The following description relates to technology for normalizing input data of an acoustic model for gradual decoding in speech recognition.
2. Description of Related Art
In general, a speech recognition engine consists of an acoustic model, a language model, and a decoder. The acoustic model calculates pronunciation-specific probabilities for each frame of an input speech signal, and the language model provides information on how frequently a specific word or sentence is used. The decoder calculates which word or sentence is similar to an input speech based on the information provided by the acoustic model and the language model, and outputs the calculation result. A Gaussian mixture model (GMM) acoustic model has been generally used, and speech recognition performance is improving lately with the advent of a deep neural network (DNN) acoustic model. A bidirectional recurrent deep neural network (BRDNN) calculates pronunciation-specific probabilities for each frame of a speech in consideration of bidirectional information, that is, preceding and subsequent frame information, and thus receives the speech as a whole. When each frame of a speech signal input during model training is represented as an N-dimensional vector, a BRDNN acoustic model performs normalization so that each dimensional value of the vector is within a specific range. While normalization may be generally performed based on whole training data or each utterance, the BRDNN acoustic model performs normalization in units of utterances.