Technical Field
The present invention relates generally to machine learning and, more particularly, to learning a model for recognition processing.
Related Art
Deep neural networks (DNNs) have been widely used in various recognition processing systems such as automatic speech recognition (ASR) systems, optical character recognition (OCR) systems, motion recognition systems, etc.
In ASR, it is known that DNNs with many hidden layers can outperform Gaussian mixture models (GMMs) on a variety of speech recognition benchmarks (G. Hinton, et al. “Deep Neural Networks for Acoustic Modeling in Speech Recognition.” IEEE Signal Processing Magazine 29(6):82-97. 2012.). In acoustic models, GMMs are used together with hidden Markov models (HMMs) to determine how well each state of each HMM fits a frame, or a short window of frames, of coefficients that represent the acoustic input.
It is also known that better phone recognition can be achieved by replacing GMMs with DNNs (A. Mohamed, et al. “Acoustic Modeling using Deep Belief Networks.” IEEE Transactions on Audio, Speech, and Language Processing 20(1): 14-22. 2012). The networks are first pre-trained as a multilayer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, discriminative fine-tuning using backpropagation is performed to adjust the features slightly so that they become better at predicting a probability distribution over the states of monophone hidden Markov models.
The DNN for the acoustic model has one or more layers of hidden units between the input and output layers; it takes acoustic features as input and produces posterior probabilities over HMM states as output. As the input to the DNN, a plurality of frames of acoustic features is typically used. Generally, wider input frames may retain richer information, thus resulting in better accuracy. However, using wider input frames increases latency and computation cost during the recognition process, thereby negatively impacting user experience, especially for a real-time recognition task. Hence, there is a tradeoff between accuracy and latency in a conventional DNN-based acoustic model. Such a tradeoff may also arise in other recognition models, such as image recognition models, motion recognition models, etc.
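By way of illustration only, the input splicing and the resulting look-ahead latency described above may be sketched as follows. The sketch is not taken from the cited references: the function names, the layer sizes, the sigmoid hidden units, the random weights, and the 10 ms frame shift are all assumptions chosen for the example. Each frame is stacked with its neighboring frames to form the DNN input, a small feed-forward network maps that input to posterior probabilities over a hypothetical set of HMM states, and the look-ahead latency is estimated as the number of future frames multiplied by the frame shift.

```python
import math
import random


def splice_frames(features, left, right):
    """Stack each frame with `left` past and `right` future frames.

    Frames beyond the utterance edges are padded by repeating the edge frame.
    """
    n = len(features)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp at utterance edges
            window.extend(features[idx])
        spliced.append(window)
    return spliced


def dense(x, weights, biases):
    """Affine transform: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(x, layers):
    """Sigmoid hidden layers (an assumption) followed by a softmax output layer."""
    for weights, biases in layers[:-1]:
        x = [1.0 / (1.0 + math.exp(-z)) for z in dense(x, weights, biases)]
    weights, biases = layers[-1]
    return softmax(dense(x, weights, biases))


def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights, for shape illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def lookahead_latency_ms(right, frame_shift_ms=10.0):
    """Delay from waiting for `right` future frames (10 ms shift is assumed)."""
    return right * frame_shift_ms


rng = random.Random(0)
feat_dim, num_frames, num_states = 4, 20, 3   # hypothetical sizes
left, right = 5, 5                            # 11-frame input window

features = [[rng.gauss(0.0, 1.0) for _ in range(feat_dim)]
            for _ in range(num_frames)]
spliced = splice_frames(features, left, right)

input_dim = (left + 1 + right) * feat_dim
layers = [random_layer(rng, input_dim, 8), random_layer(rng, 8, num_states)]
posteriors = forward(spliced[0], layers)

print(len(spliced[0]), lookahead_latency_ms(right))
```

In this sketch, widening the window enlarges both the input dimension (and thus the first-layer computation) and the number of future frames that must arrive before a frame can be scored, which is the tradeoff noted above.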
What is needed is a method, an associated computer system, and a computer program product capable of improving recognition accuracy without increasing latency and computation cost during recognition processing.