Automatic speech recognition (ASR) systems are challenged by the wide range of speaking, channel, and environmental conditions that humans generally handle well. These conditions may include, for example, ambient noise, speaker variability, accents, dialects, and language differences. Other variations may also be present in a particular speech pattern.
These types of acoustic variation have been found to be challenging for most ASR systems that use Hidden Markov Models (HMMs) to model the sequential structure of speech signals, where each HMM state uses a Gaussian mixture model (GMM) to model a short-time spectral representation of the speech signal. Better acoustic models should be able to model a variety of acoustic variations in speech signals more effectively, in order to achieve robustness against various speaking and environmental conditions.
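By way of illustration only, the per-state GMM described above can be sketched as a function that scores one short-time feature frame against the mixture of a single HMM state. The function name `gmm_log_likelihood`, the use of diagonal covariances, and the argument layout are assumptions made for this sketch and are not taken from any particular ASR system.

```python
import numpy as np

def gmm_log_likelihood(frame, weights, means, variances):
    """Log-likelihood of one feature frame under a diagonal-covariance GMM.

    frame:     (D,) feature vector for one short-time analysis window
    weights:   (K,) mixture weights, assumed to sum to 1
    means:     (K, D) component means
    variances: (K, D) per-dimension component variances (diagonal covariance,
               an assumption of this sketch)
    """
    frame = np.asarray(frame, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)

    diff = frame - means                                        # (K, D)
    log_norm = np.sum(np.log(2.0 * np.pi * variances), axis=1)  # (K,)
    mahal = np.sum(diff ** 2 / variances, axis=1)               # (K,)
    log_comp = np.log(weights) - 0.5 * (log_norm + mahal)       # (K,)

    # Log-sum-exp over components for numerical stability.
    m = np.max(log_comp)
    return float(m + np.log(np.sum(np.exp(log_comp - m))))
```

In an HMM-based recognizer, a score of this kind would be computed for each state and each frame, and the decoder would combine these scores with the state transition probabilities.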
More recently, deep neural networks have been proposed to replace GMMs as the basic acoustic models for HMM-based speech recognition systems, and it has been demonstrated that neural network (NN)-based acoustic models can achieve competitive recognition performance in some difficult large vocabulary continuous speech recognition (LVCSR) tasks. One advantage of NNs is their distributed representation of input features (i.e., many neurons are active simultaneously to represent the input features), which generally makes them more efficient than GMMs. This property allows NNs to model a diversity of speaking styles and background conditions with typically much less training data, because NNs can share similar portions of the input space to train some hidden units while keeping other units sensitive to a subset of the input features that are significant to recognition. However, these NNs can be computationally expensive to implement.
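By way of illustration only, an NN-based acoustic model of the kind described above can be sketched as a small feed-forward network that maps one feature frame to a distribution over HMM states. The layer sizes, sigmoid hidden units, softmax output, and random weights are all assumptions made for this sketch; `forward` and `multiply_adds` are hypothetical helper names. The second helper counts the multiply-accumulate operations needed per frame, which gives a rough sense of the computational expense noted above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / np.sum(e)

def forward(frame, layers):
    """Map one feature frame to posterior probabilities over HMM states.

    layers is a list of (W, b) pairs. Every layer except the last uses
    sigmoid units, so many hidden units are active at once for a given
    input (a distributed representation). The last layer uses softmax.
    """
    h = np.asarray(frame, dtype=float)
    for W, b in layers[:-1]:
        h = 1.0 / (1.0 + np.exp(-(W @ h + b)))
    W, b = layers[-1]
    return softmax(W @ h + b)

def multiply_adds(layers):
    """Multiply-accumulate operations needed to score a single frame."""
    return sum(W.shape[0] * W.shape[1] for W, _ in layers)

# Hypothetical sizes: 40 input features, two hidden layers, 1000 HMM states.
rng = np.random.default_rng(0)
sizes = [40, 512, 512, 1000]
layers = [
    (0.1 * rng.standard_normal((n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(sizes[:-1], sizes[1:])
]
posteriors = forward(rng.standard_normal(40), layers)
```

Because the per-frame cost grows with the product of adjacent layer widths, widening or deepening such a network increases the cost of every frame that is scored.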
It is an object of the following to obviate or mitigate at least one of the foregoing issues.