Deep Belief Networks (DBNs) have become popular in the speech community over the last few years, showing significant gains over state-of-the-art Gaussian Mixture Model (GMM)/Hidden Markov Model (HMM) systems on a wide variety of small and large vocabulary tasks, including large vocabulary continuous speech recognition (LVCSR) tasks. However, training DBNs is slow, in part because DBNs can have a much larger number of parameters (e.g., 10-50 million) than GMMs. Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer.
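The concentration of parameters in the final weight layer can be illustrated with a short calculation. The following is a minimal sketch only: the layer sizes (400 inputs, five hidden layers of 1024 units, 10,000 output targets) and the helper name `layer_param_counts` are hypothetical values chosen for illustration and are not taken from any particular system.

```python
def layer_param_counts(layer_sizes):
    """Return weights plus biases for each fully connected layer."""
    return [n_in * n_out + n_out
            for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]


# Hypothetical architecture (an assumption, not from the text):
# 400 input features, five hidden layers of 1024 units, 10,000 output targets.
layer_sizes = [400] + [1024] * 5 + [10000]

counts = layer_param_counts(layer_sizes)
total = sum(counts)
final_share = counts[-1] / total

print(f"total parameters: {total}")
print(f"final layer parameters: {counts[-1]}")
print(f"share held by the final layer: {final_share:.2f}")
```

With these assumed sizes the total falls within the 10-50 million range mentioned above, and the layer feeding the output targets holds more than half of all parameters.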
There have been some attempts in the speech recognition community to reduce the number of parameters in the DBN without significantly reducing final recognition accuracy. One common approach, known as “sparsification”, is to zero out weights which are close to zero. However, this reduces parameters only after the network architecture has been defined and therefore does not have any impact on training time. A second approach uses convolutional neural networks (CNNs), which reduce the parameters of the network by sharing weights across both the time and frequency dimensions of the speech signal. However, experiments show that in speech recognition, the best performance with CNNs is achieved when the number of parameters matches that of a DBN, and therefore parameter reduction with CNNs does not always hold in speech tasks.
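The sparsification approach described above can be sketched as a magnitude threshold applied to a weight matrix. This is an illustrative sketch under stated assumptions: the toy matrix size, the threshold of 0.05, and the function names `sparsify` and `count_nonzero` are hypothetical, and the random values merely stand in for already-trained weights.

```python
import random


def sparsify(weights, threshold):
    """Zero out every weight whose magnitude is below the threshold."""
    return [[0.0 if abs(w) < threshold else w for w in row]
            for row in weights]


def count_nonzero(weights):
    """Count the weights that are not exactly zero."""
    return sum(1 for row in weights for w in row if w != 0.0)


# Hypothetical toy layer; random values stand in for trained weights.
random.seed(0)
rows, cols = 8, 16
weights = [[random.gauss(0.0, 0.1) for _ in range(cols)]
           for _ in range(rows)]

pruned = sparsify(weights, threshold=0.05)

print(f"nonzero weights before: {count_nonzero(weights)}")
print(f"nonzero weights after: {count_nonzero(pruned)}")
print(f"matrix shape unchanged: {len(pruned)} x {len(pruned[0])}")
```

In this sketch the thresholding operates on a matrix that already exists at its full size, which is consistent with the observation that sparsification does not affect training time.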
Accordingly, there is a need for methods and systems for parameter reduction that can reduce training time while preserving final recognition accuracy.