Deep Neural Networks (DNNs) are well known in acoustic modeling for speech recognition, showing relative improvements of about 10%-30% over previous modeling methods across a variety of small and large vocabulary tasks. Recently, deep convolutional neural networks (CNNs) have been explored as an alternative type of neural network which can reduce translational variance in an input signal. For example, deep CNNs have been shown to offer a 4%-12% relative improvement over DNNs across a variety of large vocabulary continuous speech recognition (LVCSR) tasks. Since CNNs model correlation in time and frequency, they require an input feature space that has this property. As a result, commonly used feature spaces, such as those produced by Linear Discriminant Analysis (LDA), cannot be used with CNNs. Common speech features which are correlated in time and frequency include Fast Fourier Transform (FFT) and Mel Filterbank (melFB) features.
Correlated features are better modeled by full-covariance Gaussians than by diagonal Gaussians. However, full-covariance matrices dramatically increase the number of parameters per Gaussian component, often leading to parameter estimates which are not robust. Semi-tied covariance matrices (STCs) have been used to decorrelate a feature space so that it can be modeled by diagonal Gaussians. STC allows a few full covariance matrices to be shared over many distributions, while each distribution has its own diagonal covariance matrix. A covariance matrix can be full or diagonal. When the matrix is diagonal, the dimensions are not correlated, but when the matrix is full, the dimensions are correlated.
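The trade-off described above can be illustrated with a minimal sketch. The function name gmm_covariance_params, the 40-dimensional feature size, and the 1000-component model are hypothetical values chosen only for illustration. The second half decorrelates a toy two-dimensional feature space using an eigenvector rotation of a single sample covariance; this is a simple stand-in for illustration and is not the maximum-likelihood estimation used for an actual STC transform.

```python
import numpy as np


def gmm_covariance_params(dim, num_components, num_shared_transforms=1):
    """Count covariance parameters for three modeling choices (illustrative)."""
    # A full symmetric covariance has dim * (dim + 1) / 2 free parameters.
    full = num_components * dim * (dim + 1) // 2
    # A diagonal covariance has one variance per dimension.
    diagonal = num_components * dim
    # Semi-tied: one diagonal per component plus a few shared dim x dim matrices.
    semi_tied = num_components * dim + num_shared_transforms * dim * dim
    return {"full": full, "diagonal": diagonal, "semi_tied": semi_tied}


# Hypothetical model size: 40-dimensional features, 1000 Gaussian components.
counts = gmm_covariance_params(dim=40, num_components=1000)
print(counts)  # {'full': 820000, 'diagonal': 40000, 'semi_tied': 41600}

# Toy correlated features: the off-diagonal covariance term is clearly non-zero.
rng = np.random.default_rng(0)
true_cov = np.array([[2.0, 1.2], [1.2, 1.0]])
x = rng.multivariate_normal(np.zeros(2), true_cov, size=5000)
sample_cov = np.cov(x, rowvar=False)

# Assumption: rotate by the eigenvectors of the sample covariance. In the
# rotated space the covariance is diagonal, so a diagonal Gaussian fits it.
_, eigvecs = np.linalg.eigh(sample_cov)
transform = eigvecs.T
x_decorrelated = x @ transform.T
decorrelated_cov = np.cov(x_decorrelated, rowvar=False)

print("off-diagonal before:", sample_cov[0, 1])
print("off-diagonal after: ", decorrelated_cov[0, 1])
```

Under these assumed sizes, the semi-tied scheme costs only slightly more than purely diagonal covariances, while full covariances need roughly twenty times as many parameters.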
Feature-space maximum likelihood linear regression (fMLLR) is a speaker-adaptation technique used to reduce variability of speech due to different speakers. fMLLR is a transformation that is applied to features, assuming either that these features are uncorrelated and can be modeled by diagonal covariance Gaussians, or that the features are correlated and can be modeled by full covariance Gaussians.
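As a rough sketch of what applying such a transformation involves, fMLLR in its standard formulation is an affine map x' = A x + b applied to each feature vector, and the likelihood under the transformed features includes a log-determinant term for A. The function names below (apply_fmllr, diag_gaussian_loglik, fmllr_loglik) and the toy dimensions are assumptions made for illustration; estimation of A and b from speaker data is not shown.

```python
import numpy as np


def apply_fmllr(features, A, b):
    """Apply the affine map x' = A x + b to each row of features (T x d)."""
    return features @ A.T + b


def diag_gaussian_loglik(x, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    d = x.shape[-1]
    mahalanobis = np.sum((x - mean) ** 2 / var, axis=-1)
    return -0.5 * (d * np.log(2.0 * np.pi) + np.sum(np.log(var)) + mahalanobis)


def fmllr_loglik(features, A, b, mean, var):
    """Log-likelihood of the original features after the affine transform.

    The log|det A| term accounts for the change of variables, so that
    likelihoods remain comparable across different transforms.
    """
    _, log_abs_det = np.linalg.slogdet(A)
    transformed = apply_fmllr(features, A, b)
    return diag_gaussian_loglik(transformed, mean, var) + log_abs_det


# Toy usage: with the identity transform, the score matches the baseline.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))
mean = np.zeros(3)
var = np.ones(3)

baseline_ll = diag_gaussian_loglik(features, mean, var)
identity_ll = fmllr_loglik(features, np.eye(3), np.zeros(3), mean, var)
print(np.allclose(baseline_ll, identity_ll))  # True
```

The diagonal Gaussian here corresponds to the uncorrelated-feature assumption; under the full-covariance assumption, the per-dimension variance vector would be replaced by a full covariance matrix.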
Due to issues in parameter estimation with full covariance Gaussians, fMLLR is more commonly applied to a decorrelated space. When fMLLR is applied to a correlated feature space with a diagonal covariance assumption, little improvement in word error rate (WER) has been observed. Accordingly, there is a need for systems and methods which improve WER by applying fMLLR to correlated features using a diagonal Gaussian approximation.