The present disclosure relates to speech processing, and more particularly to a method for augmenting training data for speech recognition.
Data augmentation based on label-preserving transformations has been shown to be very effective at improving the robustness of deep neural networks, especially when the training data is limited. It is commonly used in image recognition, where transformations such as translation, rotation, scaling, and reflection have led to significant improvements in recognition accuracy.
Data augmentation in speech-related applications is not a new practice. For instance, in what is sometimes called multi-style training, artificially noisy speech data is generated by adding noise to clean speech data in order to train noise-robust acoustic models for automatic speech recognition (ASR). Another example is IMELDA, where multi-condition transforms are learned from tilted, noisy, and un-degraded speech data so that the sensitivity of the transforms to those conditions is reduced.
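By way of illustration only, the noise-addition step of multi-style training may be sketched as follows. The function name `add_noise_at_snr`, the use of a target signal-to-noise ratio (SNR) in decibels, and the random selection of a noise segment are assumptions made for this sketch and are not taken from any particular prior system.

```python
import numpy as np


def add_noise_at_snr(clean, noise, snr_db, rng=None):
    """Return `clean` mixed with a noise segment scaled to the requested SNR (dB).

    Hypothetical helper for illustration; the name and signature are assumptions.
    """
    rng = np.random.default_rng() if rng is None else rng
    clean = np.asarray(clean, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)

    # Repeat the noise recording if it is shorter than the clean utterance.
    if len(noise) < len(clean):
        reps = int(np.ceil(len(clean) / len(noise)))
        noise = np.tile(noise, reps)

    # Pick a random noise segment with the same length as the clean utterance.
    start = rng.integers(0, len(noise) - len(clean) + 1)
    segment = noise[start:start + len(clean)]

    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(segment ** 2)
    if clean_power == 0.0 or noise_power == 0.0:
        # Nothing meaningful to scale against; return the utterance unchanged.
        return clean.copy()

    # Choose the scale so that clean_power / (scale**2 * noise_power)
    # equals 10 ** (snr_db / 10).
    scale = np.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return clean + scale * segment
```

Under these assumptions, each clean utterance may be mixed with several noise types at several SNR values, and every resulting copy keeps the transcription of the clean utterance, so the transformation is label-preserving.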
For deep neural network (DNN) and convolutional neural network (CNN) acoustic modeling, which currently achieves state-of-the-art performance in ASR, there is comparatively little reported work on data augmentation algorithms specifically designed to deal with speaker variability and acoustic variability in DNN or CNN training. Most recently, vocal tract length perturbation (VTLP) was proposed for augmenting data in CNN training, and experiments on the TIMIT database have shown improvements in phone error rate (PER). Data augmentation using stochastic feature mapping (SFM) has also been proposed for DNN acoustic modeling. SFM augments training data by mapping speech features from a source speaker to a target speaker, which is equivalent to a special type of voice conversion in a designated feature space.
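By way of illustration only, the frequency warping underlying VTLP may be sketched as a piecewise-linear function of frequency controlled by a warp factor. The function name `vtlp_warp`, the boundary frequency `f_hi` of 4800 Hz, the 16 kHz sampling rate, and the warp-factor range of 0.9 to 1.1 are assumptions chosen for this sketch, and the formula is the commonly cited piecewise-linear form rather than one recited in the present disclosure.

```python
import numpy as np


def vtlp_warp(freqs, alpha, f_hi=4800.0, sample_rate=16000.0):
    """Piecewise-linear frequency warping with warp factor `alpha`.

    Hypothetical helper for illustration. Frequencies up to a boundary are
    scaled by `alpha`; the remainder is mapped linearly so that the Nyquist
    frequency maps to itself.
    """
    freqs = np.asarray(freqs, dtype=np.float64)
    nyquist = sample_rate / 2.0
    knee = f_hi * min(alpha, 1.0)   # warped frequency at the boundary
    boundary = knee / alpha         # unwarped frequency at the boundary

    lower = freqs * alpha
    upper = nyquist - (nyquist - knee) / (nyquist - boundary) * (nyquist - freqs)
    return np.where(freqs <= boundary, lower, upper)


# Example: draw one warp factor per utterance (the range is an assumption)
# and warp the center frequencies of a hypothetical filter bank.
rng = np.random.default_rng(0)
alpha = rng.uniform(0.9, 1.1)
centers = np.linspace(0.0, 8000.0, 41)
warped_centers = vtlp_warp(centers, alpha)
```

In this sketch, a warp factor of 1.0 leaves every frequency unchanged, while other values stretch or compress the frequency axis to imitate a different vocal tract length without altering the phonetic label of the utterance.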