A speech recognition system commonly includes an acoustic model (AM) and a language model (LM). The acoustic model summarizes the probability distribution of acoustic features relative to phoneme units, while the language model summarizes the occurrence probability of word sequences (word context). The speech recognition process obtains the result with the highest score from a weighted sum of the probability scores of the two models.
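The weighted combination of the two model scores can be illustrated with a minimal sketch. It assumes that the scores are log-probabilities and that a single language model weight is applied; the function names, the weight value, and the example hypotheses with their scores are hypothetical and serve only as an illustration.

```python
def combined_score(am_logprob, lm_logprob, lm_weight=1.0):
    """Weighted sum of the acoustic and language model scores (log domain, assumed)."""
    return am_logprob + lm_weight * lm_logprob


def best_hypothesis(hypotheses, lm_weight=1.0):
    """Return the (text, am_logprob, lm_logprob) tuple with the highest combined score."""
    return max(hypotheses, key=lambda h: combined_score(h[1], h[2], lm_weight))


# Hypothetical candidates: (word sequence, acoustic log-score, language log-score).
hypotheses = [
    ("recognize speech", -12.0, -4.0),
    ("wreck a nice beach", -11.5, -9.0),
]

best = best_hypothesis(hypotheses, lm_weight=1.0)
print(best[0])
```

In this illustration the second candidate has the slightly better acoustic score, but the language model score makes the first candidate the overall winner.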
In recent years, the neural network acoustic model (NN AM) has been introduced into speech recognition systems as a novel method and has greatly improved recognition performance.
In neural network acoustic model training, the traditional technique is to obtain the output target of each phonetic feature sample by forced alignment, set the probability of that target to one, and then train the acoustic model based on cross entropy.
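The single-target criterion described above can be sketched as follows. This is a minimal illustration, assuming a softmax output layer and a three-state output; the logits and the aligned state index are hypothetical values, not taken from any real alignment.

```python
import math


def softmax(logits):
    """Convert network outputs into a probability distribution over states."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def one_hot(index, num_states):
    """Target distribution with probability one on the aligned state."""
    return [1.0 if i == index else 0.0 for i in range(num_states)]


def cross_entropy(target, predicted):
    """Cross entropy between a target distribution and the predicted one."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0.0)


logits = [2.0, 1.0, 0.1]   # hypothetical network outputs for one frame
aligned_state = 0          # hypothetical state assigned by forced alignment
posterior = softmax(logits)
loss = cross_entropy(one_hot(aligned_state, len(logits)), posterior)
print(loss)
```

With a one-hot target, the loss reduces to the negative log-probability that the network assigns to the aligned state.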
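Later, another technique was introduced that uses the probability distribution over all the output targets as the target output and trains the acoustic model based on the KL distance (Kullback-Leibler divergence, also referred to as KL divergence), which is equivalent to cross entropy as a training criterion because the two differ only by the entropy of the target distribution, which does not depend on the model.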
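The relation between the KL divergence and cross entropy can be checked numerically. The sketch below assumes discrete distributions over three output states; the target distribution and the two model outputs are hypothetical values chosen only for illustration.

```python
import math


def entropy(p):
    """Entropy of the target distribution p."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def cross_entropy(p, q):
    """Cross entropy of model output q with respect to target p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0.0)


def kl_divergence(p, q):
    """KL divergence from target p to model output q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


target = [0.7, 0.2, 0.1]    # hypothetical soft target over three states
model_a = [0.6, 0.3, 0.1]   # hypothetical model output, close to the target
model_b = [0.4, 0.4, 0.2]   # hypothetical model output, further from the target

for q in (model_a, model_b):
    # KL(p || q) = cross_entropy(p, q) - entropy(p)
    print(kl_divergence(target, q), cross_entropy(target, q) - entropy(target))
```

Since the entropy of the target is the same for every model output, both criteria rank the candidate outputs identically and have the same gradient with respect to the model parameters.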
In traditional neural network acoustic model training, neither single-target training nor training on all the output targets makes reasonable use of the similarity among the training targets, and both lack selection and filtering of the training targets.
For single-target training, given a training sample, the probability of the output target state is one and the probability of every other state is zero. Such training ignores the similarity between the output target state and the other states and destroys the true probability distribution of the target state output. For example, other states that are very similar to the output target state should also receive a reasonable probability.
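One way to see that a single-target criterion ignores state similarity is the following minimal sketch. It assumes three output states, where state 1 is a hypothetical state similar to the target and state 2 is a hypothetical dissimilar state; the predicted probabilities are illustrative values only.

```python
import math


def one_hot_cross_entropy(target_index, predicted):
    """Cross entropy against a one-hot target: only the target state's probability matters."""
    return -math.log(predicted[target_index])


target_state = 0

# Both predictions give the target state the same probability. The remaining
# mass goes mostly to the similar state in one case and to the dissimilar
# state in the other (hypothetical values).
pred_mass_on_similar = [0.6, 0.35, 0.05]
pred_mass_on_dissimilar = [0.6, 0.05, 0.35]

loss_similar = one_hot_cross_entropy(target_state, pred_mass_on_similar)
loss_dissimilar = one_hot_cross_entropy(target_state, pred_mass_on_dissimilar)
print(loss_similar, loss_dissimilar)
```

The two losses are identical, so the one-hot criterion does not distinguish a prediction that places the remaining probability on a similar state from one that places it on a dissimilar state.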
Training on all the output targets likewise does not make reasonable use of the similarity among the training targets, and it lacks selection and filtering of the training targets.
Furthermore, in traditional neural network acoustic model training with multiple output target states, using cross entropy as the training criterion is not flexible enough and cannot learn the true probability distribution of the output targets from different aspects.