The subject matter discussed in this section should not be assumed to be prior art merely as a result of its mention in this section. Similarly, a problem mentioned in this section or associated with the subject matter provided as background should not be assumed to have been previously recognized in the prior art. The subject matter in this section merely represents different approaches, which in and of themselves may also correspond to implementations of the claimed technology.
One limitation of deep end-to-end speech recognition models is a disparity between the objective function used during training and the evaluation criteria used during inference. In the training stage, a deep end-to-end speech recognition model optimizes a differentiable maximum likelihood objective function (MLOF) such as connectionist temporal classification (CTC). However, recognition accuracy during inference is evaluated using discrete, non-differentiable performance metrics such as word error rate (WER) and character error rate (CER), which are based on the minimum string edit distance between the ground truth transcription and the output transcription. Because of this disparity, it remains unclear how well the model approximates real-world speech during inference.
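For illustration only, the edit-distance basis of WER and CER can be sketched as follows. This is a minimal sketch, not the disclosed implementation: the function names `edit_distance`, `wer`, and `cer` are hypothetical, whitespace tokenization is assumed for words, and the error count is assumed to be normalized by the length of the ground truth transcription.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to turn `hyp` into `ref` (Levenshtein distance)."""
    # prev[j] holds the distance between the first i-1 items of ref
    # and the first j items of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # item of ref missing from hyp
                            curr[j - 1] + 1,      # extra item in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Both metrics are computed from integer edit counts over discrete symbol sequences, which is why they provide no gradient with respect to the model parameters.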
As discussed above, the model uses a maximum likelihood objective function (MLOF) to maximize the likelihood of training data, as opposed to optimizing the error rate evaluation metrics that actually quantify recognition quality. MLOF maximizes the log probability of getting the whole transcription completely correct. The relative probabilities of incorrect transcriptions are therefore ignored, which implies that they are all equally bad. In most cases, however, transcription performance is assessed in a more nuanced way. MLOF makes no distinction between incorrect transcriptions and, through normalization, penalizes them equally regardless of how near or far they are from the ground truth transcription.
In contrast, performance metrics such as WER and CER typically aim to reflect the plausibility of incorrect transcriptions. For example, WER assigns a smaller penalty to an output transcription that has a smaller edit distance to the ground truth transcription. This makes it possible for incorrect transcriptions with low WER to be preferred over those with high WER.
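The contrast between the two criteria can be made concrete with a toy example. The sketch below is an illustrative assumption, not part of the disclosure: `model_a` and `model_b` are hypothetical probability distributions over three candidate transcriptions, and the helper names `word_errors`, `negative_log_likelihood`, and `expected_wer` are invented for this example. Both distributions assign the same probability to the ground truth transcription, and differ only in how the remaining probability is split between a near miss and a far miss.

```python
import math


def word_errors(reference: str, hypothesis: str) -> int:
    """Word-level edit distance between reference and hypothesis."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def negative_log_likelihood(dist: dict, reference: str) -> float:
    """Maximum likelihood loss: depends only on the ground truth probability."""
    return -math.log(dist[reference])


def expected_wer(dist: dict, reference: str) -> float:
    """Probability-weighted WER over all candidate transcriptions."""
    n = len(reference.split())
    return sum(p * word_errors(reference, hyp) / n for hyp, p in dist.items())


reference = "the cat sat on the mat"

# Hypothetical distributions: same mass on the ground truth, but model_a
# favors the near miss and model_b favors the far miss.
model_a = {reference: 0.6, "the cat sat on a mat": 0.3, "a dog ran": 0.1}
model_b = {reference: 0.6, "the cat sat on a mat": 0.1, "a dog ran": 0.3}

for name, dist in (("model_a", model_a), ("model_b", model_b)):
    print(name,
          "NLL:", round(negative_log_likelihood(dist, reference), 4),
          "expected WER:", round(expected_wer(dist, reference), 4))
```

Under these assumed numbers, the two distributions receive an identical maximum likelihood loss, while the expected WER is lower for the distribution that places more probability on the near miss.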
Optimizing model parameters with the appropriate training function is crucial to achieving good model performance. An opportunity arises to directly improve a deep end-to-end speech recognition model with respect to evaluation metrics such as WER and CER, thereby improving relative performance for an end-to-end speech recognition model as compared to the same model trained through maximum likelihood. The disclosed systems and methods make it possible to achieve a new state-of-the-art WER for the deep end-to-end speech recognition model.