Training a speech recognition model typically requires a set of speech data files and their associated transcriptions. A transcription represents what a user voices into the recognizer. A transcription error is a mistake in transcribing the acoustic events in an utterance. Transcription errors can generally be classified as deletion errors, substitution errors, and insertion errors. A deletion error is a speech or non-speech event in the recorded signal (e.g., a cough or background noise) that is left untranscribed. A substitution error is a mistranscription or misinterpretation (e.g., French «é» transcribed as the phrase “accent aigu”). An insertion error is transcription that describes more than what is recorded in the audio (e.g., the wave file).
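The three error classes can be illustrated with a minimal sketch. It assumes, purely for illustration, that the true sequence of acoustic events in a recording is known. The function name `classify_errors`, its parameters, and the bracketed `[cough]` token are hypothetical and do not come from any particular toolkit. The sketch aligns the events with the transcription using a standard edit-distance alignment and labels each step.

```python
from typing import List, Tuple


def classify_errors(audio_events: List[str],
                    transcript: List[str]) -> List[Tuple[str, str, str]]:
    """Align what was recorded with what was transcribed.

    Returns (label, audio_token, transcript_token) tuples, where label is
    "match", "deletion", "substitution", or "insertion".
    """
    n, m = len(audio_events), len(transcript)
    # cost[i][j] = fewest edits turning audio_events[:i] into transcript[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diff = 0 if audio_events[i - 1] == transcript[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + diff,  # match / substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion

    # Walk back from the end to recover one optimal alignment.
    ops: List[Tuple[str, str, str]] = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diff = 0 if audio_events[i - 1] == transcript[j - 1] else 1
            if cost[i][j] == cost[i - 1][j - 1] + diff:
                label = "match" if diff == 0 else "substitution"
                ops.append((label, audio_events[i - 1], transcript[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            # Event present in the audio but missing from the transcription.
            ops.append(("deletion", audio_events[i - 1], ""))
            i -= 1
        else:
            # Transcription describes something not present in the audio.
            ops.append(("insertion", "", transcript[j - 1]))
            j -= 1
    ops.reverse()
    return ops


if __name__ == "__main__":
    # Hypothetical example: the cough in the recording was not transcribed.
    for op in classify_errors(["hello", "[cough]", "world"], ["hello", "world"]):
        print(op)
```

When several alignments have the same cost, the sketch returns only one of them, so the labels for heavily corrupted transcriptions should be read as one plausible explanation rather than the only one.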
Conventionally, training relies on the quality of the corpus provided by the vendor for transcription processing. Transcription errors in the training data blur the phones of the acoustic models and thus degrade recognition performance. As the accuracy of the models employed by a recognition system improves, transcription errors in the training data have a greater impact on overall operation and output quality. Conventional approaches to resolving transcription errors include manually reviewing the entire transcription to correct the errors. However, this is very expensive and time-consuming because the training data is often very large. Another approach is to randomly sample the data; however, this is not reliable.