Computer-implemented Automatic Speech Recognition (ASR) systems typically employ both an acoustic model and a language model of speech to convert audio representations of human speech into text. A commercial ASR system is typically initially configured with a speaker-independent acoustic model and a general language model. The ASR system may be “trained” with the speech of a particular speaker to achieve increased accuracy when processing speech from that speaker. Such training adapts the acoustic model and the language model by tailoring them to the speaker's voice and lexicon, respectively. Accordingly, the training process is often referred to as acoustic and language model training. Acoustic model training is typically performed using a training dataset of speech samples recorded while the speaker reads aloud a prepared text supplied by the ASR system manufacturer. The language model training process, by contrast, typically requires text input only. The text input preferably embodies the speaker's linguistic habits and the lexical domain of interest. Such representative text input is used to enrich the recognition vocabulary and refine the word statistics stored in the general language model.
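By way of illustration only, the effect of language model training on vocabulary and word statistics can be sketched with a simple unigram model. The function name `adapt_unigram_model`, the `weight` parameter, and the use of linear interpolation are assumptions made for this sketch; practical ASR language models are considerably richer (for example, n-gram or neural models), and nothing here is drawn from a particular commercial system.

```python
from collections import Counter


def adapt_unigram_model(general_counts, speaker_text, weight=0.5):
    """Blend a general unigram model with statistics from speaker text.

    general_counts: dict mapping word -> count in the general model.
    speaker_text:   representative text from the speaker or domain.
    weight:         hypothetical interpolation weight given to the speaker text.
    Returns a dict mapping word -> adapted probability.
    """
    speaker_counts = Counter(speaker_text.lower().split())
    general_total = sum(general_counts.values())
    speaker_total = sum(speaker_counts.values())

    # Vocabulary enrichment: words seen only in the speaker text are added.
    vocabulary = set(general_counts) | set(speaker_counts)

    adapted = {}
    for word in vocabulary:
        p_general = general_counts.get(word, 0) / general_total if general_total else 0.0
        p_speaker = speaker_counts.get(word, 0) / speaker_total if speaker_total else 0.0
        # Statistics refinement: linear interpolation of the two estimates.
        adapted[word] = (1 - weight) * p_general + weight * p_speaker
    return adapted


general = {"the": 6, "patient": 2, "is": 2}
adapted = adapt_unigram_model(general, "the stent is patent", weight=0.5)
```

In this toy case the words "stent" and "patent" are absent from the general model and enter the vocabulary through the speaker text, while the probabilities of words already known are shifted toward the speaker's usage.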
Companies that provide commercial transcription services may use ASR systems to initially process speech and produce a rough transcript together with time offsets of the transcribed words found in the speech audio. A human proofreader then typically compares the rough transcript to the audio and corrects the transcript. The time offsets can be used for synchronization between the audio playback and the transcript display. By their nature, such services typically must rely on speaker-independent acoustic models and general language models, as the speakers are generally not “known” to the ASR systems or available to train them. As a result, the ASR systems used by such services are typically less accurate than ASR systems that employ trained, speaker-dependent models, which increases the burden on the human proofreaders.
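As a minimal sketch of how word time offsets may support synchronization between audio playback and transcript display, the following looks up the word being spoken at a given playback time. The function name `word_index_at` and the representation of the transcript as parallel lists of words and start offsets in seconds are assumptions made for illustration, not a description of any particular transcription product.

```python
from bisect import bisect_right


def word_index_at(start_offsets, playback_time):
    """Return the index of the word being spoken at playback_time.

    start_offsets: ascending start times (seconds) of the transcribed words.
    playback_time: current audio playback position in seconds.
    Returns None if playback has not yet reached the first word.
    """
    # Index of the last word whose start offset is <= playback_time.
    index = bisect_right(start_offsets, playback_time) - 1
    return index if index >= 0 else None


words = ["speech", "recognition", "systems"]
start_offsets = [0.0, 0.4, 0.9]

current = word_index_at(start_offsets, 0.5)
highlighted = words[current] if current is not None else None
```

A proofreading display could call such a lookup on each playback tick and highlight the returned word; binary search keeps the lookup fast even for long transcripts.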