The present disclosure relates to speech recognition, and particularly to acoustic models used within an automatic speech recognition system.
Speech recognition or, more accurately, automatic speech recognition, involves a computerized process that converts spoken words to text. There are many uses for speech recognition, including speech transcription, speech translation, controlling devices and software applications by voice, call routing systems, voice search of the Internet, etc. Speech recognition systems can optionally be paired with spoken language understanding systems to extract meaning and/or commands to execute when interacting with machines.
Speech recognition systems are highly complex and operate by matching an acoustic signature of an utterance with acoustic signatures of words in a statistical language model. Thus, both acoustic modeling and language modeling are important in the speech recognition process. Acoustic models are created from audio recordings of spoken utterances together with their associated transcriptions. The acoustic model then defines statistical representations of individual sounds for corresponding words. A speech recognition system uses the acoustic model to identify a sequence of sounds, and uses the statistical language model to identify possible word sequences from the identified sounds. Accuracy of acoustic models is typically better when the acoustic model is created from a relatively large amount of training data. Likewise, accuracy is typically better when an acoustic model is trained for a specific speaker, instead of being trained for the general population of speakers.
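As a rough illustration of how the two models interact, the following Python sketch scores two candidate transcriptions of the same utterance by adding an acoustic log-score to a language model log-score and selecting the highest total. The candidate word sequences, all probability values, and the names `ACOUSTIC_LOGP`, `LM_LOGP`, `decode`, and `lm_weight` are hypothetical and chosen only for illustration; they are not taken from the present disclosure, and a practical system would search a far larger hypothesis space.

```python
import math

# Hypothetical acoustic scores: log P(audio | word sequence).
# The two candidates sound alike, so their acoustic scores are close.
ACOUSTIC_LOGP = {
    ("recognize", "speech"): math.log(0.40),
    ("wreck", "a", "nice", "beach"): math.log(0.45),
}

# Hypothetical language model scores: log P(word sequence).
LM_LOGP = {
    ("recognize", "speech"): math.log(0.010),
    ("wreck", "a", "nice", "beach"): math.log(0.0001),
}


def decode(acoustic_logp, lm_logp, lm_weight=1.0):
    """Return the candidate with the highest combined log-score."""

    def score(words):
        # Sum of log-scores corresponds to a product of probabilities.
        return acoustic_logp[words] + lm_weight * lm_logp[words]

    return max(acoustic_logp, key=score)


best = decode(ACOUSTIC_LOGP, LM_LOGP)
acoustic_only = decode(ACOUSTIC_LOGP, LM_LOGP, lm_weight=0.0)
print(" ".join(best))
print(" ".join(acoustic_only))
```

In this toy setup the acoustic scores alone slightly favor the less plausible word sequence, and the language model score changes the outcome, which is one reason both models matter to overall recognition accuracy.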