Modern speech recognition systems typically include an acoustic model and a language model. The acoustic model is used to generate hypotheses regarding which words or subword units (e.g., phonemes) correspond to an utterance based on the acoustic features of the utterance. The language model is used to determine which of the hypotheses generated using the acoustic model is the most likely transcription of the utterance based on lexical features of the language in which the utterance is spoken.
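To make the division of labor concrete, the following is a minimal sketch of how acoustic-model hypotheses might be rescored with a language model. The hypotheses, probabilities, the `lm_weight` parameter, and the function name `best_transcription` are all illustrative assumptions, not values or interfaces from any particular system; a real recognizer would operate on lattices or n-best lists produced by trained models.

```python
import math

# Hypothetical acoustic-model output: candidate transcriptions with
# log-probabilities of the audio given each candidate (made-up numbers).
acoustic_hypotheses = {
    "recognize speech": math.log(0.40),
    "wreck a nice beach": math.log(0.45),
}

# Hypothetical toy language model: log-probability of each word sequence
# in the language (made-up numbers).
language_model = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.0001),
}


def best_transcription(acoustic, lm, lm_weight=1.0):
    """Return the hypothesis with the highest combined log score."""
    unseen = math.log(1e-9)  # assumed floor for sequences the toy LM lacks

    def combined(hypothesis):
        return acoustic[hypothesis] + lm_weight * lm.get(hypothesis, unseen)

    return max(acoustic, key=combined)


print(best_transcription(acoustic_hypotheses, language_model))
```

In this toy setup the acoustic scores alone slightly favor "wreck a nice beach", but the language model score makes "recognize speech" the selected transcription. Setting `lm_weight=0.0` disables the language model and returns the acoustically preferred hypothesis instead.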
Acoustic models, language models, and other models used in speech recognition (collectively referred to as speech recognition models) may be specialized or customized to varying degrees. For example, a speech recognition system may have a general or base model that is not customized in any particular manner, and any number of additional models for particular genders, age ranges, regional accents, or any combination thereof. Some systems may have models for specific subject matter (e.g., medical terminology) or even for specific users.
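One way to picture a base model alongside specialized models is a lookup that picks the most specific model matching what is known about a speaker and falls back to the base model otherwise. The sketch below is an assumption made for illustration: the registry layout, the attribute names, the model names, and the function `select_model` are hypothetical and are not drawn from any specific system.

```python
from typing import Dict, FrozenSet, Tuple

# A model is registered under the set of (attribute, value) pairs it is
# specialized for. The empty set denotes the general or base model.
Attributes = FrozenSet[Tuple[str, str]]

# Hypothetical registry; model names are placeholders.
registry: Dict[Attributes, str] = {
    frozenset(): "base_model",
    frozenset({("gender", "female")}): "female_model",
    frozenset({("gender", "female"), ("accent", "scottish")}): "female_scottish_model",
    frozenset({("domain", "medical")}): "medical_model",
}


def select_model(models: Dict[Attributes, str], profile: Dict[str, str]) -> str:
    """Pick the most specific model whose attributes all match the profile."""
    known = set(profile.items())
    # The base model (empty attribute set) always matches, so this list
    # is never empty as long as a base model is registered.
    matching = [attrs for attrs in models if attrs <= known]
    return models[max(matching, key=len)]


print(select_model(registry, {"gender": "female", "accent": "scottish"}))
print(select_model(registry, {"accent": "texan"}))
```

Here the first profile resolves to the combined gender-and-accent model, while the second matches no specialized model and resolves to the base model.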
Speech recognition systems may be client-based or client-server-based. For example, a computing device such as a laptop computer may include application software and data to process audio input into text output or a listing of likely transcriptions of the audio input. Some speech recognition systems accept audio input via a personal or mobile computing device and transfer the audio input to a network-accessible server, where the audio input is transcribed or other processing is performed.
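The two deployment styles can be contrasted with a small sketch. Everything here is a stand-in chosen for illustration: `fake_recognizer` returns canned text rather than performing recognition, `fake_server` replaces a real network round trip with a local function call, and the class names are hypothetical.

```python
from typing import Callable, List

# A recognizer maps raw audio bytes to a ranked list of likely transcriptions.
Recognizer = Callable[[bytes], List[str]]


def fake_recognizer(audio: bytes) -> List[str]:
    """Stand-in for real recognition; returns canned candidates."""
    return ["hello world", "hollow world"] if audio else []


def fake_server(audio: bytes) -> List[str]:
    """Stand-in for a network-accessible server that transcribes audio."""
    return fake_recognizer(audio)


class ClientBasedSystem:
    """The device holds the software and data and transcribes locally."""

    def __init__(self, recognizer: Recognizer) -> None:
        self._recognizer = recognizer

    def transcribe(self, audio: bytes) -> List[str]:
        return self._recognizer(audio)


class ClientServerSystem:
    """The device only captures audio and forwards it to a server."""

    def __init__(self, send_to_server: Recognizer) -> None:
        self._send_to_server = send_to_server

    def transcribe(self, audio: bytes) -> List[str]:
        return self._send_to_server(audio)


audio = b"\x00\x01"  # placeholder bytes, not real audio
print(ClientBasedSystem(fake_recognizer).transcribe(audio))
print(ClientServerSystem(fake_server).transcribe(audio))
```

Both classes expose the same `transcribe` method, so calling code does not need to know where the recognition work is carried out.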