Speech recognition, also called automatic speech recognition (ASR) or computer speech recognition, converts spoken words to text. The term “voice recognition” is sometimes used for speech recognition in which the system is trained on a particular speaker in order to identify that person from the distinctive characteristics of their voice.
Speech recognition systems are generally based on hidden Markov models (HMMs), which are statistical models that output a sequence of symbols or quantities. A speech signal can be viewed as a piecewise stationary or short-time stationary signal: over a sufficiently short interval, speech can be approximated as a stationary process. Speech can thus be treated as a Markov model for many stochastic purposes.
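The short-time stationarity assumption is usually applied by cutting the signal into short, overlapping frames and treating each frame as stationary. The following is a minimal sketch of that framing step. The function name split_into_frames and the 25 ms frame length and 10 ms hop are illustrative assumptions, not values taken from the text above.

```python
def split_into_frames(samples, sample_rate, frame_ms=25.0, hop_ms=10.0):
    """Cut a list of samples into short, overlapping frames.

    frame_ms and hop_ms are assumed, illustrative values; each returned
    frame is short enough to be treated as approximately stationary.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    hop_len = int(sample_rate * hop_ms / 1000)
    frames = []
    # Slide a fixed-length window over the signal, dropping any partial tail.
    for start in range(0, len(samples) - frame_len + 1, hop_len):
        frames.append(samples[start:start + frame_len])
    return frames


# One second of silence at an assumed 16 kHz sampling rate.
one_second = [0.0] * 16000
frames = split_into_frames(one_second, sample_rate=16000)
```

Each frame would then be converted into one feature vector, so the recognizer sees a sequence of vectors rather than raw samples.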
The HMMs output a sequence of n-dimensional real-valued vectors, one for each short segment over which the signal is treated as stationary. The vectors consist of cepstral coefficients, which are obtained by taking a Fourier transform of a short time window of speech, de-correlating the resulting spectrum (typically with a cosine transform), and keeping the first (most significant) coefficients. The HMM has, in each state, a statistical distribution that gives a likelihood for each observed vector. Each word or each phoneme may have a different output distribution. An HMM for a sequence of words or phonemes is made by concatenating the individually trained HMMs for the separate words or phonemes.
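The cepstral computation described above can be sketched for a single frame as follows. This is a simplified illustration using only the Python standard library, not a production front end: the Hamming window, the logarithm of the magnitude spectrum, the DCT-II used for de-correlation, and the choice of 13 coefficients are assumptions that follow common practice, and the mel filterbank used in many real systems is omitted. The name cepstral_coefficients is hypothetical.

```python
import cmath
import math


def cepstral_coefficients(frame, num_coeffs=13):
    """Return the first num_coeffs cepstral coefficients of one short frame."""
    n = len(frame)
    # Taper the frame edges with a Hamming window (assumed, common choice).
    windowed = [
        s * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    # Fourier transform: naive DFT over the non-negative frequencies,
    # followed by the log of the magnitude (small offset avoids log(0)).
    log_spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(
            windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n)
        )
        log_spectrum.append(math.log(abs(acc) + 1e-10))
    # De-correlate the log spectrum with a DCT-II and keep the first coefficients.
    m = len(log_spectrum)
    return [
        sum(log_spectrum[j] * math.cos(math.pi * q * (j + 0.5) / m) for j in range(m))
        for q in range(num_coeffs)
    ]


# A 10 ms frame of a 440 Hz tone at an assumed 8 kHz sampling rate.
frame = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(80)]
coeffs = cepstral_coefficients(frame)
```

The resulting list is one n-dimensional observation vector (here n = 13) of the kind whose likelihood the HMM output distributions are evaluated on.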
Decoding of speech (i.e., when an ASR system is presented with a new utterance and computes the most likely source sentence) may be performed using a Viterbi decoder, which determines an optimal sequence of text given the audio signal, the expected grammar, and a set of HMMs that have been trained on a large set of data.
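The core of Viterbi decoding can be illustrated on a toy model. The sketch below finds the most likely state sequence of a small HMM from per-frame log-likelihoods; it leaves out the grammar, the lexicon, and the mapping from states to words that a real recognizer needs. The function name viterbi, the two-state left-to-right model, and all probabilities are made-up illustrative assumptions.

```python
import math

NEG_INF = float("-inf")


def viterbi(obs_loglik, log_init, log_trans):
    """Return (best state path, its log probability).

    obs_loglik[t][s] is the log-likelihood of frame t under state s,
    log_init[s] the log initial probability of state s, and
    log_trans[p][s] the log probability of moving from state p to state s.
    """
    num_states = len(log_init)
    score = [log_init[s] + obs_loglik[0][s] for s in range(num_states)]
    backpointers = []
    for frame in obs_loglik[1:]:
        new_score, pointers = [], []
        for s in range(num_states):
            # Best predecessor of state s at this frame.
            best_prev = max(
                range(num_states), key=lambda p: score[p] + log_trans[p][s]
            )
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(range(num_states), key=lambda s: score[s])
    best_score = score[state]
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path, best_score


# Toy left-to-right model: always start in state 0, never return from state 1.
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.5), math.log(0.5)], [NEG_INF, 0.0]]
# Four frames: the first two fit state 0 better, the last two fit state 1.
obs_loglik = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.9), math.log(0.1)],
    [math.log(0.1), math.log(0.9)],
    [math.log(0.1), math.log(0.9)],
]
path, best_score = viterbi(obs_loglik, log_init, log_trans)
```

For this toy input the decoded path is [0, 0, 1, 1]: the decoder stays in the first state for two frames and then moves to the second. Working in log probabilities avoids numerical underflow on long utterances.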