"Automatic speech recognition " signifies machine conversion of sounds, created by or simulating natural human speech, into a machine-recognizable representation indicative of a word or words actually spoken. Typically, sounds are converted to a speech signal, such as an analog or digital electrical signal, which the machine then processes. Automatic speech recognition involves recognizing a spoken word as such, not determining the meaning of the word. Automatic speech recognition may be either continuous or performed on isolated words. (Determining the meaning of a spoken word is a problem of speech understanding and may require, for example, that the contextual use of the word be analyzed).
To date, numerous machines and processes for automatic speech recognition have been proposed. Most automatic speech recognition systems currently available commercially include computer programs which process a speech signal using statistical models of the speech signals generated from different spoken words. This technique is generally known as the hidden Markov model (HMM) method and is computationally intensive. Each word which can be recognized by the machine typically must have a hidden Markov model derived for it based on the spectrum of one or more acoustic images of the word. Also, the HMMs for all of the words which the system is capable of recognizing typically must be derived together. Thus, adding a word to the set of words recognizable by the machine typically involves restructuring the whole lexicon. Some of these systems perform a type of segmentation of the speech signal to identify "syllables," which are then processed using HMMs.
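By way of illustration only, the per-word scoring that underlies the HMM method may be sketched as follows. The sketch assumes that the speech signal has already been quantized into a short sequence of discrete symbols, and it scores that sequence against one small model per word using the forward algorithm. The function names, the two word models and all probability values are hypothetical and are not taken from any particular system.

```python
import math

def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(obs | model) for a discrete HMM (forward algorithm).

    obs   : list of observation symbol indices (assumed already quantized)
    start : start[s]     = P(first state is s)
    trans : trans[p][s]  = P(next state is s | current state is p)
    emit  : emit[s][o]   = P(symbol o | state s)
    """
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_like = 0.0
    for symbol in obs[1:]:
        # Rescale at each step so long sequences do not underflow.
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        log_like += math.log(scale)
        alpha = [
            sum((alpha[p] / scale) * trans[p][s] for p in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    total = sum(alpha)
    return log_like + math.log(total) if total > 0.0 else float("-inf")

# Hypothetical two-state, left-to-right models over three symbols.
WORD_MODELS = {
    "yes": {
        "start": [1.0, 0.0],
        "trans": [[0.6, 0.4], [0.0, 1.0]],
        "emit": [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1]],
    },
    "no": {
        "start": [1.0, 0.0],
        "trans": [[0.6, 0.4], [0.0, 1.0]],
        "emit": [[0.1, 0.1, 0.8], [0.1, 0.8, 0.1]],
    },
}

def recognize(obs, models):
    """Return the word whose model gives the observation the highest likelihood."""
    return max(models, key=lambda w: forward_log_likelihood(obs, **models[w]))
```

In this sketch every word model is scored against every input, which gives some indication of why the approach is computationally intensive when the lexicon is large.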
Another kind of automatic speech recognition system uses phoneme matrices. A phoneme matrix indicates mouth position over a period of time, according to binary, or bivalent, articulatory variables representing a vocal tract configuration used to create a sound. For example, the theory known as generative phonology recognizes about twenty binary features from which a phoneme matrix is constructed. A segmented phonetic sequence is extracted from a speech signal and converted by a set of rules to obtain a phoneme matrix. The phoneme matrix is then compared to stored sample phoneme matrices for known words using time-warping techniques. This approach has generally been discredited because the rules used to generate phoneme matrices are largely arbitrary and are unlikely to be adequate models of how humans process speech.
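The construction of a phoneme matrix from a segmented phonetic sequence may be illustrated with a minimal sketch. The sketch assumes a toy subset of three binary features rather than the roughly twenty mentioned above, and a hypothetical rule table covering four phonemes; the names FEATURES, PHONEME_FEATURES, phoneme_matrix and hamming are introduced here for illustration only.

```python
# Toy subset of binary articulatory features (an assumption for illustration).
FEATURES = ("voiced", "nasal", "continuant")

# Hypothetical rule table: phoneme -> one binary value per feature.
PHONEME_FEATURES = {
    "m": (1, 1, 0),
    "a": (1, 0, 1),
    "s": (0, 0, 1),
    "t": (0, 0, 0),
}

def phoneme_matrix(segments):
    """Expand (phoneme, n_frames) segments into one feature row per time frame."""
    matrix = []
    for phoneme, n_frames in segments:
        row = PHONEME_FEATURES[phoneme]
        matrix.extend([row] * n_frames)
    return matrix

def hamming(row_a, row_b):
    """Count the features on which two frames disagree."""
    return sum(a != b for a, b in zip(row_a, row_b))
```

A frame-to-frame measure such as hamming could serve as the local cost when an input matrix is compared with a stored matrix by time warping.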
Another kind of automatic speech recognition system is template-based. Each word has a template which represents the spectral evolution of the word over time. Such a system also uses time-warping techniques, but uses them to match the spectral change of an input speech signal with the stored templates.
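The template matching described above may be illustrated by a minimal dynamic time warping sketch. It assumes that both the input and each stored template are sequences of spectral feature vectors of equal dimension, and it uses Euclidean distance between frames as the local cost; the function names and the choice of distance are assumptions made for illustration only.

```python
import math

def euclidean(u, v):
    """Distance between two spectral frames of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def dtw_distance(seq_a, seq_b, dist=euclidean):
    """Accumulated cost of the best monotonic alignment of seq_a with seq_b."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cost of aligning the first i frames of seq_a
    # with the first j frames of seq_b.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(seq_a[i - 1], seq_b[j - 1])
            # A frame may be stretched, compressed, or matched one to one.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def best_template(signal, templates):
    """Return the word whose stored template is closest to the input signal."""
    return min(templates, key=lambda w: dtw_distance(signal, templates[w]))
```

Because the alignment may stretch or compress either sequence, a word spoken more slowly than its template can still be matched at low cost.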