Automatic speech recognition is a widely used technology, whose applications include dictation programs, caller menu programs for telephone systems, and voice-responsive ‘assistants’ on mobile telephones.
A problem with such systems is the computational load required to move from the digitally encoded speech to identifying the actual words spoken. Commercial systems rely on statistical and template-matching techniques, in which a particular acoustic spectrum and its changes over time are matched to a known set of spectra or spectral characteristics. In these systems, Hidden Markov Models and other general-purpose pattern-finding algorithms are used. The system is trained on exemplars of real speech, and takes its best guess at what information from any given signal is relevant to the task of recognition. The disadvantage of such systems is that they require a great deal of processing to match extremely information-rich spectra. Accordingly, dictation programs have to be trained to work effectively with a particular user's voice. Where this is not possible, such as in caller menu systems, only a relatively limited range of possible responses can be identified if robust operation is to be provided. Even then, conventional speech recognition systems may fail to correctly recognise speech with a strong regional or national accent, or where the speaker has a speech difficulty.
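By way of illustration only, the template-matching idea described above can be sketched as a nearest-template search. This is a toy example and not a description of any commercial system: it uses no Hidden Markov Model, and the function names, the labels "yes" and "no", the three-band spectra and all numerical values are assumptions invented for the sketch.

```python
import math

def frame_distance(frame_a, frame_b):
    """Euclidean distance between two spectral frames (lists of band energies)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(frame_a, frame_b)))

def sequence_distance(seq_a, seq_b):
    """Sum of frame-by-frame distances between two equal-length frame sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch only compares sequences of equal length")
    return sum(frame_distance(fa, fb) for fa, fb in zip(seq_a, seq_b))

def best_match(observed, templates):
    """Return the label of the stored template closest to the observed sequence."""
    return min(templates, key=lambda label: sequence_distance(observed, templates[label]))

# Hypothetical stored templates: two frames of three spectral bands per word.
TEMPLATES = {
    "yes": [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1]],
    "no": [[0.1, 0.2, 0.7], [0.0, 0.3, 0.7]],
}

# Hypothetical observed signal, already converted to the same representation.
observed = [[0.8, 0.2, 0.0], [0.3, 0.6, 0.1]]
print(best_match(observed, TEMPLATES))
```

Even in this simplified form, every observed frame is compared against every stored template, so the work grows with the number of templates, the number of frames and the number of spectral bands. This gives a sense of why matching information-rich spectra against a large vocabulary is computationally demanding.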
An alternative approach has been proposed, based on linguistic theory, in which individual phonological features are identified within the acoustic signal (see for example Lahiri, Aditi & Reetz, Henning, 2002. ‘Underspecified recognition.’ In Carlos Gussenhoven & Natasha Warner (eds.), Laboratory Phonology 7, 637-676, Berlin: Mouton de Gruyter). This approach is based on the fact that specific spoken sounds appear in the acoustic spectrum in identifiable ways, so that a section of speech can be used to identify a sequence of features. However, this approach has not to date been effectively implemented.