Conventional automatic speech recognition systems transform input signals representing speech utterances into discrete representations that are compared to stored representations and "recognized" using statistical matching techniques. In a typical system, as illustrated in FIG. 1, the input signal is filtered and digitized as a series of speech signal samples. The digitized signal samples are then converted into frames of speech data for successive short time segments, including, for example, amplitude values, fundamental and resonant frequencies, spectral energy, and frequency spectrum distribution and shape.
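By way of illustration only, the conversion of digitized samples into frames for successive short time segments can be sketched as follows. The function name `frame_features`, the frame length, the hop size, and the two features computed (peak amplitude and mean energy) are assumptions chosen for the sketch and are not taken from any particular conventional system.

```python
def frame_features(samples, frame_len=4, hop=2):
    """Split samples into overlapping frames and compute simple per-frame data.

    frame_len and hop are illustrative values (assumed), given in samples.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        peak = max(abs(s) for s in frame)             # amplitude value
        energy = sum(s * s for s in frame) / frame_len  # mean energy of the frame
        frames.append({"start": start, "peak_amplitude": peak, "energy": energy})
    return frames


# Hypothetical digitized input: eight samples of a filtered signal.
samples = [0.0, 0.5, -0.5, 1.0, -1.0, 0.25, 0.0, 0.0]
features = frame_features(samples)
```

A practical system would use much longer frames and would add spectral measurements such as fundamental and resonant frequencies, which are omitted here for brevity.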
The converted frames of speech data are stored and then processed in accordance with selected methods for extracting speech features and parameters. In most presently used systems, the endpoints between words or utterance units are detected and selected speech parameters of each word unit are extracted. The extracted parameters of the word unit are then compared by statistical pattern matching to the parameters of stored templates of a reference dictionary of word units. The differences between the parameters of the input word unit and the stored templates are statistically analyzed, and an acceptably close match or a list of possible close matches is selected by decision rules. The difficult problems of producing accurate word recognition output quickly and reliably from the results of the template matching process and, further, of interpreting the correct meaning of the recognized words in order to obtain a machine response are currently being addressed through high-level linguistic analyses of the prosodics, syntax, semantics, and pragmatics of the words, phrases, and sentences of the speech input.
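The template comparison and decision step described above can be sketched, under simplifying assumptions, as follows. A plain Euclidean distance stands in for the statistical pattern matching, and the decision rule is a fixed acceptance threshold; the function name `match_word`, the threshold `accept`, the limit `max_candidates`, and the example templates are all hypothetical.

```python
import math


def match_word(params, templates, accept=1.0, max_candidates=3):
    """Return up to max_candidates (word, distance) pairs, closest first.

    Only templates whose distance to params is within the assumed
    acceptance threshold are kept; an empty list means no close match.
    """
    scored = sorted((math.dist(params, t), word) for word, t in templates.items())
    close = [(word, d) for d, word in scored if d <= accept]
    return close[:max_candidates]


# Hypothetical reference dictionary: one parameter vector per word unit.
templates = {"yes": [1.0, 0.0], "no": [0.0, 1.0], "stop": [3.0, 3.0]}
candidates = match_word([0.9, 0.1], templates)
rejected = match_word([5.0, 5.0], templates)
```

Returning a ranked list rather than a single word reflects the option of passing several possible close matches on to later linguistic analysis.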
The stored dictionary of templates in conventional systems is created and/or updated for new words using a training procedure in which a speaker pronounces each word a number of times, and a training module generates corresponding templates representing weighted averages of the relevant template parameters of the pronounced words. Systems used for different tasks may employ different vocabularies, i.e., different word sets expected to be recognized by the system. Speaker-dependent systems store different dictionaries for different speakers because of the wide variations in pronunciation and speech syntax from speaker to speaker. Recognition systems are also operated in different modes, i.e., speaker-dependent or speaker-independent, and isolated-word or continuous speech recognition. As a result, a wide variety of recognition systems has been developed, and different training procedures are employed in each system depending upon the application, speaker(s), and/or operational mode.
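The weighted averaging performed by a training module can be illustrated by the following minimal sketch. It assumes that each pronunciation has already been reduced to a fixed-length parameter vector; the function name `train_template`, the weights, and the example values are hypothetical.

```python
def train_template(utterances, weights=None):
    """Combine repeated pronunciations of one word into a single template.

    utterances: equal-length parameter vectors, one per pronunciation.
    weights:    optional per-pronunciation weights (assumed; equal if omitted).
    """
    if weights is None:
        weights = [1.0] * len(utterances)
    total = sum(weights)
    dims = len(utterances[0])
    return [
        sum(w * u[i] for w, u in zip(weights, utterances)) / total
        for i in range(dims)
    ]


# Hypothetical dictionary for one speaker, updated with a newly trained word.
dictionary = {}
dictionary["yes"] = train_template([[1.0, 0.0], [3.0, 2.0]], weights=[3.0, 1.0])
```

A speaker-dependent system of the kind described would keep one such dictionary per speaker.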