In communication, data processing and similar systems, it is often desirable to use audio interface arrangements. Speech input and synthesized voice output may be utilized for inquiries, commands and the exchange of data and other information. Speech type interfacing permits communication with data processor type equipment from remote locations without requiring manually operated terminals and allows concurrent performance of other functions by the user. The complexity of speech patterns and variations therein among speakers, however, makes it difficult to obtain accurate recognition. While acceptable results have been obtained in specialized applications restricted to particular individuals and constrained vocabularies, the inaccuracy of speaker-independent recognition has limited its utilization.
In general, speech recognition arrangements are adapted to transform an unknown speech pattern into a sequence of prescribed acoustic feature signals. These feature signals are then compared to previously stored sets of acoustic feature signals representative of identified reference patterns. As a result of the comparison, the unknown speech pattern is identified as the closest matching reference pattern in accordance with predetermined recognition criteria. The accuracy of such recognition systems is highly dependent on the selected features and the recognition criteria. The comparison between the input speech pattern feature sequence and a reference sequence may be direct. It is well known, however, that speech rate and articulation are highly variable, so that a direct comparison may pair portions of the two patterns that do not correspond.
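The direct comparison described above may be illustrated by the following minimal sketch. It assumes that each pattern is a sequence of fixed-length feature vectors, that the recognition criterion is the smallest average Euclidean distance, and that all sequences have the same number of frames. The function names, the two-word vocabulary and the feature values are hypothetical and are not taken from any particular prior art arrangement.

```python
import math


def direct_distance(unknown, reference):
    """Average Euclidean distance between frames at the same time index."""
    if len(unknown) != len(reference):
        # A direct comparison has no way to pair frames of unequal-length patterns.
        raise ValueError("direct comparison needs equal-length sequences")
    total = sum(math.dist(u, r) for u, r in zip(unknown, reference))
    return total / len(unknown)


def recognize(unknown, references):
    """Return the label of the reference pattern closest to the unknown pattern."""
    return min(references, key=lambda label: direct_distance(unknown, references[label]))


# Hypothetical stored reference patterns (illustrative values only).
references = {
    "yes": [(1.0, 0.0), (0.8, 0.2), (0.1, 0.9)],
    "no": [(0.0, 1.0), (0.2, 0.8), (0.9, 0.1)],
}
unknown = [(0.9, 0.1), (0.7, 0.3), (0.2, 0.8)]
best = recognize(unknown, references)
```

The length check makes the limitation explicit: an utterance spoken faster or slower than the stored reference yields a different number of frames and cannot be compared frame by frame.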
Some prior art recognition schemes employ dynamic programming to determine an optimum alignment between patterns in the comparison process. In this way, the effects of differences in speech rate and articulation are mitigated. The signal processing arrangements for dynamic time warping and comparison are complex and time-consuming, since the time needed for recognition is a function of the size of the reference vocabulary and the number of reference feature templates for each vocabulary word. As a result, speaker-independent recognition for vocabularies on the order of 50 words is difficult to achieve in real time.
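A dynamic programming alignment of the kind referred to above may be sketched as follows. This is the textbook form of dynamic time warping with a Euclidean local distance and no path constraints, given for illustration only. It is not asserted to be the arrangement of any particular prior art scheme, and the function name and example sequences are hypothetical.

```python
import math


def dtw_distance(a, b):
    """Accumulated distance along the best monotonic alignment of a and b."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best accumulated distance aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = local + min(
                cost[i - 1][j],      # frame of a repeated against the same frame of b
                cost[i][j - 1],      # frame of b repeated against the same frame of a
                cost[i - 1][j - 1],  # both patterns advance together
            )
    return cost[n][m]


# The same hypothetical one-dimensional feature track spoken slowly and quickly.
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]
aligned = dtw_distance(slow, fast)
```

Each call fills a table of len(a) by len(b) entries, and one call is needed for every stored template of every vocabulary word, which is why the recognition time grows with the vocabulary size and with the number of templates per word.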
Another approach to speech recognition is based on probabilistic Markov models that utilize sets of states and state transitions based on statistical estimates. Speaker-dependent recognition arrangements have been devised in which spectral feature sequences are generated and evaluated in a series of hierarchical Markov models of features, words and language. The feature sequences are analyzed in Markov models of phonemic elements. The models are concatenated into larger acoustic elements, e.g., words. The results are then applied to a hierarchy of Markov models, e.g., syntactic contextual, to obtain a speech pattern identification. The use of concatenated phonemic element models and the complexity involved in unrestricted hierarchical Markov model systems, however, require substantial training of the system by the identified speakers to obtain a sufficient number of model tokens to render the Markov models valid. It is an object of the invention to provide improved automatic speech recognition based on probabilistic modeling that is not speaker-dependent and is operable at higher speed.
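By way of illustration, the evaluation of a feature sequence against a probabilistic Markov model may be sketched with the standard forward algorithm for a hidden Markov model having discrete observation symbols. The foregoing description does not specify this procedure. The two-state models, the probability values and the names "A" and "B" below are assumptions made solely for the example.

```python
def forward_likelihood(observations, initial, transition, emission):
    """Probability that the model generates the observed symbol sequence."""
    if not observations:
        raise ValueError("need at least one observation")
    n_states = len(initial)
    # alpha[s]: probability of the symbols seen so far, ending in state s.
    alpha = [initial[s] * emission[s][observations[0]] for s in range(n_states)]
    for obs in observations[1:]:
        alpha = [
            emission[t][obs] * sum(alpha[s] * transition[s][t] for s in range(n_states))
            for t in range(n_states)
        ]
    return sum(alpha)


# Two hypothetical left-to-right word models over a two-symbol feature alphabet.
left_to_right = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "A": ([1.0, 0.0], left_to_right, [[0.9, 0.1], [0.1, 0.9]]),
    "B": ([1.0, 0.0], left_to_right, [[0.1, 0.9], [0.9, 0.1]]),
}
observed = [0, 0, 1]
scores = {name: forward_likelihood(observed, *params) for name, params in models.items()}
best = max(scores, key=scores.get)
```

The initial, transition and emission probabilities are the statistical estimates that must be obtained from training tokens, which is the source of the training burden noted above.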