In communication, data processing and other information systems, it is desirable to provide speech recognition input and synthesized voice output for inquiries, commands, and exchange of information. Such speech interface facilities permit interaction with data processing equipment from remote locations without expensive terminal equipment and allow a user to communicate with computer-type devices in a natural manner without manually operated keyboards. While the advantages of speech interface facilities are well known, providing the speech recognition accuracy required for commercial use has presented formidable technical problems. Accurate speech recognition is relatively difficult to achieve because of the complexity of speech patterns and variations thereof among speakers. Acceptable results have been obtained in specialized applications where the recognition is restricted to particular individuals using constrained vocabularies. The success of automatic speech recognition equipment, however, is very limited where there is no restriction on the number of speakers or where the vocabulary of speech patterns to be identified is large.
Speech recognition arrangements generally are adapted to convert an unknown speech pattern to a sequence of prescribed acoustic features which is then compared to stored sets of acoustic feature sequences representative of previously identified speech patterns. As a result of the comparison, the unknown speech pattern may be identified as the stored set having the most similar acoustic feature sequence on the basis of predetermined recognition criteria. Recognition accuracy of such systems is highly dependent on the acoustic features that are prescribed and the recognition criteria used.
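By way of illustration only, the comparison just described may be sketched as a nearest-reference search. The sketch assumes feature sequences of equal length, a frame-by-frame Euclidean distance as the recognition criterion, and two made-up reference patterns; the names sequence_distance and identify are hypothetical and are not drawn from any particular recognizer.

```python
import math


def sequence_distance(a, b):
    """Sum of frame-by-frame Euclidean distances between two feature sequences.

    Assumption: both sequences have the same number of frames, so no time
    alignment is performed here.
    """
    if len(a) != len(b):
        raise ValueError("direct comparison needs sequences of equal length")
    return sum(math.dist(x, y) for x, y in zip(a, b))


def identify(unknown, references):
    """Return the label of the stored reference most similar to the unknown."""
    return min(
        references,
        key=lambda label: sequence_distance(unknown, references[label]),
    )


# Illustrative two-dimensional features; real systems use spectral features.
references = {
    "yes": [(1.0, 0.0), (0.8, 0.2), (0.1, 0.9)],
    "no": [(0.0, 1.0), (0.2, 0.8), (0.9, 0.1)],
}
unknown = [(0.9, 0.1), (0.7, 0.3), (0.2, 0.8)]
print(identify(unknown, references))
```

In this sketch the choice of features and of distance measure fully determines the outcome, which is why recognition accuracy depends so strongly on both.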
The comparison between an unknown speech pattern and the stored reference sets may be direct or may be adjusted to take into account differences in speaking rate and differences in articulation. Some speech recognition systems employ dynamic programming to determine the optimum alignment between patterns. Such dynamic time warping mitigates the effects of variations in speech rate and articulation on recognition accuracy. The signal processing arrangements for dynamic time warp comparisons, however, are complex and the time needed for recognition of a speech pattern is a function of the size of the reference pattern vocabulary as well as the speed of operation of the recognition equipment. Where the recognition is speaker independent, the number of reference patterns is very large so that real time recognition of a pattern for vocabularies of the order of 50 words is difficult to achieve with acceptable accuracy.
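A minimal sketch of a dynamic time warp comparison follows, assuming a Euclidean distance between frames and the basic symmetric step pattern (match, insertion, deletion). The function name dtw_distance and the sample sequences are illustrative assumptions, and practical systems typically add slope constraints and path normalization that are omitted here.

```python
import math


def dtw_distance(a, b):
    """Cost of the optimum alignment between feature sequences a and b.

    cost[i][j] holds the minimum accumulated distance for aligning the
    first i frames of a with the first j frames of b.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            frame_distance = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = frame_distance + min(
                cost[i - 1][j],      # a advances while b holds its frame
                cost[i][j - 1],      # b advances while a holds its frame
                cost[i - 1][j - 1],  # both advance together
            )
    return cost[n][m]


# The same pattern spoken slowly and quickly, plus a different pattern.
slow = [(0.0,), (0.0,), (1.0,), (1.0,), (2.0,), (2.0,)]
fast = [(0.0,), (1.0,), (2.0,)]
other = [(2.0,), (1.0,), (0.0,)]
print(dtw_distance(slow, fast), dtw_distance(slow, other))
```

Each comparison fills a table whose size is the product of the two sequence lengths, and the table must be filled once per stored reference pattern, which illustrates why recognition time grows with the size of the reference vocabulary.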
Another approach to automatic speech recognition uses probabilistic modeling, e.g., Markov models, in which the sequence of acoustic features of a speech pattern is patterned into a series of transitions through a set of states based on statistical estimates. Speaker dependent recognition arrangements such as described in the article, "The DRAGON System-An Overview", by James K. Baker, appearing in the IEEE Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-23, No. 1, February 1975, pp. 24-29, have been devised in which acoustic spectral feature sequences corresponding to speech patterns are generated and evaluated in a series of hierarchical Markov models of acoustic features, words and language. The acoustic feature sequences are analyzed in Markov models of phonemic elements. The models are concatenated into larger acoustic elements such as words and the results are then processed in a hierarchy of Markov models, e.g., syntactic, contextual, to obtain a speech pattern identification. The use of concatenated phonemic element models and the complexity involved in unrestricted hierarchical Markov modeling, however, require many hours of system training by each identified speaker to obtain a sufficient number of model tokens to render the Markov models valid.
A speaker independent recognition system described in the article, "On the Application of Vector Quantization and Hidden Markov Models to Speaker-Independent, Isolated Word Recognition", by L. R. Rabiner, S. E. Levinson, and M. M. Sondhi, appearing in The Bell System Technical Journal, Vol. 62, No. 4, April 1983, pp. 1075-1105, employs a relatively simple Markov model having a restricted number of states and state transitions. Advantageously, this speaker independent arrangement reduces the complexity of recognition processing so that the speed of identification of a speech pattern is less dependent on vocabulary size and the capabilities of the processing devices. As a result, real time recognition is obtained.
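The evaluation of a speech pattern against a Markov model having a restricted number of states may be sketched as follows. The sketch assumes that each frame has already been vector quantized to a discrete symbol index and uses the standard forward recursion to score the symbol sequence against each word model. The two-state left-to-right models, their probabilities, and the names forward_likelihood and recognize are illustrative assumptions and are not taken from the cited article.

```python
def forward_likelihood(obs, initial, transition, emission):
    """Probability of a symbol sequence under a discrete hidden Markov model.

    obs        -- list of quantized symbol indices, one per frame
    initial    -- initial[s] is the probability of starting in state s
    transition -- transition[p][s] is the probability of moving from p to s
    emission   -- emission[s][k] is the probability of symbol k in state s
    """
    if not obs:
        raise ValueError("observation sequence must not be empty")
    n_states = len(initial)
    alpha = [initial[s] * emission[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * transition[p][s] for p in range(n_states))
            * emission[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def recognize(obs, models):
    """Return the label of the word model that best explains the sequence."""
    return max(models, key=lambda label: forward_likelihood(obs, *models[label]))


# Left-to-right structure: state 0 may stay or advance; state 1 only stays.
initial = [1.0, 0.0]
transition = [[0.5, 0.5], [0.0, 1.0]]
models = {
    "word_a": (initial, transition, [[0.9, 0.1], [0.1, 0.9]]),
    "word_b": (initial, transition, [[0.1, 0.9], [0.9, 0.1]]),
}
obs = [0, 0, 1, 1]
print(recognize(obs, models))
```

The work per frame in this sketch depends only on the number of states and permitted transitions in each model, not on the number of training utterances behind it, which is consistent with the reduced recognition complexity noted above. Long sequences would in practice call for scaling or log probabilities to avoid numerical underflow.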
While speech recognition processing may be simplified using Markov modeling, the generation of the signals that form the models of reference patterns to which an unknown pattern is compared is complex, time consuming, and subject to inaccuracies. These factors have inhibited the practical application of Markov model speech recognition. It is an object of the invention to provide improved automatic speech recognition based on Markov modeling that includes faster and more accurate model formation.