Our invention relates to pattern recognition and more particularly to arrangements for automatically recognizing a continuous speech pattern as a series of words.
In communication, data processing, and control systems, it is often desirable to use speech as a direct input for inquiries, commands, data or other information. Speech recognition devices obviate the need for manually operated terminal equipment and permit individuals to interact with automated equipment while simultaneously engaging in other activities. The variability of speech patterns from speaker to speaker and even for a particular speaker, however, has limited the accuracy of speech recognition. As a result, speech recognition arrangements have been most successful in specially designed environments.
Speech recognition systems are generally adapted to transform input speech signals into sets of prescribed acoustic features. The acoustic features of the input speech signals are compared to stored sets of previously obtained acoustic features of identified reference words. The speech signal is identified when the input speech features match the stored features of a particular reference word sequence in accordance with predetermined recognition criteria. The accuracy of such recognition systems is highly dependent on the selected features and on the prescribed recognition criteria. Best results are obtained when the reference features and the input speech features are derived from the same individual and the input speech pattern to be recognized is spoken with distinct pauses between individual words.
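By way of illustration only, the comparison of input features against stored reference features may be sketched as follows. The fixed-length feature vectors, the Euclidean distance measure, the threshold criterion, and the names `distance`, `recognize`, and `references` are all assumptions made for this sketch; they are not taken from any of the arrangements discussed herein.

```python
import math

def distance(a, b):
    """Euclidean distance between two equal-length feature vectors (assumed measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def recognize(features, references, threshold):
    """Return the label of the closest stored reference word, or None when
    no reference lies within the threshold (the assumed recognition criterion)."""
    best_label, best_dist = None, float("inf")
    for label, ref in references.items():
        d = distance(features, ref)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None

# Hypothetical stored reference features for two identified words.
references = {"one": [1.0, 0.0, 0.5], "two": [0.0, 1.0, 0.5]}

print(recognize([0.9, 0.1, 0.5], references, threshold=0.5))
print(recognize([5.0, 5.0, 5.0], references, threshold=0.5))
```

In this sketch the threshold plays the part of the predetermined recognition criterion: an input that is not sufficiently close to any stored reference is left unidentified rather than forced onto the nearest word.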
Recognition of continuous speech patterns may be accomplished by comparing the sequence of input speech features with every possible combination of reference word feature signal patterns derived from continuous speech. Such arrangements, however, require time-consuming testing of all possible reference word pattern combinations, that is, an exhaustive search through the large number of reference word combinations. As is well known, the number of possible sequences increases exponentially with the number of words in the series. Consequently, it is generally impractical to perform the exhaustive search even for a limited number of words in a speech pattern.
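The growth in the number of candidate sequences can be made concrete with a short sketch. It assumes a vocabulary of V reference words and a series of exactly N words with repetition allowed, so that the number of sequences is V raised to the power N; the function names and the example vocabulary are hypothetical.

```python
from itertools import product

def count_sequences(vocab_size, series_length):
    """Number of distinct word series of a given length, with repetition allowed."""
    return vocab_size ** series_length

def enumerate_sequences(vocabulary, series_length):
    """Yield every possible word series; practical only for very small inputs."""
    return product(vocabulary, repeat=series_length)

# Ten spoken digits, series of 1 to 7 words.
for n in range(1, 8):
    print(n, count_sequences(10, n))

# Exhaustive enumeration for a two-word vocabulary and a three-word series.
print(len(list(enumerate_sequences(["yes", "no"], 3))))
```

Under these assumptions a seven-digit string already admits ten million candidate sequences, each of which an exhaustive search would have to compare against the input speech features.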
Semantic and syntactic rules may be devised to limit the number of possible sequences in a search so that certain classes of information can be readily analyzed. U.S. Pat. No. 4,156,868, issued to S. E. Levinson, May 29, 1979, and assigned to the same assignee, for example, discloses a recognition arrangement based on syntactic analysis in which an input speech pattern is compared to only syntactically possible reference patterns. But recognition of sequences of unrelated spoken words such as a series of spoken numbers is not improved by resorting to such contextual constraints.
U.S. Pat. Nos. 4,049,913 and 4,059,725 disclose continuous speech recognition systems in which the similarities between individual reference word feature patterns and the features of all possible intervals of the input speech pattern are calculated. Partial recognition results are derived for each reference word feature pattern from the similarity measures. Both the partial similarity measures and the partial recognition results are stored in a table. The recognized results from the table are extracted to provide the reference word series corresponding to the input speech pattern. All possible partial pattern combinations from the table which form continuous patterns are selected. The selected pattern for which the similarity is maximum is then chosen. While these systems have been effective in continuous speech recognition, the signal processing to obtain reference patterns and partial pattern similarity measures is exceedingly complex and uneconomical for many applications.
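The table-based selection just described may be pictured with the following minimal sketch. It is not the method of the cited patents; it merely assumes that the table holds, for each interval of the input pattern, a best-matching word and an additive similarity score, and it selects the chain of contiguous intervals covering the whole pattern with the greatest total similarity. The table layout, the integer scores, and the name `best_word_series` are hypothetical.

```python
def best_word_series(table, total_frames):
    """Select the highest-scoring chain of contiguous intervals.

    table maps (start, end) -> (word, similarity), with start < end,
    where frame indices run from 0 to total_frames.
    Returns (total_similarity, [words]) or None if no chain covers the input.
    """
    # best[t] holds the best (score, words) for a chain covering frames 0..t.
    best = {0: (0, [])}
    for end in range(1, total_frames + 1):
        for (s, e), (word, sim) in table.items():
            if e != end or s >= e or s not in best:
                continue
            score = best[s][0] + sim
            if end not in best or score > best[end][0]:
                best[end] = (score, best[s][1] + [word])
    return best.get(total_frames)

# Hypothetical partial results for a five-frame input pattern.
table = {
    (0, 2): ("three", 90),
    (2, 5): ("five", 80),
    (0, 3): ("two", 70),
    (3, 5): ("nine", 60),
    (0, 5): ("seven", 100),
}

print(best_word_series(table, 5))
```

Even in this simplified form, a similarity must be available for every interval and every reference word before the chain can be selected, which suggests why the signal processing burden of such table-based systems is considerable.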
U.S. patent application Ser. No. 138,647 of F. C. Pirz and L. R. Rabiner, filed Apr. 8, 1980, and assigned to the same assignee, discloses a continuous speech analyzer adapted to recognize an utterance as a series of reference words for which acoustic feature signals are stored. Responsive to the utterance and reference word acoustic features, at least one reference word series is generated as a candidate for the utterance. Successive word positions for the utterance are identified. In each word position, partial candidate series are generated by determining the utterance segments corresponding to reference words and combining reference words having a prescribed similarity to the utterance segments with selected partial candidate series of the preceding word position. The determined utterance segments are permitted to overlap a predetermined range of the utterance segment for the preceding word position candidate series to account for coarticulation and differences between acoustic features of the utterance and those for reference words spoken in isolation.
The last-mentioned arrangement significantly reduces the signal processing complexity by selecting particular candidate partial word series for each successive interval of the unknown utterance and also improves recognition in the presence of coarticulation. The selection of certain candidates at each word position, however, precludes other possible reference word series candidates from consideration as the recognition progresses through each word position. Consequently, the accuracy of utterance recognition is limited for longer utterances. It is an object of the invention to provide improved recognition of continuous speech patterns with limited signal processing requirements.