In various types of communication and data processing systems, it is advantageous to use speech interface arrangements for inquiries, commands, and exchange of data and other information. The complexity of speech patterns and variations therein among speakers, however, makes it difficult to construct satisfactory automatic speech recognition equipment. While acceptable results have been obtained in special applications restricted to particular individuals and constrained vocabularies, the limited accuracy of automatic speech recognizers has so far precluded wider utilization.
In general, automatic speech recognition arrangements are adapted to transform an unknown speech pattern into a frame sequence of prescribed acoustic features. These acoustic features are then compared to previously stored sets of acoustic features representative of identified reference patterns. The unknown speech pattern is identified as the closest matching reference pattern. The accuracy of identification is highly dependent on the features that are selected and the recognition criteria used in the comparisons. Where a large vocabulary of reference patterns is used, the storage requirements for the reference pattern features and the signal processing needed for comparison result in expensive data processing equipment and long delays in pattern recognition. It is well recognized that a speech pattern is a concatenation of a definite number of subunits and that a large vocabulary of reference patterns such as words or phrases may be replaced by a much smaller number of speech subunits such as syllables or demisyllables without affecting the speech recognition process. As is well known in the art, segmentation of a speech pattern into syllabic units permits the use of a very small vocabulary of stored patterns to recognize an unlimited variety of speech patterns.
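The comparison of an unknown frame sequence against stored reference patterns, as outlined above, may be illustrated by the following minimal sketch. It is not taken from any cited arrangement: the use of dynamic time warping as the matching criterion, the two-element feature vectors, and the names frame_distance, dtw_distance and recognize are all assumptions made for illustration only.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Dynamic time warping cost between two frame sequences.

    DTW is an assumed matching criterion; the text only requires some
    measure of closeness between feature sequences.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def recognize(unknown, references):
    """Return the label of the stored reference closest to `unknown`."""
    return min(references, key=lambda label: dtw_distance(unknown, references[label]))


# Hypothetical reference patterns, each a short sequence of feature frames.
references = {
    "yes": [[1.0, 0.0], [2.0, 0.5], [1.0, 0.0]],
    "no": [[0.0, 1.0], [0.0, 2.0], [0.5, 1.0]],
}
unknown = [[1.1, 0.1], [1.9, 0.4], [2.0, 0.5], [0.9, 0.0]]
print(recognize(unknown, references))
```

Every stored reference is compared against the unknown pattern, so both storage and computation grow with the size of the vocabulary, which is the cost that a small inventory of syllabic subunits is intended to reduce.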
A syllable may be defined linguistically as a sequence of speech sounds having a maximum or peak of inherent sonority between two minima of sonority. Priorly known arrangements for detecting syllabic segments are relatively complex, require high quality speech signals for segmentation, and have not been completely successful. The article "Automatic Segmentation of Speech into Syllabic Units" by Paul Mermelstein, Journal of the Acoustical Society of America, Vol. 58, No. 4, October, 1975, pp. 880-883, for example, discloses an arrangement in which a modified energy function obtained from a high quality speech signal is transformed into a signal corresponding to human perception of "loudness". A search is made for minima in the "loudness" signal using an artificially generated convex hull function to evaluate energy peaks, depth of minima and time between peaks. Departures from true syllabification are accepted if they are consistent. For example, the "ty" portion of the word "twenty" may not be detected as a single syllable and the fricative "sh" of the word "she" might be segmented as an independent syllable.
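A search for minima against a hull function of the kind described above may be illustrated by the following simplified sketch. It reflects one reading of the convex hull idea, in which the hull rises monotonically to the loudness peak and falls monotonically after it, and a boundary is placed where the loudness dips furthest below the hull. Only the depth of minima is evaluated; the time between peaks is ignored. The function names, the sample values and the threshold min_depth are hypothetical and are not taken from the cited article.

```python
def unimodal_hull(values):
    """Smallest curve >= values that rises to the peak and then falls."""
    peak = max(range(len(values)), key=values.__getitem__)
    hull = list(values)
    for i in range(1, peak + 1):
        # Running maximum from the left edge up to the peak.
        hull[i] = max(hull[i], hull[i - 1])
    for i in range(len(values) - 2, peak - 1, -1):
        # Running maximum from the right edge back to the peak.
        hull[i] = max(hull[i], hull[i + 1])
    return hull


def find_boundaries(loudness, start, end, min_depth):
    """Recursively split [start, end) at the deepest dip below the hull."""
    segment = loudness[start:end]
    if len(segment) < 3:
        return []
    hull = unimodal_hull(segment)
    depths = [h - v for h, v in zip(hull, segment)]
    k = max(range(len(depths)), key=depths.__getitem__)
    # Stop when no dip is deep enough to count as a syllable boundary.
    if depths[k] <= 0 or depths[k] < min_depth:
        return []
    split = start + k
    return (find_boundaries(loudness, start, split, min_depth)
            + [split]
            + find_boundaries(loudness, split, end, min_depth))


# Hypothetical loudness contour with three peaks separated by two dips.
loudness = [1, 5, 9, 5, 2, 6, 10, 6, 1, 4, 8, 3]
boundaries = find_boundaries(loudness, 0, len(loudness), min_depth=3.0)
print(boundaries)
```

In this sketch a shallow dip inside a vowel falls below the threshold and is ignored, whereas a strong fricative that forms its own loudness peak would still be split off as a separate segment.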
In the system disclosed in "An Approach to Speech Recognition Using Syllabic Decision Units" by G. Ruske and T. Schotola appearing in the Proceedings of the Conference on Acoustics, Speech and Signal Processing, Tulsa, Okla., 1978, pp. 722-725, a speech signal is preprocessed to develop 22 specific loudness functions covering the frequency range of 70 Hz to 10 kHz arranged on a critical band rate scale. A modified and smoothed function is formed corresponding to the weighted sum of all 22 loudness functions to emphasize the middle and reduce the high portions of the frequency range. The modified function suppresses the influence of high energy fricatives and increases the loudness gain of vowels. The aforementioned techniques are adapted to provide syllabic segmentation with high quality speech and require extensive signal processing to deal with the effects of fricatives and other characteristics of speech patterns. There are, however, many applications for speech recognition where only limited quality speech signals are available, e.g., telephone connections, and the real time response requirement precludes prolonged segmentation processing. It is an object of my invention to provide improved syllabic segmentation in automatic speech analysis systems with limited quality speech patterns.