This invention relates to a time domain speech recognition system which utilizes the zero crossing pattern of infinitely clipped voice signals to determine and classify speech sounds.
In the past, automatic recognition of speech signals has been beset by a fundamental problem, namely, the difficulty of abstracting from a complex input speech signal those parameters which are necessary for the recognition of the speech signal. The inability to overcome this difficulty has led to recognition systems which are unnecessarily complex, inefficient and error prone. Thus, despite the intense interest in the automatic recognition of speech signals and the great body of literature written on this subject, none of the systems developed to date has been successful enough to be commercially practicable.
Speech recognition systems suffer from two major difficulties: the first is the wide difference in individual speech characteristics, and the second is that an increase in system vocabulary typically requires a correspondingly substantial increase in the system hardware needed for deciphering the speech signal. With respect to the variance in the characteristics of individual speakers, a number of experimental systems have been developed which perform well when the sounds of a single speaker are detected. In one prior art system, word recognition is based on digital autocorrelation analysis followed by computer pattern matching. The speech signal is split into two frequency bands, and the signals in each band are then quantized into two amplitude levels, autocorrelated, and delivered to a computer for identification with respect to a predetermined pattern. Despite a severe vocabulary restriction, i.e., ten words, recognition accuracies for individual speakers varied from 78% to 90%. When three speakers were tested in the system, the accuracy dropped to 57%.
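By way of illustration only, the two-level quantization and autocorrelation steps of the prior art system described above might be sketched as follows. This is a minimal sketch under stated assumptions and is not the implementation of that system: the function names `two_level_quantize` and `autocorrelation` are hypothetical, the splitting of the signal into two frequency bands is omitted, and the normalization by the number of overlapping samples is an assumption.

```python
def two_level_quantize(samples):
    """Quantize each sample into one of two amplitude levels, +1 or -1.

    Assumption: a sample of exactly zero is assigned to the +1 level.
    """
    return [1 if s >= 0 else -1 for s in samples]


def autocorrelation(sequence, max_lag):
    """Return the autocorrelation of `sequence` for lags 0..max_lag.

    Each value is the mean product of the sequence with a copy of itself
    shifted by `lag` samples (normalization is an assumption).
    """
    n = len(sequence)
    if max_lag >= n:
        raise ValueError("max_lag must be smaller than the sequence length")
    result = []
    for lag in range(max_lag + 1):
        overlap = n - lag
        total = sum(sequence[i] * sequence[i + lag] for i in range(overlap))
        result.append(total / overlap)
    return result
```

In such a sketch, the resulting list of autocorrelation values for each band would be the pattern delivered to the computer for comparison against the predetermined patterns.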
In a second prior art system, which produced slightly better results, a low Q dispersive delay line was used as a model of the human cochlea. In this system vowel sounds were investigated, and the system's accuracy approached a reasonable 90% limit only when male speakers were tested. None of the prior art systems, however, has been capable of detecting and recognizing the speech of a wide variety of people including both men and women.
With respect to the second major difficulty with speech recognition systems, namely, that system hardware increases substantially with vocabulary, systems have been designed to recognize syllables or words rather than individual speech sounds in order to reduce system complexity. Thus, for example, in a speech recognition system which was designed to recognize digits, it was not necessary to differentiate precisely between the vowel sound o as in "oh" and the vowel sound ee as in "eeh" since "eeh" is not one of the words which the system is designed to detect. As long as "ee" does not correlate closely with one of the vocabulary words "one" through "nine," the machine can either define it as "oh" or reject it altogether as undefinable. Whichever alternative the machine elects, the necessity for precise differentiation between o and ee is circumvented. Such a system also obviates the problem that individual speech sounds appear to have different characteristics depending upon their phonetic context. These systems are still limited to a small vocabulary because reasonable accuracy has been difficult to attain and because of the extensive hardware required for recognizing more than a limited vocabulary.
The advent of high speed digital computers has alleviated a third major problem inherent in many of the prior art speech recognition systems, namely, the problem of real time operation of the recognition system. Historically the predominant approach to speech recognition has been via the frequency domain, either by investigating the frequency spectrum of the speech signal directly or by tracking only the peaks of the spectral energy distribution of the signal with respect to time. In either case, the recognition system must usually perform either short-time Fourier transformations on the signal or autocorrelation and cross-correlation calculations in the pattern comparison and matching phases. These calculations are difficult to perform in real time because high speed computers are necessary to perform the extensive calculations as rapidly as the speech sounds are generated.
Relatively few investigations have dealt with the temporal structure of the speech signal. From the earliest investigations through the development of the speech spectrograph and Vocoder to the most recent systems, the emphasis has been almost exclusively on spectral analysis of the speech signal. The research dealing with such temporal speech signal properties as the rate of zero crossings thereof often treats such properties merely as a reflection of the frequency domain properties of the signal. It has now been discovered, however, that the analysis of the distribution or pattern of zero crossings of a signal, the relationships among the adjacent intervals between zero crossings, and voice pitch, together with pitch synchronous sampling of the speech signal, can lead to an accurate means of identifying individual speech sounds. It has further been found that such a method is largely insensitive to individual speaker differences and phonetic context.
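By way of illustration only, the extraction of zero crossing intervals from an infinitely clipped signal, and of the relationships among adjacent intervals, might be sketched as follows. This is a minimal sketch under stated assumptions and is not the apparatus of the invention: all function names are hypothetical, the signal is assumed to be available as uniformly spaced samples, the relationship between adjacent intervals is assumed here to be a simple ratio, and pitch detection and pitch synchronous sampling are omitted.

```python
import math


def infinite_clip(samples):
    """Keep only the sign of each sample (+1 or -1).

    Assumption: a sample of exactly zero is treated as positive.
    """
    return [1 if s >= 0 else -1 for s in samples]


def zero_crossing_indices(samples):
    """Return the index of each sample whose sign differs from the previous one."""
    clipped = infinite_clip(samples)
    return [i for i in range(1, len(clipped)) if clipped[i] != clipped[i - 1]]


def crossing_intervals(samples, sample_rate):
    """Return the time in seconds between successive zero crossings."""
    idx = zero_crossing_indices(samples)
    return [(b - a) / sample_rate for a, b in zip(idx, idx[1:])]


def adjacent_interval_ratios(intervals):
    """Return the ratio of each interval to the interval preceding it.

    A ratio is one assumed way of expressing the relationship between
    adjacent intervals; the text above does not specify the measure.
    """
    return [b / a for a, b in zip(intervals, intervals[1:])]


def demo_tone(frequency=100.0, sample_rate=8000, count=400, phase=0.3):
    """Hypothetical test signal: a pure tone with a small phase offset."""
    return [
        math.sin(2.0 * math.pi * frequency * n / sample_rate + phase)
        for n in range(count)
    ]
```

For a pure tone such as that produced by `demo_tone`, every interval equals half the period of the tone and every adjacent ratio is unity; a speech sound would instead yield a varying pattern of intervals, and it is that pattern which the foregoing paragraph identifies as the basis for recognition.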
In view of the foregoing it is an object of this invention to provide an accurate time domain speech recognition system which is capable of recognizing the speech sounds generated by individuals having a wide variety of speech characteristics.