In order to create speech, recognize speech, or encode and decode speech for transmission, it is desirable to be able to convert the complex waveforms of human speech into digital representations thereof, and back again, as efficiently as possible so that the digital portions of the system need handle only the lowest possible data rates. Cost and technical limitations always demand that the digital data represent just the essential information. For example, if speech is to be generated, data must be stored in memory, and the size of that memory should be minimized. Another example is speech recognition systems, in which data must be analyzed, preferably in real time, as fast as the speech is produced by a human, and there is always a practical limitation on the processing power that can be devoted to this task. Digital transmission, of course, requires both digital encoding, which is a kind of recognition, and decoding at the other end, which is a kind of generation.
Since human speech is full of redundant information, the prior art has developed a number of ways to extract only the minimum essential content of the speech for conversion to digital form. The reader is directed to page 28 of the October 1973 issue of the IEEE Spectrum for an article summarizing a variety of these techniques. One technique for coding only the essential information involves calculating a linear predictive coefficient and encoding that coefficient (K) digitally rather than trying to encode the actual analog waveform that makes up the speech. In this way, only the perceptually significant properties of the waveform are preserved.
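The idea of encoding a predictive coefficient in place of the waveform can be illustrated with a minimal sketch. The text above does not state the predictor order or how K is calculated, so the following is only a textbook first-order, least-squares predictor. The function names, the 8 kHz sampling rate, and the 200 Hz test tone are all assumptions made for illustration.

```python
import math

def first_order_predictor(samples):
    """Least-squares coefficient k for the prediction x[n] ~ k * x[n-1]."""
    num = sum(samples[n] * samples[n - 1] for n in range(1, len(samples)))
    den = sum(samples[n - 1] ** 2 for n in range(1, len(samples)))
    return num / den if den else 0.0

def residual_energy(samples, k):
    """Energy left over after subtracting the prediction from each sample."""
    return sum((samples[n] - k * samples[n - 1]) ** 2
               for n in range(1, len(samples)))

# Hypothetical test signal: a smooth 200 Hz tone sampled at an assumed 8 kHz.
RATE = 8000
tone = [math.sin(2 * math.pi * 200 * n / RATE) for n in range(400)]

k = first_order_predictor(tone)
signal_energy = sum(s * s for s in tone[1:])
error_energy = residual_energy(tone, k)

print(f"k = {k:.4f}")
print(f"residual / signal energy = {error_energy / signal_energy:.4f}")
```

For a smooth waveform such as this, the single coefficient accounts for most of the signal energy, which is why encoding the coefficient can take far fewer bits than encoding every sample. Practical linear predictive coders use several coefficients per frame.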
Human speech basically consists of voiced and unvoiced components. Voiced sounds are produced by the vibrations of the vocal cords and comprise reasonably smooth and harmonic waveforms of a variety of frequencies. Subgroups of these frequencies are emphasized by the resonant characteristics of the vocal tract, which concentrate the power or energy into certain areas of the frequency spectrum. Unvoiced speech, by contrast, is fairly noisy, being produced by air turbulence through narrow constrictions of the lips or tongue. Although resonance conditions concentrate unvoiced speech in a particular frequency area, it tends, at least in that frequency area, to have its power or energy distributed evenly across the frequency area. To digitally represent speech, it is necessary only to identify if the waveform is voiced or unvoiced, that is, if the waveform is smooth and harmonic, or noisy and random, and to identify the subgroups of frequency (called formants) in the voiced portion by their power content. This can be done by converting the analog speech waveform, with a standard, commercially available analog-to-digital converter, into a series of digital numbers representing the magnitude, at any given instant, of the waveform. These numbers are then continuously analyzed to see how predictably and smoothly they change, during a selected time frame, as an indication of which formants are prevalent and whether the signal is voiced or not. This procedure is called autocorrelation, and it is conducted in the time domain as contrasted with the frequency domain.
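The time-domain analysis described above can be sketched as follows. This is a minimal illustration and not the procedure of any particular prior-art system. The frame length, the lag range, the 0.5 decision threshold, the 8 kHz sampling rate, and the function names are all assumptions chosen for the example.

```python
import math
import random

def autocorrelation(frame, max_lag):
    """Return R[0..max_lag], where R[lag] = sum of frame[n] * frame[n + lag]."""
    return [
        sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        for lag in range(max_lag + 1)
    ]

def looks_voiced(frame, min_lag=20, max_lag=160, threshold=0.5):
    """Crude voiced/unvoiced test: a smooth, periodic frame shows a strong
    autocorrelation peak at some nonzero lag; a noisy frame does not."""
    r = autocorrelation(frame, max_lag)
    if r[0] == 0:
        return False  # silent frame: nothing to classify
    peak = max(r[min_lag:max_lag + 1]) / r[0]  # normalize by frame energy
    return peak > threshold

# Hypothetical frames, 400 samples each at an assumed 8 kHz sampling rate.
RATE = 8000
tone = [math.sin(2 * math.pi * 125 * n / RATE) for n in range(400)]  # periodic
rng = random.Random(0)  # fixed seed so the example is repeatable
noise = [rng.gauss(0.0, 1.0) for _ in range(400)]                    # random

print("tone voiced? ", looks_voiced(tone))
print("noise voiced?", looks_voiced(noise))
```

Each lag requires one multiplication per sample in the frame, so the work grows with both the frame length and the number of lags examined.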
The actual analysis involves calculating certain ratios of the numbers, at great speed, as the numbers are received, for each band of frequencies of interest. As such, the calculation is extremely multiplication-intensive and requires very large, very fast computers if one is to keep up with the speech in real time. With the present state of the art, it is therefore not practical to convert speech from analog to this encoded digital form on a real-time basis. My invention solves this computational bottleneck as described hereinafter.