The present invention relates to a system for processing human speech. It may be used either as a research aid, for example in intelligibility testing, or as a speech processor for converting human speech to a more efficient (that is, less redundant) form prior to transmission. When human speech is processed according to the present invention, the resulting signal is a tri-level signal having the advantages of pulse-code-modulated signals for transmission, yet there is little loss of intelligibility. The resulting signal is quantized in amplitude and time, and all noise and the squelching signal are removed between words. In this sense a "word" is a distinct utterance.
In the 1940's it was found that speech distorted in amplitude by peak clipping was still intelligible, though the quality suffered. In addition, intelligible speech could still be obtained under the worst case amplitude distortion--i.e., infinite peak clipping. Infinite peak clipping of speech is obtained by amplifying the speech signal with an infinite gain amplifier and then clipping the signal to a finite level. The speech signal is thus quantized to one of two levels depending on whether the speech signal is above or below zero. Speech processed in this manner is simplified since only "zero crossing" information is preserved.
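The clipping operation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the speech is taken to be already sampled into a list of numbers, a sample of exactly zero is arbitrarily assigned to the positive level, and the function names `infinite_clip` and `zero_crossings` are hypothetical.

```python
import math

def infinite_clip(samples):
    """Reduce each sample to one of two levels according to its sign.

    Assumption: a sample of exactly zero is assigned to the +1 level.
    """
    return [1 if s >= 0 else -1 for s in samples]

def zero_crossings(clipped):
    """Return the indices at which the two-level signal changes level."""
    return [i for i in range(1, len(clipped)) if clipped[i] != clipped[i - 1]]

# A 100 Hz tone sampled at 8000 Hz, at two very different amplitudes.
fs = 8000
loud = [1.00 * math.sin(2 * math.pi * 100 * n / fs + 0.1) for n in range(160)]
quiet = [0.01 * math.sin(2 * math.pi * 100 * n / fs + 0.1) for n in range(160)]

# Amplitude is discarded entirely: both tones clip to the same waveform,
# so only the zero-crossing positions survive.
print(infinite_clip(loud) == infinite_clip(quiet))  # True
print(zero_crossings(infinite_clip(loud)))          # [39, 79, 119, 159]
```

Because the two inputs differ only in amplitude, their clipped versions are identical, which is the sense in which only "zero crossing" information is preserved.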
Infinitely clipped speech can be simplified even further by quantizing it in time. In 1950 it was found that infinitely clipped speech which was allowed to change levels only at discrete time intervals was still highly intelligible provided the quantizing rate (the inverse of the time interval) was great enough.
The importance and usefulness of amplitude-dichotomized and time-quantized human speech is due both to its waveform and to the simplification it affords to speech signals. The resulting waveform is a digital signal and therefore all the benefits of modern and inexpensive digital electronics can be used for further processing or transmission. For example, correlation analysis is very easy, requiring only logical AND functions and counting. Also, the signal is exactly like binary pulse code modulation (PCM) with just two levels of quantization. Therefore it has the same benefits as binary PCM when used in a communication system.
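As a rough illustration of why correlation becomes simple for a two-level signal, the sketch below represents each quantized interval as a bit (1 for the upper level, 0 for the lower) and forms a coincidence count at a given lag using only a logical AND and a counter. The bit representation and the function name `coincidence_count` are assumptions made for this example, not details taken from the text.

```python
def coincidence_count(a, b, lag):
    """Count positions n where a[n] and b[n + lag] are both 1.

    Only a bitwise AND and a running count are needed.
    """
    return sum(x & y for x, y in zip(a, b[lag:]))

# A two-level signal with a period of four intervals.
signal = [1, 1, 0, 0] * 4

print(coincidence_count(signal, signal, 0))  # 8
print(coincidence_count(signal, signal, 2))  # 0
print(coincidence_count(signal, signal, 4))  # 6
```

The count peaks at lags that are multiples of the period and vanishes at a half-period lag, which is the behavior expected of an autocorrelation estimate.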
The simplification of speech resulting from time and amplitude-quantization is due to the reduction of redundancy. This implies that the channel capacity of any system used to send speech signals can also be reduced. It also facilitates the study of those speech parameters necessary for intelligibility.
Licklider and Pollack, two well-known researchers in the field, found that, unlike normal speech, infinitely clipped speech does not have intervals of quiet between the words but instead exhibits noise. This noise, though, did not impair the intelligibility of the speech and was soon masked out by the listeners. They indicated that some "squelch" system should be added in any practical application.
Licklider (1950) showed that speech which was not only amplitude quantized (i.e., infinitely peak clipped), but also quantized in time, was still highly intelligible. The time-quantization was done by allowing the infinitely clipped speech waveform to switch states at the end of the specified time interval according to one of the following rules: Rule A, the output could switch if the input clipped waveform had switched one or more times; Rule B, the output could switch only if the input waveform had switched an odd number of times in the interval. If there are one or more complete pulses (i.e., a return to the start level) in the time interval, Rule A will require the output to switch, while Rule B will require the output to remain unchanged. Rule B may be implemented by a level comparator circuit.
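The two rules can be compared with a short Python sketch. It assumes that the clipped waveform is available as a finely sampled list of +1/-1 values, that each quantizing interval spans a fixed number of those samples, and that "could switch" means the output toggles at the end of the interval. The function name `time_quantize` and its parameters are hypothetical.

```python
def time_quantize(clipped, samples_per_interval, rule):
    """Time-quantize a +1/-1 waveform under Rule A or Rule B.

    Returns one output level per quantizing interval.
    Assumption: the output starts at the level of the first input sample.
    """
    if rule not in ("A", "B"):
        raise ValueError("rule must be 'A' or 'B'")
    out_level = clipped[0]
    prev = clipped[0]
    levels = []
    for start in range(0, len(clipped), samples_per_interval):
        # Count how many times the input switched during this interval.
        switches = 0
        for s in clipped[start:start + samples_per_interval]:
            if s != prev:
                switches += 1
            prev = s
        if rule == "A":
            toggle = switches >= 1      # any switching at all
        else:
            toggle = switches % 2 == 1  # an odd number of switches
        if toggle:
            out_level = -out_level
        levels.append(out_level)
    return levels

# Interval 1: no switches. Interval 2: one complete pulse (two switches).
# Interval 3: a single switch.
clipped = [1, 1, 1, 1,   1, -1, 1, 1,   1, 1, -1, -1]

print(time_quantize(clipped, 4, "A"))  # [1, -1, 1]
print(time_quantize(clipped, 4, "B"))  # [1, 1, -1]
```

In the second interval the complete pulse makes the Rule A output switch while the Rule B output stays put, as described above. Under these assumptions the Rule B output always equals the input level at the end of each interval, which is consistent with the remark that a level comparator suffices.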
In this experiment, Licklider prefiltered the speech with a first-order, high-pass network before clipping to improve intelligibility. He also postfiltered the time-quantized and clipped speech with a first-order low-pass filter to improve the quality. He found that for optimal results, with the time-quantized clipped speech, both filters should have their break points at 1600 Hz.
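For readers who wish to experiment with this kind of pre- and postfiltering on sampled speech, the sketch below gives simple discrete-time approximations of first-order high-pass and low-pass networks. Licklider's filters were analog; the recurrences here, the 48,000 Hz sample rate, and the function names are assumptions made for illustration only.

```python
import math

def first_order_low_pass(x, cutoff_hz, sample_rate_hz):
    """Discrete approximation of a first-order RC low-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = dt / (rc + dt)
    y, state = [], 0.0
    for sample in x:
        state += alpha * (sample - state)
        y.append(state)
    return y

def first_order_high_pass(x, cutoff_hz, sample_rate_hz):
    """Discrete approximation of a first-order RC high-pass filter."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate_hz
    alpha = rc / (rc + dt)
    y = []
    prev_in = prev_out = None
    for sample in x:
        if prev_in is None:
            out = sample  # first sample is passed through unchanged
        else:
            out = alpha * (prev_out + sample - prev_in)
        y.append(out)
        prev_in, prev_out = sample, out
    return y

# Both break points at 1600 Hz, as in the experiment described above.
step = [1.0] * 200
print(first_order_low_pass(step, 1600, 48000)[-1])   # settles near 1.0
print(first_order_high_pass(step, 1600, 48000)[-1])  # decays toward 0.0
```

A constant input passes through the low-pass filter and is rejected by the high-pass filter, which is a quick sanity check on the two recurrences.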
The results of varying the quantizing rate showed that quantizing the clipped speech by Rule B produced more intelligible speech at a lower quantizing rate than by using Rule A. Using Rule B, quantization at about 8000 intervals/sec. yielded articulation scores from 90% to 95%. If the speech was quantized at 4000 intervals/sec. (this would be equivalent to a sampling frequency of 4000 Hz), the intelligibility for Rule B was 50% and for Rule A 27%. At 6000 intervals/sec. Rule B yielded 75% intelligible speech and Rule A 60%. The circuit that produced Rule A quantization was inoperable at about 8000 intervals/sec.
Licklider qualified these results, saying that the processed speech sounded much worse than the intelligibility scores indicated, due in part to the noise between words, and that extensive training of the listeners was needed before the above results were achieved.
Ainsworth (1967) processed infinitely clipped speech such that at the zero crossings of the waveforms a 10 microsecond pulse of either positive or negative value was generated. The infinite peak clipping network consisted of "three peak clipping amplifiers (of about 20 dB amplification each) followed by a Schmitt trigger." The pulses were integrated before being heard by the listeners. He found that infinitely clipped speech and differentiated-clipped speech had 90 to 100% word intelligibility, as Licklider and Pollack had found.
Thomas (1968) studied the influence of the first and second formants on the intelligibility of speech. Thomas found that not all speech formants are equally important. He band-pass filtered speech such that only the first formant frequencies, a band centered around 500 Hz, were left in the spectrograms. The word articulation score after infinite peak clipping was 7.6%. When only the second formant range of frequencies, centered around 1500 Hz, was left (along with residual third formant components), the average word intelligibility after clipping the filtered speech was 71%, with a high of 92% by an experienced listener. The center frequencies of the two band-pass filters were experimentally set for the one speaker who generated the word lists.
Thomas, in his experiment, also added a squelch system which consisted of a 20,000 Hz oscillator and a summing system. The inaudible 20,000 Hz signal was added to the filtered speech signal and its level increased until there was relative quiet between the words. The amplitude of the 20,000 Hz signal was kept small enough so as not to cause appreciable center clipping distortion.
As mentioned, some type of squelching circuit is desirable to mask the noise between words. Thomas added a 20,000 Hz sine wave to the speech signal before the clipping stage. In this system, whenever the signal amplitude falls within the peak-to-peak amplitude of the 20,000 Hz squelch signal, the output of the clipper is an inaudible 20,000 Hz square wave. The amplitude of the squelch is adjusted to be equal to the amplitude of the noise between words. The squelch signal also affects the speech itself, but 0 to 2 dB of center clipping can be tolerated with no effect upon the speech intelligibility.
The squelch system for infinitely clipped speech that is also time-quantized is not as simple as the system just mentioned. For any given quantizing rate the highest frequency square wave obtainable on the output will be one-half the quantizing rate or frequency (two level changes are necessary for a complete output cycle). For example, if the quantizing rate is set at 10,000 intervals/sec. then the highest frequency square wave possible at the output of the quantizer is 5,000 Hz.
Due to the sampling of the waveform by the quantizing process, any type of frequency squelch system will yield a squelch signal on the output whose frequency will be between zero and one-half the quantizing rate. The exact frequency will depend upon the ratio of the squelch to the quantizing frequency. Thus if the quantizing rate is less than 40,000 Hz, then the squelch signal will always be between 0 Hz and 20,000 Hz no matter what the initial squelch frequency was. If it lies between the extremes of this range it will be audible.
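The frequency at which a squelch tone reappears after time-quantization can be estimated with the usual frequency-folding relation for a sampled signal. The sketch below assumes the quantizer behaves as an ideal sampler at the quantizing rate, which is an idealization and not a statement about any particular circuit. The function name `apparent_squelch_frequency` is hypothetical.

```python
def apparent_squelch_frequency(squelch_hz, quantizing_rate_hz):
    """Fold a squelch frequency into the range 0 .. quantizing_rate / 2.

    Assumption: the time quantizer acts as an ideal sampler.
    """
    folded = squelch_hz % quantizing_rate_hz
    return min(folded, quantizing_rate_hz - folded)

# A 20,000 Hz squelch tone with several quantizing rates (intervals/sec).
print(apparent_squelch_frequency(20000, 12000))  # 4000
print(apparent_squelch_frequency(20000, 10000))  # 0
print(apparent_squelch_frequency(20000, 40000))  # 20000

# A squelch at exactly one-half the quantizing rate stays at that frequency,
# the highest square-wave frequency the quantizer can produce.
print(apparent_squelch_frequency(5000, 10000))   # 5000
```

With a quantizing rate of 12,000 intervals/sec., for instance, the nominally inaudible 20,000 Hz tone would reappear at 4,000 Hz, well inside the audible range.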
While a constant tone masking the noise between words is probably more desirable than random noise, the present invention provides for no sound between the words. A tri-level or ternary signal is generated from the binary signal which consists of the infinitely clipped and time-quantized speech and a constant frequency square wave masking the noise between the speech words. The positive value and negative values of the tri-level signal indicate respectively that the signal level is greater than the peak positive level of the squelch signal and that it is below the peak negative level of the squelch signal. The zero level of the tri-level signal indicates that the speech signal has a value within the peak-to-peak range of the squelch signal.
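The meaning of the three output levels can be illustrated with a direct amplitude comparison. This sketch is not the circuit of the invention, which derives the ternary signal from the binary quantized signal. It only shows the intended input-to-output mapping, assuming the speech is available as sampled values and the squelch signal is symmetric about zero with peak amplitude `squelch_peak`. Both names are hypothetical.

```python
def tri_level(speech, squelch_peak):
    """Map each speech sample to +1, 0 or -1 relative to the squelch range.

    +1: above the peak positive squelch level
    -1: below the peak negative squelch level
     0: within the peak-to-peak range of the squelch signal
    """
    out = []
    for s in speech:
        if s > squelch_peak:
            out.append(1)
        elif s < -squelch_peak:
            out.append(-1)
        else:
            out.append(0)
    return out

# Samples during a word (large) and between words (small residual noise).
samples = [0.5, 0.05, -0.05, -0.5, 0.1]
print(tri_level(samples, 0.1))  # [1, 0, 0, -1, 0]
```

Samples that fall inside the squelch range map to the zero level, so the low-level noise between words produces no output.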
It is desirable to have the squelch frequency as high as possible so as not to obscure any high frequency information in the signal. The squelch frequency is set to be exactly one-half the quantizing rate and synchronous with it.
The present invention thus provides for a speech processor which quantizes a speech signal in amplitude and in time to produce a tri-level signal having the characteristics that when the speech signal level is greater than the positive squelch level, the output will be a positive level. When the speech signal level is beneath the maximum negative level of the squelch signal, the output signal will be a negative level, and when the squelch signal exceeds the speech signal, the output signal will be at a zero level. Other features and advantages of the present invention will be apparent to persons skilled in the art from the following detailed description of a preferred embodiment accompanied by the attached drawing.