This invention relates to the problem of time-scale modification of speech and more particularly to a speech analysis/synthesis system based on short-time Fourier analysis. The objective is the development of a high-quality system for changing the rate of speech. The system must preserve such qualities as naturalness, intelligibility, and speaker-dependent features. Furthermore, the system must not introduce the objectionable artifacts often present in vocoded speech. Finally, the system should also be robust to noise, i.e., the performance of the system should not degrade severely if the source speech is corrupted by noise, as might occur in recordings of meetings or court-room proceedings.
The ability to modify the rate of speech is desirable for many reasons. The rate at which speech can be produced is constrained by physiological limitations to a maximum rate of about 110 to 180 wpm. However, the rate at which speech can be comprehended is, typically, about 2 to 3 times this rate. Furthermore, without rate change, the rate of listening is completely paced by the recording and is not controllable by the listener. Consequently, the listener cannot scan or skip sections of the recording in the same manner as scanning printed text, nor can the listener slow down difficult-to-understand portions of the recording that might arise in the context of second language learning, or in listening to degraded speech. Clearly, the ability to modify the rate of (recorded) speech, while retaining its natural quality and intelligibility, would obviate these problems and have numerous applications. The modification of a speech signal such that the resulting signal differs from the original only by its perceived rate of articulation will, henceforth, be referred to as time-scale modification or rate change of speech.
Although there are a number of presently available techniques for changing the rate of speech, they all introduce artifacts which degrade the quality of the processed speech. The most naive approach to time scaling speech is simply to play back recorded speech at a speed different from that at which it was recorded. The problem here is obvious. Even with a small change in speed, spectral distortion is perceptible. As the difference between the record and playback speed is increased, the intelligibility deteriorates rapidly. It is interesting to note that while most people are familiar with this effect, the time-scaled speech is generally described as changed in pitch. Although the pitch is, of course, changed, so is the spectral envelope of the speech. In fact, it is probably the frequency scaling of the spectral envelope and the corresponding shift in formant frequencies that contribute most to the degradation of the speech.
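The frequency scaling caused by a change in playback speed can be illustrated with a short sketch. It is not part of the prior-art systems described here; the function names, the 200 Hz test tone, the 8000 Hz sampling rate, and the use of linear interpolation to simulate faster playback are all assumptions made for illustration.

```python
import math

def make_tone(freq_hz, sample_rate, duration_s):
    """Generate a pure sine tone (a stand-in for one spectral component of speech)."""
    n = int(sample_rate * duration_s)
    return [math.sin(2 * math.pi * freq_hz * i / sample_rate) for i in range(n)]

def change_speed(samples, speed):
    """Simulate playback at `speed` times the recording speed.

    The waveform is read at a faster or slower rate using linear
    interpolation, so duration and frequency content change together.
    """
    out_len = int((len(samples) - 1) / speed)
    out = []
    for i in range(out_len):
        pos = i * speed
        k = int(pos)
        frac = pos - k
        out.append((1 - frac) * samples[k] + frac * samples[k + 1])
    return out

def estimate_freq(samples, sample_rate):
    """Rough frequency estimate from the number of upward zero crossings."""
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return crossings * sample_rate / len(samples)

if __name__ == "__main__":
    tone = make_tone(200.0, 8000, 1.0)
    fast = change_speed(tone, 1.5)
    print("duration ratio:", len(fast) / len(tone))
    print("original frequency estimate:", estimate_freq(tone, 8000))
    print("sped-up frequency estimate:", estimate_freq(fast, 8000))
```

Every component of the signal is scaled by the same factor, which is why the spectral envelope and formant frequencies shift along with the pitch.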
Nearly all algorithms for changing the rate of speech have been based on periodically repeating or discarding sections of the speech waveform. The duration of each section is chosen to be at least as long as one pitch period, but shorter than the length of a phoneme. This technique introduces discontinuities at the section boundaries which are perceived as "burbling" distortion and overall signal degradation.
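A minimal sketch of this section-based approach, under assumed parameters, shows where the discontinuities come from. The function names, the fixed 160-sample section length, and the 110 Hz test tone are hypothetical choices for illustration, not values taken from any particular prior-art system.

```python
import math

def splice_rate_change(samples, section_len, rate):
    """Pitch-independent time scaling by discarding or repeating sections.

    rate > 1 shortens the signal (sections are skipped);
    rate < 1 lengthens it (sections are repeated).
    """
    n_sections = len(samples) // section_len
    n_out = int(n_sections / rate)
    out = []
    for j in range(n_out):
        src = min(int(j * rate), n_sections - 1)  # source section for output section j
        start = src * section_len
        out.extend(samples[start:start + section_len])
    return out

def max_step(samples):
    """Largest jump between adjacent samples, a crude discontinuity measure."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))

if __name__ == "__main__":
    sample_rate = 8000
    tone = [math.sin(2 * math.pi * 110 * i / sample_rate) for i in range(sample_rate)]
    fast = splice_rate_change(tone, 160, 2.0)   # every other section discarded
    slow = splice_rate_change(tone, 160, 0.5)   # every section repeated
    print("lengths:", len(tone), len(fast), len(slow))
    print("max step, original:", max_step(tone))
    print("max step, spliced: ", max_step(fast))
```

Because the section boundaries fall at arbitrary points in the waveform, the spliced signal jumps abruptly where sections meet, even for a perfectly smooth input.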
The most popular refinement of the technique of the previous paragraph is pitch-synchronous implementation. Specifically, for portions of the speech that are voiced, the sections of speech that are repeated or discarded correspond to pitch periods. Although this scheme produces more intelligible speech than the basic asynchronous pitch-independent method, errors in pitch marking and voiced-unvoiced decisions introduce objectionable artifacts. Moreover, since the ear is sensitive to these types of errors, and since pitch marking algorithms are generally sensitive to noise present in the speech, such algorithms would not be expected to be robust for noisy speech. Furthermore, even with no such detection errors, discontinuities may still be introduced.
Perhaps the most successful variant of the prior-art method is one which uses a crude pitch detector, followed by an algorithm which repeats or discards sections of the speech equal in length to the average pitch period, then smooths together the edges of the sections that are retained. Because the method is not pitch synchronous, and therefore does not require pitch marking, it is more robust than pitch-synchronous implementations, yet yields much higher quality than pitch-independent methods.
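One plausible way to smooth the edges is a linear cross-fade between the section that is dropped and the section that follows it. The sketch below assumes this cross-fade and takes the average pitch period as a given number of samples; the source does not specify the smoothing method or the pitch detector, so both are assumptions, as are the function names.

```python
import math

def discard_with_crossfade(samples, start, period):
    """Remove `period` samples beginning at `start`.

    The section being dropped is faded out while the following section is
    faded in, so the output is shorter by `period` samples and the join is
    smooth even if `period` only approximates the true pitch period.
    """
    a = samples[start:start + period]
    b = samples[start + period:start + 2 * period]
    blended = [
        (1 - i / period) * a[i] + (i / period) * b[i]
        for i in range(period)
    ]
    return samples[:start] + blended + samples[start + 2 * period:]

def max_step(samples):
    """Largest jump between adjacent samples, a crude discontinuity measure."""
    return max(abs(b - a) for a, b in zip(samples, samples[1:]))

if __name__ == "__main__":
    # Test tone with a true period of 80 samples; the "detector" reports 76.
    tone = [math.sin(2 * math.pi * i / 80) for i in range(1600)]
    hard = tone[:400] + tone[476:]                  # plain splice, no smoothing
    smooth = discard_with_crossfade(tone, 400, 76)  # same removal, cross-faded
    print("max step, original:  ", max_step(tone))
    print("max step, hard cut:  ", max_step(hard))
    print("max step, cross-fade:", max_step(smooth))
```

The cross-fade tolerates an imprecise period estimate, which is consistent with the robustness attributed above to methods that avoid pitch marking.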
Time-scale modification of speech based on classical vocoder methods is an obvious approach. The speech would be represented by a set of time-varying parameters obtained as the output of the vocoder analyzer, the parameter tracks would be time scaled, and the rate-changed speech would then be generated by the vocoder synthesizer. However, because the fundamental consideration in the formulation of a vocoder is bandwidth reduction, the vocoders currently available in the prior art simply do not provide the high level of speech quality and naturalness we seek to attain. For example, a large class of vocoders require voiced-unvoiced decisions and pitch extraction. The resulting detection errors introduce artifacts to which the ear is particularly sensitive and which are not tolerable for our purpose.
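The time scaling of a parameter track can be sketched as a resampling of the track along the frame axis. The example below is a hypothetical illustration only: the function name, the use of linear interpolation, and the sample pitch track are assumptions, and no particular vocoder's parameter format is implied.

```python
def time_scale_track(track, rate):
    """Resample a parameter track so that it lasts about 1/rate times as long.

    `track` holds one parameter value per analysis frame (at least two
    frames). Values are linearly interpolated between frames; they are not
    themselves scaled, so a pitch track keeps its original frequency range.
    """
    last = len(track) - 1
    n_out = max(1, round(last / rate) + 1)
    out = []
    for i in range(n_out):
        pos = min(i * rate, last)          # position on the original frame axis
        k = min(int(pos), last - 1)
        frac = pos - k
        out.append((1 - frac) * track[k] + frac * track[k + 1])
    return out

if __name__ == "__main__":
    pitch_hz = [100, 110, 120, 130, 140]       # hypothetical per-frame pitch values
    print(time_scale_track(pitch_hz, 0.5))     # slowed down: more frames
    print(time_scale_track(pitch_hz, 2.0))     # sped up: fewer frames
```

In contrast to a change in playback speed, only the number of frames changes here; the parameter values handed to the synthesizer stay in their original range.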
Of the remaining classical vocoders, the only one that does not require voiced-unvoiced decisions and pitch extraction, yet is flexible enough to permit rate changes of speech, is the phase vocoder. The phase vocoder is a speech analysis/synthesis system based on short-time Fourier analysis and, unlike most vocoders, can be formulated to be an identity system in the absence of parameter modification. Furthermore, there is evidence that the ear is much less sensitive to errors in the short-time spectrum of an acoustic signal than to errors in the time-domain waveform. Unfortunately, because the theory of short-time Fourier analysis and its application to speech signals has not been well understood, previous applications of the phase vocoder to changing the rate of speech generally did not achieve the quality potentially attainable from this prior-art technique, a quality which is now provided by the present invention.
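The identity property can be checked numerically with a small short-time Fourier analysis/synthesis sketch. It assumes a Hann-type window, a direct (slow) discrete Fourier transform, and overlap-add synthesis normalized by the summed window; none of these choices is specified in the text above, and the sketch is not the phase vocoder of the invention.

```python
import cmath
import math

def hann_like(n):
    """Hann-shaped window sampled at half-sample offsets so no value is zero."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * (t + 0.5) / n) for t in range(n)]

def dft(frame):
    """Direct discrete Fourier transform (adequate for short frames)."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    """Inverse of `dft`, returning the real part of each sample."""
    n = len(spec)
    return [sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

def analyze(signal, window, hop):
    """Short-time spectra of windowed frames taken every `hop` samples."""
    n = len(window)
    return [dft([signal[s + t] * window[t] for t in range(n)])
            for s in range(0, len(signal) - n + 1, hop)]

def synthesize(frames, window, hop, length):
    """Overlap-add the inverse transforms and divide by the summed window."""
    n = len(window)
    out = [0.0] * length
    norm = [0.0] * length
    for m, spec in enumerate(frames):
        frame = idft(spec)
        for t in range(n):
            out[m * hop + t] += frame[t]
            norm[m * hop + t] += window[t]
    return [o / w if w > 1e-9 else 0.0 for o, w in zip(out, norm)]

if __name__ == "__main__":
    signal = [math.sin(0.3 * i) + 0.5 * math.cos(1.1 * i) for i in range(64)]
    window = hann_like(16)
    frames = analyze(signal, window, 4)
    rebuilt = synthesize(frames, window, 4, len(signal))
    print("max reconstruction error:",
          max(abs(a - b) for a, b in zip(signal, rebuilt)))
```

With unmodified spectra the reconstruction error is at the level of floating-point rounding, so any degradation in a rate-changed output would come from how the short-time spectra are modified rather than from the analysis/synthesis framework itself.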