The usefulness of an economical system for real time pitch changing of an audio signal or for speech compression and/or expansion (that is, pitch restoration of the audio signal generated by speeded or slowed playback of a recording) is well recognized today. The early forms of such systems were electromechanical tape players with moving magnetic read heads. These systems produced the equivalent of cutting the record tape into short segments and splicing alternate segments together. These early schemes have been replaced by all-electronic systems such as those described in Schiffman patents U.S. Pat. No. 3,786,195 and U.S. Pat. No. 3,936,610 which have been widely used commercially.
The Schiffman approach and most other practical systems rely on a pitch change-splice approach. That is, in the case of audio pitch lowering, regular segments of the signal are stretched to achieve pitch change and the intervening remainders are deleted resulting in discontinuities created by the deletion. In the case of audio pitch raising, the repetitive pitch change is accomplished by compressing the time interval occupied by the signal segments thus creating gaps; the compressed segments are then repeated as necessary to fill the gaps created by the compressing of the signal.
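The pitch change-splice idea described above can be illustrated with a minimal sketch. It is not the Schiffman circuitry: the block length, the linear interpolation, and the names `resample_linear`, `pitch_change_splice`, and `rising_zero_crossings` are assumptions chosen only for illustration. For pitch lowering, the leading part of each block is stretched to fill the block and the remainder is deleted; for pitch raising, the block is compressed and the compressed segment is repeated to fill the gap.

```python
import math


def resample_linear(segment, out_len):
    """Linearly interpolate `segment` to exactly `out_len` samples."""
    if out_len <= 0 or not segment:
        return []
    step = len(segment) / out_len  # input samples consumed per output sample
    last = len(segment) - 1
    out = []
    for i in range(out_len):
        pos = i * step
        j = min(int(pos), last)
        frac = pos - j
        out.append(segment[j] * (1.0 - frac) + segment[min(j + 1, last)] * frac)
    return out


def pitch_change_splice(signal, ratio, block=256):
    """Change pitch by `ratio` (<1 lowers, >1 raises) without changing duration."""
    out = []
    for start in range(0, len(signal), block):
        seg = signal[start:start + block]
        n = len(seg)
        if ratio <= 1.0:
            # Lowering: stretch the leading part, delete the remainder.
            keep = max(1, round(n * ratio))
            out.extend(resample_linear(seg[:keep], n))
        else:
            # Raising: compress the block, then repeat it to fill the gap.
            short = resample_linear(seg, max(1, round(n / ratio)))
            repeats = -(-n // len(short))  # ceiling division
            out.extend((short * repeats)[:n])
    return out


def rising_zero_crossings(signal):
    """Count negative-to-nonnegative transitions as a crude pitch indicator."""
    return sum(1 for a, b in zip(signal, signal[1:]) if a < 0.0 <= b)


# Hypothetical test tone with a period of 20 samples.
tone = [math.sin(2 * math.pi * n / 20 + 0.3) for n in range(2048)]
lowered = pitch_change_splice(tone, 0.5)
raised = pitch_change_splice(tone, 2.0)
```

Both outputs have the same length as the input, and the splices at block boundaries are made with no regard to the waveform, which is the source of the discontinuities discussed here.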
Continual work has been done on improving the sound quality of the "pitch change-splice" methods, mostly centered on improving the splicing scheme. The suggested approaches usually involved a rather microscopic analysis of the waveform at splice points, the splice points having generally been predetermined by system constraints regardless of the instantaneous or general characteristics of the waveform being processed. That is, the focus has been on the instantaneous values of waveform parameters (such as level, slope, and/or direction, i.e., polarity of slope) and on matching, with respect to one or more of those values, the trailing edge of the segment to be terminated with the leading edge of the segment to be connected next. Zero crossing splicing (with and without coincidence of polarity), level matching, overlap schemes and others have been tried, but the improvement in sound quality generally was less than expected.
One example of a digital zero energy level matching scheme is found in the patent to Lee, U.S. Pat. No. 3,803,363, where audio signals were converted into digital format, stored in random access memory, and read out at a rate different from that at which they were written into memory. When the addresses at which memory access for write and read were taking place came close to converging (which occurred because the write and read rates were different), the read pointer jumped to a new address which was selected to have a low energy level or "zero crossing."
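A rough software analogue of this kind of scheme is sketched below. It is not the circuit of the Lee patent: the memory size, the guard distance, the search window, and the name `play_with_jumps` are all assumptions made for illustration. Samples are written into a circular memory at one address per tick and read at a different rate; when the read address is about to converge on the write address, the read pointer jumps to the far side of memory, landing on the candidate address whose stored level is lowest.

```python
import math


def play_with_jumps(signal, read_rate, mem_size=512, guard=16, search=32):
    """Write one sample per tick; read `read_rate` samples per tick.

    Returns (output, landings), where `landings` holds the stored sample
    value at every address the read pointer jumped to.
    Assumes search <= mem_size - 2 * guard.
    """
    memory = [0.0] * mem_size
    out, landings = [], []
    r = 0.0  # absolute (unwrapped) read position
    for w, sample in enumerate(signal):  # w: absolute write position
        memory[w % mem_size] = sample
        if w < mem_size - 1:
            out.append(0.0)  # prime the memory before playback starts
            continue
        lo = w - mem_size + 1 + guard  # oldest address that is safe to read
        hi = w - guard                 # newest address that is safe to read
        if not lo <= r <= hi:
            # Pointers are converging: jump to the far side of memory and
            # pick the candidate address with the lowest stored level.
            if read_rate > 1.0:
                candidates = range(lo, lo + search)
            else:
                candidates = range(hi - search + 1, hi + 1)
            target = min(candidates, key=lambda a: abs(memory[a % mem_size]))
            landings.append(memory[target % mem_size])
            r = float(target)
        out.append(memory[int(r) % mem_size])
        r += read_rate
    return out, landings


# Hypothetical test tone with a period of 20 samples, read 1.5x faster than written.
tone = [math.sin(2 * math.pi * n / 20) for n in range(3000)]
output, landings = play_with_jumps(tone, 1.5)
```

Only the landing address is chosen for low level in this sketch; the sample being left behind is not examined, so a discontinuity can still occur at the splice.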
Another digital scheme which provided for writing and reading at different rates in the digital memory conditioned the jump so that, when the addresses converged, the signals in storage were examined and the jump was delayed until a suitable match between the waveforms was located. This system, as described in the patent to Jusko et al., U.S. Pat. No. 4,121,058, provided additional features such as looping for review of specific portions of the message and interrupting the input storage in order to hold the segment under review in memory.
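The notion of delaying the jump until a suitable waveform match is found can be sketched as follows. This is only an illustration and not the method of the Jusko et al. patent: the mismatch measure (level difference plus slope difference), the tolerance, and the names `mismatch` and `delayed_jump_target` are assumptions.

```python
import math


def mismatch(samples, a, b):
    """Level-plus-slope mismatch between positions a and b."""
    level = abs(samples[a] - samples[b])
    slope = abs((samples[a + 1] - samples[a]) - (samples[b + 1] - samples[b]))
    return level + slope


def delayed_jump_target(samples, current, candidates, tolerance=0.05):
    """Scan `candidates` in order and return the first one that matches
    `current` within `tolerance`; fall back to the best one seen."""
    best, best_score = None, float("inf")
    for c in candidates:
        score = mismatch(samples, current, c)
        if score <= tolerance:
            return c
        if score < best_score:
            best, best_score = c, score
    return best


# Hypothetical stored tone with a period of 20 samples.
stored = [math.sin(2 * math.pi * n / 20) for n in range(100)]
target = delayed_jump_target(stored, 3, range(41, 70))
```

Scanning the candidates in order mimics postponing the jump: earlier addresses are passed over until one agrees with the current position in both level and slope.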
In each of the foregoing digital schemes of Lee and Jusko et al., the jump of the read pointer to its new address in memory is preselected to utilize substantially all of the memory capacity such that the initial differential between the write and read pointers is constant except for the small variation occasioned by the microscopic examination and adjustment made to provide a signal level match.
Research such as that done by Ian Bennet (May, 1975, Stanford University Doctoral Dissertation in Dept. of Electrical Engineering, A Study of Speech Compression Using Analog Time Domain Sampling Techniques) has shown that in the case where the audio signal is speech, if the signal segments which are stretched or compressed by the processing circuit are synchronous with the pitch periods of the fundamental voiced frequency, there is significant improvement in the sound quality of the processed audio. (Note that if the fundamental voice frequency is extracted and examined, then the pitch period is simply the period of that fundamental.) The complete (unfiltered) speech waveform, however, is not a pure sinusoid, even for voiced sounds, but rather a repetitive pattern, each period of which generally begins with a glottal pulse followed by a damped waveform over the remainder of the epoch. Some schemes for pitch synchronous processing have been described, but they generally become quite elaborate and complicated because they require detection of the beginning of epochs (i.e., the glottal pulse) and processing by discarding or repeating one or more integral epochs.
Neuburg (Neuburg, Edward P., "Simple pitch-dependent algorithm for high-quality speech rate changing", J. Acoust. Soc. Am., 63 (2), February 1978) has suggested a new version of the original cut and splice method. Neuburg has proposed that for pitch lowering, the deletion (or in the case of pitch raising, the repetition) of segments equal in length to an epoch, but regardless of where they started or ended, would produce good results.
This was explained in terms of speech characteristics where, for many voiced sounds, successive epochs contain a repetition of almost identical waveforms of the same pitch period which may continue for many such pitch periods. Thus, deletion of any segment equal in length to the pitch period maintains the cadence of the pitch periods. This approach was stated as leading to a major improvement, which could not result from splicing techniques which focus solely on "microscopic" matching of waveform parameters, and could in theory at least be accomplished more readily and simply than true pitch synchronous systems. Moreover, this approach automatically results in a fair degree of wave matching in the "microscopic" sense, since to the extent that the pitch period and waveform do not change from epoch to epoch, the end of the one segment and the beginning of another (with one or two pitch periods deleted in between) will often match closely in regard to level, slope, etc.
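The effect described above can be checked on an idealized voiced sound. The sketch below is an illustration under stated assumptions, not the published algorithm: the synthetic epoch (a damped oscillation standing in for a glottal pulse and its decay), the average magnitude difference used to estimate the pitch period, and the names `estimate_period` and `delete_one_period` are all hypothetical choices.

```python
import math


def estimate_period(signal, min_lag, max_lag):
    """Return the lag in [min_lag, max_lag] with the smallest average
    magnitude difference between the signal and its shifted copy."""
    best_lag, best_score = min_lag, float("inf")
    for lag in range(min_lag, max_lag + 1):
        n = len(signal) - lag
        score = sum(abs(signal[i] - signal[i + lag]) for i in range(n)) / n
        if score < best_score:
            best_lag, best_score = lag, score
    return best_lag


def delete_one_period(signal, start, period):
    """Remove `period` samples beginning at an arbitrary `start`."""
    return signal[:start] + signal[start + period:]


# Synthetic voiced sound: ten identical 80-sample epochs, each a damped
# oscillation (a stand-in for a glottal pulse followed by its decay).
epoch = [math.exp(-n / 15.0) * math.cos(2 * math.pi * n / 16.0) for n in range(80)]
voiced = epoch * 10

period = estimate_period(voiced, 20, 150)
shortened = delete_one_period(voiced, 37, period)
```

Because the epochs here are identical, removing one pitch period starting mid-epoch leaves a waveform indistinguishable from the original with one epoch fewer, so level and slope match at the splice without any explicit matching step. Real speech changes from epoch to epoch, so the match would only be approximate.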