A need exists in the art for a method for time-scale modification of acoustic signals such as speech or music and, in particular, a need exists for such a method which will provide time-scale modification without modifying the pitch or local period of the time-scale modified signals. Thus, a need exists for a method for changing the perceived rate of articulation while ensuring that the local pitch period of the resulting signal remains unchanged, i.e., there are no "Alvin the Chipmunk" effects, and that no audible splicing, reverberation, or other artifacts are introduced.
Specifically, time-scale modification ("TSM") of a signal by time-scale compression, i.e., a method for speeding-up a playback rate of the signal, or by time-scale expansion, i.e., a method for slowing-down the playback rate of the signal, is needed to match the time-scale of the signal with a predetermined duration. For example, TSM can be used: (a) by a radio station to speed up dance music; (b) by a blind person to speed up a recorded lecture; (c) by a student of a foreign language to slow down instructional material; (d) by an editor to synchronize a dubbed sound track with a video signal and to compress them into convenient time slots; (e) by a secretary to slow down or speed up a dictation tape for transcription; (f) by a voicemail system to provide a message to a listener at a faster or slower rate than that at which the message was recorded; and so forth.
When a segment of an input signal is compressed to speed-up the signal, the informational content of the compressed signal is reduced relative to that contained in the input signal to produce an output segment of shorter duration. Ideally, compression should delete an integer multiple of local pitch periods and these deletions should be distributed evenly throughout the input segment. Further, to preserve intelligibility, no phoneme should be removed completely.
When a segment of an input signal is expanded to slow-down the signal, the information content of the expanded signal is increased relative to that contained in the input signal to produce an output segment of longer duration. Ideally, expansion should insert additional pitch periods which are distributed evenly throughout the input segment. This proves to be difficult in practice, however, since the local pitch period varies across phonemes and may be difficult to gauge during nonperiodic portions of a speech signal such as fricatives.
Several methods have been developed in the prior art to provide TSM. Previously, TSM was accomplished using three basic methods: frequency domain processing methods, analysis/synthesis methods, and time-domain processing methods. However, all of these prior art methods have drawbacks. For example, an article entitled "Signal Estimation from Modified Short-Time Fourier Transform" by D. W. Griffin and J. S. Lim in IEEE Transactions on ASSP, Vol. ASSP-32, No. 2, April, 1984, pp. 236-243, introduced a frequency-domain processing method which iteratively synthesizes an output signal having a spectrogram which is a compressed or expanded version of a spectrogram of an input signal. Although the disclosed method works well on almost any acoustic material, it has a drawback in that it requires a large amount of computation. As a result, even though this prior art frequency domain processing method is robust, it is so computationally intensive that it cannot be utilized in many real-time applications.
Analysis/synthesis methods operate by reducing an input speech signal into a set of time-varying parameters which can be time-scaled, this being referred to as analysis, and by utilizing the time-varying parameters to construct a time-scale modified signal, this being referred to as synthesis. For example, a method suggested by T. F. Quatieri and R. J. McAulay in an article entitled "Speech Transformations Based on a Sinusoidal Representation," IEEE Transactions on ASSP, Vol. ASSP-34, December, 1986, pp. 1449-1464 utilizes a limited number of sinusoids to model a speech signal. Then, in accordance with the disclosed method, the time-scale of the input signal is modified by varying the rate at which the sequence of sinusoids is played back. Although such analysis/synthesis methods require less computation than frequency domain processing methods, they have a drawback in that they are restricted to signals which can be represented by a limited number of time-varying parameters. As a result, analysis/synthesis methods generally perform poorly on more complex signals, such as speech signals which are corrupted by noise or which contain music.
Time-domain methods operate by inserting or deleting segments of a speech signal. One of the original time-domain methods of TSM was proposed in the 1940s and entailed splicing, i.e., abutting, different regions of a signal at a fixed rate to compress or expand tape recordings. This method results in discontinuities in transitions between inserted or deleted segments and such discontinuities lead to bothersome clicks and pops in the resulting time-scale modified signal.
Several attempts have been made in the art to minimize the effects of inter-segment transitions in a time-scale modified signal by improving the splicing method or by windowing adjacent segments. In general, these methods improve quality at the expense of increased complexity. One such method of time-domain TSM, i.e., "Time-Domain Harmonic Scaling" ("TDHS"), is disclosed in an article entitled "Time-Domain Algorithms for Harmonic Bandwidth Reduction and Time Scaling of Speech Signals" by D. Malah, IEEE Transactions on ASSP, Vol. ASSP-27, April, 1979, pp. 121-133. This article discloses a TDHS algorithm which improves on the original method of splicing by synchronizing splice points to a local pitch period and by using overlap-add techniques to fade smoothly between the splices. In particular, the TDHS algorithm operates by determining the location of each pitch period in the input signal to be modified and then by segmenting the signal around these pitch periods to achieve the desired modification. In accordance with this TDHS method, an integer number of pitch periods has to be inserted or deleted and it is necessary to maintain a record of the modifications to ensure that an appropriate number thereof took place. The TDHS method provides good quality in the class of low complexity time-domain methods.
An alternative to the TDHS method is disclosed in an article entitled "High Quality Time-Scale Modification for Speech" by S. Roucos and A. M. Wilgus, Proceedings ICASSP 85, Tampa, March, 1985, pp. 493-496. This article discloses a Synchronized Overlap-Add ("SOLA") time-domain processing method which has low complexity and which operates without regard to pitch periods in a speech signal. In accordance with the SOLA method, an input signal is sampled and the samples are segmented at a fixed analysis rate into frames, referred to as windows, and the windows are shifted in time to maintain a predetermined average time-compression or expansion. The windows are then overlap-added at a dynamic synthesis rate to provide an output. In accordance with this method, the input signal is windowed using a fixed, inter-frame shift interval and the output signal is reconstructed using dynamic, inter-frame shift intervals. The inter-frame shift interval used during reconstruction is allowed to vary so that a shift which maximizes the cross-correlation of a current window with previous windows is used. Hence, this method results in a region of overlap which is dynamic between windows and which requires evaluation of a cross-correlation with a variable number of points. As a result, this method allows one to change the relative overlap between windows which, in turn, modifies the time-scale of the input signal without significantly affecting the periods in the signal.
The SOLA method may be understood in light of the following description, which should be read in conjunction with FIG. 1. First, with reference to FIG. 1, there are four parameters which are used in the SOLA method: (a) window length W is the duration of the windowed segments of the input signal; this parameter is the same for the input and output buffers and represents the smallest unit of the input signal, for example, speech, that is manipulated by the method; (b) analysis shift S_a is the interframe interval between successive windows along the input signal; (c) synthesis shift S_s is the interframe interval between successive windows along the unshifted output signal; and (d) shift search interval K_max is the duration of the interval over which a window may be shifted for purposes of aligning it with previous windows.
The SOLA method modifies the time-scale of an input signal in two steps which are referred to as analysis and synthesis, respectively. The analysis step comprises cutting up the input signal x[n], where n is a sample index and x[n] is the value of the n-th sample, into possibly overlapping windows, where x_m[n] is the n-th sample of the m-th input window. Each input window has a fixed length W and is separated by a fixed analysis distance S_a. In accordance with the SOLA method:

x_m[n] = x[m*S_a + n] for n = 0, . . . , W-1   (1)
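The analysis-step windowing described above may be sketched in code; the following is a minimal illustration only (NumPy and the toy parameter values are assumptions for the sketch, not part of this description):

```python
import numpy as np

def analysis_windows(x, W, S_a):
    """Cut input x into windows of fixed length W, spaced a fixed
    analysis distance S_a apart, i.e., x_m[n] = x[m*S_a + n]."""
    num = 1 + max(0, (len(x) - W) // S_a)
    return np.stack([x[m * S_a : m * S_a + W] for m in range(num)])

x = np.arange(10.0)                     # toy "signal" of 10 samples
wins = analysis_windows(x, W=4, S_a=2)
# Successive windows overlap by W - S_a = 2 samples:
# [0,1,2,3], [2,3,4,5], [4,5,6,7], [6,7,8,9]
```

Note that for S_a < W the windows overlap, while for S_a > W samples of the input are skipped between windows.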
The synthesis step comprises overlap-adding the windows from the analysis step every S_s samples. Each new window is aligned with the sum of previous windows before being added, to reduce discontinuities in the resulting signal which arise from the different interframe intervals used during analysis and synthesis; i.e., the windows are overlapped and recombined with the separation between them compressed or expanded so that, on average, the windows are separated by the new synthesis distance S_s. The ratio a = S_s/S_a gives the desired compression or expansion rate, where a > 1 corresponds to expansion and a < 1 corresponds to compression. The approximate duration of the modified signal is given by a * (duration of the input signal).
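As a minimal numerical sketch of the rate relation above (the parameter values below are illustrative assumptions, not values taken from this description):

```python
# Illustrative values only (assumed for this sketch).
S_a = 240                            # analysis shift, in samples
S_s = 180                            # synthesis shift, in samples
a = S_s / S_a                        # time-scale factor; a < 1 means compression
input_duration = 48000               # duration of the input signal, in samples
approx_output = a * input_duration   # approximate duration of the modified signal
print(a, approx_output)              # 0.75 36000.0
```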
The synthesis shift which is actually used for the m-th window x_m[n], i.e., x_m[n] = x[m*S_a + n] for n = 0, . . . , W-1, is adjusted by an amount k_m, which is less than or equal to K_max, in order to maximize a similarity measure of the data in the overlapping regions before the overlap-add step is carried out. As a result, in accordance with the SOLA method, the output y[i], where i is a sample index and y[i] is the value of the i-th sample, is formed recursively by:

y[m*S_s + k_m + n] ← b_m[n] * y[m*S_s + k_m + n] + (1 - b_m[n]) * x_m[n] for n = 0, . . . , W_OV^m - 1   (2)

and

y[m*S_s + k_m + n] ← x_m[n] for n = W_OV^m, . . . , W-1   (3)
where: W_OV^m is the number of overlap points for the m-th window and W_OV^m = k_{m-1} - k_m + W - S_s. Further, the shift k_m is selected to maximize a similarity measure, for example, the cross-correlation or the average magnitude difference, over the overlap region between the current output y and the m-th window x_m. Still further, b_m[n] is a fading factor between 0 and 1, for example, an averaging or a linear fade, which is chosen to minimize audible splicing artifacts.
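The overlap-add recursion of equations (2) and (3) can be sketched end to end in code. The following is an illustration under stated assumptions, not a definitive implementation of the method: NumPy is used, a normalized cross-correlation is assumed as the similarity measure, a linear fade is assumed for b_m[n], and all parameter values are arbitrary.

```python
import numpy as np

def sola(x, W=400, S_a=300, S_s=240, K_max=100):
    """Sketch of the SOLA recursion (parameter values illustrative).
    Windows of length W taken every S_a samples are overlap-added
    roughly every S_s samples; each window is shifted by k_m
    (0 <= k_m <= K_max) to maximize a normalized cross-correlation
    with the output built so far, then cross-faded over the W_OV^m
    overlapping samples with a linear fade b_m[n].
    Assumes len(x) >= W."""
    num_windows = 1 + max(0, (len(x) - W) // S_a)
    y = np.zeros(S_s * (num_windows - 1) + W + K_max)
    y[:W] = x[:W]                 # window m = 0 is copied directly
    end = W                       # number of valid output samples so far
    for m in range(1, num_windows):
        xm = x[m * S_a : m * S_a + W]
        # Search the shift k_m that best aligns xm with the current output.
        best_k, best_score = 0, -np.inf
        for k in range(K_max + 1):
            start = m * S_s + k
            ov = min(end - start, W)      # overlap length for this candidate k
            if ov <= 0:
                continue
            score = np.dot(y[start:start + ov], xm[:ov])
            norm = np.linalg.norm(y[start:start + ov]) * np.linalg.norm(xm[:ov])
            if norm > 0:
                score /= norm
            if score > best_score:
                best_score, best_k = score, k
        start = m * S_s + best_k
        ov = max(0, min(end - start, W))  # W_OV^m overlap samples
        if ov > 0:
            b = np.linspace(1.0, 0.0, ov)         # linear fade b_m[n], eq. (2)
            y[start:start + ov] = b * y[start:start + ov] + (1 - b) * xm[:ov]
        y[start + ov : start + W] = xm[ov:]       # non-overlapping tail, eq. (3)
        end = start + W
    return y[:end]

# Illustrative use on a synthetic 50 Hz tone sampled at 8 kHz.
fs = 8000
x = np.sin(2 * np.pi * 50 * np.arange(4000) / fs)
y = sola(x)
# With the default shifts, a = S_s/S_a = 240/300 = 0.8, so the output
# length should fall near 0.8 * len(x) = 3200 samples.
```

With the compression setting shown (a = 0.8), the output is shorter than the input while each retained window preserves the local waveform, which is the sense in which the pitch periods are left unmodified.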
The SOLA method has a drawback in that the amount of overlap W_OV^m between the output and the m-th analysis window varies with k_m, and this complicates the work required to compute the similarity measure and to fade across the overlap region. Also, depending on the shifts k_m, more than two windows may overlap in certain regions, and this further complicates the fading computation.
As a result, there is a need in the art for a method for modifying the time-scale of speech, music, or other acoustic material without modifying the pitch, which is robust, and which does not require excessive amounts of computation.