This invention relates generally to digital audio signal processing. More particularly, it relates to a method for modifying the output rate of audio signals without changing the pitch, using an improved synchronized overlap-and-add (SOLA) algorithm.
A variety of applications require modification of the playback rate of audio signals. Techniques falling within the category of Time Scale Modification (TSM) include both compression (i.e., speeding up) and expansion (i.e., slowing down). Audio compression applications include speeding up radio talk shows to permit more commercials, allowing users or disc jockeys to select a tempo for dance music, speeding up playback rates of dictation material, speeding up playback rates of voicemail messages, and synchronizing audio and video playback rates. Regardless of the type of input signal (speech, music, or combined speech and music), the goal of TSM is to preserve the pitch of the input signal while changing its tempo. Clearly, simply increasing or decreasing the playing rate necessarily changes pitch.
The synchronized overlap-and-add technique was introduced in 1985 by S. Roucos and A. M. Wilgus in “High Quality Time Scale Modification for Speech,” IEEE Int. Conf. ASSP, 493-496, and is still the foundation for many recently developed techniques. The method is illustrated schematically in FIG. 1A. A digital input signal 10 is obtained by digitally sampling an analog audio signal to obtain a series of time domain samples x(t). Input signal 10 is divided into overlapping windows, blocks, or frames 12, each containing N samples and offset from one another by Sa samples (“a” for analysis). Scaled output 14 contains samples y(t) of the same overlapping windows, offset from each other by a different number of samples, Ss (“s” for synthesized). Output 14 is generated by successively overlapping input windows 12 with a different time lag than is present in input 10. The time scale ratio α is defined as Sa/Ss; α greater than 1 for compression and α less than 1 for expansion. A weighting function, such as a linear cross-fade, illustrated in FIG. 1B, is used to combine overlapped windows. To overlap an input block 16 with an output block 18, samples in the overlapped regions of input block 16 are scaled by a linearly increasing function, while samples in output block 18 are scaled by a linearly decreasing function, to generate new output signal 20. Note that the SOLA method changes the overall rate of the signal without changing the rates of individual windows, thereby preserving pitch.
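The linear cross-fade described above can be sketched as follows. This is a minimal illustration, not the implementation of any cited reference; the function name `linear_crossfade` and the exact weight spacing (weights strictly between 0 and 1) are assumptions made for the example.

```python
def linear_crossfade(output_tail, input_head):
    """Blend two equal-length overlapped regions with linear weights.

    output_tail: samples of the existing output block in the overlap region
                 (scaled by a linearly decreasing weight).
    input_head:  samples of the new input block in the overlap region
                 (scaled by a linearly increasing weight).
    """
    n = len(output_tail)
    if n != len(input_head):
        raise ValueError("overlapped regions must have equal length")
    blended = []
    for i in range(n):
        # Assumed weight spacing: the input weight rises from 1/(n+1) to n/(n+1).
        w_in = (i + 1) / (n + 1)
        w_out = 1.0 - w_in
        blended.append(w_out * output_tail[i] + w_in * input_head[i])
    return blended
```

Because the two weights always sum to one, a region in which both blocks hold the same values passes through unchanged.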
To maximize quality of the resulting signal 14, frames are not overlapped at a predefined separation distance. The actual offset is chosen, typically within a given range, to maximize a similarity measure between the two overlapped frames, ensuring optimal sound quality. For each potential overlap offset within a predefined search range, the similarity measure is calculated, and the chosen offset is the one with the highest value of the similarity measure. For example, a correlation function between the two frames may be computed by multiplying x(t) and y(t) at each offset. This technique produces a signal of high quality, i.e., one that sounds natural to a listener, and high intelligibility, i.e., one that can be understood easily by a listener. A variety of quality and intelligibility measures are known in the art, such as total harmonic distortion (THD).
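A search of this kind might look like the sketch below, which evaluates a sum-of-products correlation at every candidate offset in a predefined range and keeps the offset with the highest value. The names `correlation_at` and `find_best_offset`, and the parameters `nominal` and `search`, are hypothetical and chosen only for this illustration.

```python
def correlation_at(output, frame, offset):
    """Sum of products of `frame` with `output`, with the frame placed at `offset`."""
    total = 0.0
    for i, sample in enumerate(frame):
        j = offset + i
        if 0 <= j < len(output):  # samples falling outside the output contribute nothing
            total += sample * output[j]
    return total


def find_best_offset(output, frame, nominal, search):
    """Return the offset in [nominal - search, nominal + search] with the highest correlation."""
    candidates = range(nominal - search, nominal + search + 1)
    return max(candidates, key=lambda k: correlation_at(output, frame, k))
```

For example, with `output = [0, 0, 0, 1, 2, 1, 0, 0, 0, 0]` and `frame = [1, 2, 1]`, a search of two samples around a nominal offset of 4 selects offset 3, where the two waveforms coincide.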
The basic SOLA framework permits a variety of modifications in window size selection, similarity measure, computation methods, and search range for overlap offset. U.S. Pat. No. 5,479,564, issued to Vogten et al., discloses a method for selecting the window of the input signal based on a local pitch period. A speaker-dependent method known as WSOLA-SD is disclosed in U.S. Pat. No. 5,828,995, issued to Satyamurti et al. WSOLA-SD selects the frame size of the input signal based on the pitch period. A drawback of these and other pitch-dependent methods is that they can only be used with speech signals, and not with music. Furthermore, they require the additional steps of determining whether the signal is voiced or unvoiced, which can change for different portions of the signal, and for voiced signals, determining the pitch. The pitch of speech signals is often not constant, varying in multiples of a fundamental pitch period. Resulting pitch estimates require artificial smoothing to move continuously between such multiples, introducing artifacts into the final output signal.
Typically, the location within an existing output frame at which a new input frame is overlapped is selected, based on the calculated similarity measure. However, some SOLA methods use the similarity measure to select overlap locations of input blocks. U.S. Pat. No. 5,175,769, issued to Hejna, Jr. et al., discloses a method for selecting the location of input blocks within a predefined range. The method of Hejna, Jr. requires fewer computational steps than does the original SOLA method. However, it introduces the possibility of skipping completely over portions of the input signal, particularly at high compression ratios (i.e., α ≥ 2). A speech rate modification method described in U.S. Pat. Nos. 5,341,432 and 5,630,013, both issued to Suzuki et al., determines the optimal overlap of two successive input frames that are then overlapped to produce an output signal. In the traditional SOLA method, in which input frames are successively overlapped onto output frames, each output frame can be a sum of all previously overlapped frames. With the method of Suzuki et al., however, input frames are overlapped only onto each other, preventing the overlap of multiple frames. In some cases, this limited overlap may decrease the quality of the resultant signal. Thus selecting the offset within the output signal is the most reliable method, particularly at high compression ratios.
Computational cost of the method varies with the input sampling rate and compression ratios. High sampling rates are desirable because they produce higher quality output signals. In addition, high compression ratios require high processing rates of input samples. For example, CD quality audio corresponds to a 44.1 kHz sampling rate; at a compression ratio of α=4, approximately 176,000 input samples must be processed each second to generate CD quality output. In order to process signals at high input sampling rates and high compression ratios, computational efficiency of the method is essential. Calculating the similarity measure between overlapping input and output sample blocks is the most computationally demanding part of the algorithm. A correlation function, one potential similarity measure, is calculated by multiplying corresponding samples of input and output blocks for every possible offset of the two blocks. For an input frame containing N samples, N² multiplication operations are required. At high input sampling rates, for N on the order of 1000, performing N² operations for each input frame is unfeasible.
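The figures quoted above can be checked with a few lines of arithmetic. The helper names below are hypothetical; the numbers are the ones given in the text.

```python
SAMPLE_RATE_HZ = 44_100  # CD-quality sampling rate
ALPHA = 4                # compression ratio used in the example
FRAME_LEN = 1000         # N "on the order of 1000"


def input_samples_per_second(sample_rate_hz, alpha):
    """Input samples consumed per second of output at compression ratio alpha."""
    return sample_rate_hz * alpha


def direct_correlation_multiplies(frame_len):
    """Multiplications for a direct correlation over every possible offset (N squared)."""
    return frame_len ** 2


rate = input_samples_per_second(SAMPLE_RATE_HZ, ALPHA)      # 176400
per_frame = direct_correlation_multiplies(FRAME_LEN)        # 1000000
```

That is, roughly 176,000 input samples per second, and one million multiplications for every frame whose offset is searched exhaustively.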
As a result, the trend in SOLA is to simplify the computation to reduce the number of operations performed. One solution is to use an absolute error metric, which requires only subtraction operations, rather than a correlation function, which requires multiplication. U.S. Pat. No. 4,864,620, issued to Bialick, discloses a method that uses an Average Magnitude Difference Function (AMDF) to select the optimal overlap. The AMDF averages the absolute value of the difference between the input and output samples for each possible offset, and selects the offset with the lowest value. U.S. Pat. No. 5,832,442, issued to Lin et al., discloses a method employing an equivalent mean absolute error in overlap. While absolute error methods are significantly less computationally demanding, they are not as reliable or as well accepted as correlation functions in locating optimal offsets. A level of accuracy is sacrificed for the sake of computational efficiency.
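An absolute error search of the kind described above might be sketched as follows. This is an illustration of the general AMDF idea only, not the method of either cited patent; the function names are hypothetical, and the caller is assumed to supply only offsets at which the frame lies entirely inside the output buffer.

```python
def amdf(output, frame, offset):
    """Average magnitude difference between `frame` and `output` starting at `offset`.

    Assumes 0 <= offset and offset + len(frame) <= len(output).
    """
    diffs = [abs(frame[i] - output[offset + i]) for i in range(len(frame))]
    return sum(diffs) / len(diffs)


def amdf_best_offset(output, frame, offsets):
    """Return the candidate offset with the lowest average magnitude difference."""
    return min(offsets, key=lambda k: amdf(output, frame, k))
```

Note that only subtractions, absolute values, and additions appear in the inner loop, and that the best offset is the one with the lowest value rather than the highest.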
The overwhelming majority of existing SOLA methods reduce complexity by selecting a limited search range for determining optimal overlap offsets. For example, U.S. Pat. No. 5,806,023, issued to Satyamurti, discloses a method in which the optimal overlap is selected within a predefined search range. The Bialick patent mentioned above uses the input signal pitch period to determine the search range. In “An Edge Detection Method for Time Scale Modification of Acoustic Signals,” by Rui Ren, an improved SOLA technique is introduced. Again, the method of Ren uses a small search window, in this case an order of magnitude smaller than the input frame, to locate the optimal offset. It also uses edge detection and is therefore specific to a type of signal, generating different overlaps for different types of signals.
A prior art method that limits the search range for optimal overlap offset is illustrated in the example of FIG. 2. The best position within an output block 24 y(t) to overlap an input block 22 x(t) is located. Output block y(t) has a length of So+H+L samples, and input block x(t) has a length of So samples. In this case, the search range over which the similarity measure is computed is H+L samples; that is, the range of potential lag values is equal to the difference in length between the two sample blocks being compared. Three possible values of overlap lags are illustrated: −L, 0, and +H. In this method, the similarity measure 26 has a rectangular envelope shape over the range of lag values for which it is evaluated. This means that when averaged across all possible signals, the position of maximum value of the similarity measure has an equal or flat probability distribution within the range of lag values for which it is evaluated. This feature is not dependent on the type of similarity measure used, but is instead a result of comparing an equal number of samples from both segments for all potential lag values.
By limiting the search range, all of the prior art methods are likely to predict overlap offset incorrectly during quickly changing or complicated mixed signals. In addition, by predetermining a relatively narrow search range, these methods essentially fix the compression ratio to be very close to a known value. Thus they are incapable of processing input signals sampled at highly varying rates. In general, they are best for small overlaps of relatively long frames, which cannot produce high (i.e., α ≥ 2) compression ratios.
There is a need, therefore, for an improved time scale modification method that is computationally feasible, highly accurate, and applicable to a wide range of audio signals.
Accordingly, it is a primary object of the present invention to provide a time scale modification method for altering the playback rate of audio signals without changing their pitch.
It is a further object of the invention to provide a time scale modification method that can process speech, music, or combined speech and music signals.
It is an additional object of the invention to provide a time scale modification method that generates output at a constant, real-time rate from input samples at a variable, non-real-time rate.
It is another object of the present invention to provide a time scale modification method that provides a variable compression ratio, determined by the required output rate and variable input rate.
It is a further object of the invention to provide a time scale modification method that can overlap input and output frames over the entire range of the output frame, and not just over a specified narrow search range, while remaining computationally efficient. Successive frames may even be inserted behind previous frames, allowing for high quality output at high compression ratios.
It is an additional object of the invention to provide a time scale modification method that uses a correlation function to determine optimal offset of overlapped input and output frames. A correlation function is well known to be a maximum likelihood estimator, unlike absolute error metric methods.
Finally, it is an object of the present invention to provide a time scale modification method that does not require determination of pitch or other signal characteristics.
These objects and advantages are attained by a method for time scale modification of a digital audio input signal, containing input samples, to form a digital audio output signal, containing output samples. The method has the following steps: selecting an input block of N/2 input samples; selecting an output block of N/2 output samples; determining an optimal offset T for overlapping the beginning of the input block with the beginning of the output block; and overlapping the blocks, offsetting the input block beginning from the output block beginning by T samples. T has a possible range of −N/2 to N/2, and is calculated by taking discrete frequency transforms of the N/2 input samples and the N/2 output samples, and then computing their correlation function. The maximum value of an inverse discrete frequency transform of the correlation function occurs for a value of offset t=T. The frequency transform is preferably a discrete Fourier transform, but it may be any other frequency transform such as a discrete cosine transform, a discrete sine transform, a discrete Hartley transform, or a discrete transform based on wavelet basis functions. Preferably, N/2 zeroes are appended to the input samples and to the output samples before the frequency transform is performed, to prevent wrap-around artifacts. Preferably, the correlation function is Z(k)=X*(k)·Y(k), for k=0, . . . , N/2−1, where X*(k) are the complex conjugates of the frequency transformed input samples, Y(k) are the frequency transformed output samples, and Z(k) are the products of their complex multiplication. Preferably, Z(k) is normalized before the inverse frequency transform is performed.
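The offset calculation summarized above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the claimed implementation: the helper names `fft` and `best_offset` are hypothetical, the small recursive radix-2 transform is included only to keep the example self-contained and requires the padded length N to be a power of two, all N transform bins are multiplied for simplicity, and the optional normalization of Z(k) is omitted.

```python
import cmath


def fft(a, inverse=False):
    """Recursive radix-2 discrete Fourier transform (unscaled); len(a) must be a power of two."""
    n = len(a)
    if n == 1:
        return [complex(a[0])]
    sign = 1 if inverse else -1
    even = fft(a[0::2], inverse)
    odd = fft(a[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def best_offset(x_block, y_block):
    """Return the offset T, in samples, that maximizes the correlation of the two blocks.

    x_block: N/2 input samples; y_block: N/2 output samples.
    A positive T places the start of the input block T samples after the
    start of the output block; a negative T places it before.
    """
    half = len(x_block)
    if len(y_block) != half:
        raise ValueError("blocks must have equal length")
    n = 2 * half
    # Append N/2 zeroes to each block to prevent wrap-around artifacts.
    X = fft(list(x_block) + [0.0] * half)
    Y = fft(list(y_block) + [0.0] * half)
    # Z(k) = X*(k) . Y(k): conjugate of the input transform times the output transform.
    Z = [xk.conjugate() * yk for xk, yk in zip(X, Y)]
    # The inverse transform gives the correlation at every lag at once.
    z = [v.real / n for v in fft(Z, inverse=True)]
    t = max(range(n), key=lambda i: z[i])
    # Indices in the upper half of the result represent negative lags.
    return t if t < half else t - n
```

For instance, if the output block holds the same pulse as the input block but two samples later, `best_offset` returns 2; if the pulse appears two samples earlier, it returns -2. With a fast transform the cost grows on the order of N log N rather than N², which is what makes a search over the whole block practical.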
The output signal is preferably output at a constant, real-time rate, which determines the selection of the beginning of the output block. The input signal may be obtained at a variable rate. Preferably, the input block size and location are selected independently of a pitch period of the input signal. The input block and output block are overlapped by applying a weighting function, preferably a linear function.
The present invention also provides a method for time scale modification of a multi-channel digital audio input signal, such as a stereo signal, to form a multi-channel digital audio output signal. The method has the following steps: obtaining individual input channels, independently modifying each input channel, and combining the output channels to form the multi-channel digital audio output signal. The individual channels can be obtained either by separating a multi-channel input signal into individual input channels, or by generating multiple input channels from a single-channel input signal. Each input channel is independently modified according to the above method for time scale modification of a digital input signal. There is no correlation between overlapped blocks of the different audio channels; corresponding samples of input channels no longer correspond in the output signals. However, the listener is able to integrate perceptually the different channels to accommodate the lack of correspondence.
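The three steps above (obtain channels, modify each independently, recombine) might be organized as in the sketch below. The interleaved sample layout, the function names, and the `modify_channel` callback are assumptions made for illustration; the callback stands in for whatever single-channel time scale modification routine is used. Truncating to the shortest channel when recombining is likewise an assumption of this sketch.

```python
def split_channels(interleaved, n_channels):
    """Separate an interleaved multi-channel signal into one list per channel."""
    return [interleaved[c::n_channels] for c in range(n_channels)]


def combine_channels(channels):
    """Interleave per-channel outputs, truncating to the shortest channel."""
    length = min(len(ch) for ch in channels)
    combined = []
    for i in range(length):
        for ch in channels:
            combined.append(ch[i])
    return combined


def modify_multichannel(interleaved, n_channels, modify_channel):
    """Apply `modify_channel` to each channel independently, then recombine."""
    channels = split_channels(interleaved, n_channels)
    modified = [modify_channel(ch) for ch in channels]
    return combine_channels(modified)
```

Each channel is passed to the callback on its own, so overlap offsets chosen for one channel have no influence on the others.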
Also provided is a digital signal processor containing a processing unit configured to carry out method steps for implementing the time scale modification method described above.