Time-scale modification (TSM) refers to the ability to compress or expand a digital signal in time while largely preserving the pitch, other dominant frequencies and phase of the signal. Thus, the frequencies present at time t in a digital signal would be the same frequencies present at time αt in the processed signal, where α can be <1 (the signal is sped up, or compressed in time) or >1 (the signal is slowed down, or expanded in time). If the signal is audio, the technique avoids the increase or decrease in pitch (e.g., the “chipmunk” sound in the former case) that results when the signal is merely played back at a different speed.
TSM is well known in the Art and a number of patents and patent applications in this area are listed on the USPTO website. This section discusses the patents and journal articles in the Prior Art believed to be most relevant to the present invention.
There are a number of useful applications of TSM. The following list is intended to be merely illustrative rather than exhaustive. TSM is used most obviously when one wishes to increase the playback speed of recorded digital audio speech. Blind people or people who otherwise suffer reading or sight disabilities often make use of this capability in digital players. General listeners who record lectures will do the same thing. TSM is also used in digital audio compression [Wilson et al., U.S. Pat. No. 6,173,255 B1], a technique wherein the file is first compressed (α<1) and, at a later time, expanded by 1/α. Another application is the suppression of uncorrelated noise, also discussed in [Wilson et al.], and a fourth application involves the synchronization of the audio signal of a video broadcast with the video signal when it is in fast-forward mode. Recently, TSM has also been used in various digital watermarking schemes.
As with much else in digital signal processing, there are two main avenues of approach to TSM: the frequency domain and the time domain. Call the original signal the source and the resulting processed signal the target. In most cases, the signal is conceptually partitioned into short frames to avoid the statistical non-stationarity inherent in most audio and video signals. In a frequency domain approach, the short-term discrete Fourier transform (or its equivalent) is used [Portnoff, 1981] to determine the frequencies in the source frame and in the target frame, and an iterative approach may be employed to minimize (in the least squares sense) the distance between the two transforms. Given sufficient time, this approach can provide good results in terms of audio fidelity, but it is computationally very intensive. For example, one minute of music sampled at 44.1 kHz stereo produces approximately 5.3 million digital samples, typically of two bytes each. A typical frame length of 20 milliseconds would contain 882 samples per channel. The analysis of each frame could involve iterating an indeterminate number of Fourier transforms of length up to 1024 (the first power of 2 greater than the frame size) and then repeating that fifty times each second.
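The figures quoted above can be checked with a few lines of arithmetic. The following is only a sketch of that check; the constant and function names are illustrative and do not come from any cited reference.

```python
SAMPLE_RATE_HZ = 44_100   # CD sampling rate quoted in the text
CHANNELS = 2              # stereo
FRAME_MS = 20             # typical frame length quoted in the text

# Samples in one minute of stereo audio (about 5.3 million).
samples_per_minute = SAMPLE_RATE_HZ * CHANNELS * 60

# Samples in one 20 ms frame of a single channel.
samples_per_frame = SAMPLE_RATE_HZ * FRAME_MS // 1000

# Number of frames that must be analyzed each second.
frames_per_second = 1000 // FRAME_MS


def next_power_of_two(n):
    """Return the smallest power of two that is >= n."""
    p = 1
    while p < n:
        p *= 2
    return p


# Transform length needed to cover one frame.
fft_length = next_power_of_two(samples_per_frame)
```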
[Roucos, et al., 1985] proposed a time-domain method for overlapping and aligning short frames of the target file against the corresponding source frames and then “cross-fading” the two frames together using a weighted average or other digital filter technique to create a final output frame. The acronym given to this technique is SOLA (synchronized overlap-add). The key idea in SOLA is the calculation of normalized cross-correlation coefficients r(k) between the digital values of the source frame and those of the target frame in order to determine the best point at which to align the two frames.
From [Roucos, 1985], the general correlation coefficients for the first frame and for frame m+1 are given by:
$$r(k) = \frac{\sum_{i=1}^{L-k} y(k+i)\,x(i)}{\left[\sum_{i=1}^{L-k} y^{2}(k+i)\,\sum_{i=1}^{L-k} x^{2}(i)\right]^{1/2}} \qquad (1)$$

$$r(k) = \frac{\sum_{i=1}^{L-k} y(mS_y+k+i)\,x(mS_x+i)}{\left[\sum_{i=1}^{L-k} y^{2}(mS_y+k+i)\,\sum_{i=1}^{L-k} x^{2}(mS_x+i)\right]^{1/2}}, \qquad k = 1, 2, \ldots, k_{\max} \qquad (2)$$
Here, the parameter k is the “lag” or offset or shift-value used in aligning one segment against the other. When r(k) is maximum, it is an indication that the two segments are optimally correlated, and the corresponding value of k serves as the alignment point between the two frames, as indicated in FIG. 2. The target frame is synthesized from the source frame such that it is approximately α times the length of the latter, thereby ensuring the proper time duration per frame. The equations for the normalized correlation coefficients used in this technique are shown above and the cross-fading process is shown in the drawing of FIG. 3. Equations (1) and (2) also implicitly indicate that the calculation of r(k) is usually implemented by a computational loop involving multiplications and additions of values in the overlap. Moreover, a second outer loop steps through the values of k from 1 to a predetermined maximum.
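The two nested loops described above can be sketched as follows. This is a minimal illustration of equation (2), not the implementation of any cited reference. The function names, the offsets `x_off` and `y_off` (standing in for mSx and mSy), and the toy data are all assumptions made for the example.

```python
import math
import random


def normalized_xcorr(x, y, k, L, x_off=0, y_off=0):
    """r(k) as in equation (2), with 1-based i mapped onto 0-based lists."""
    num = sxx = syy = 0.0
    for i in range(1, L - k + 1):        # inner loop over the overlap
        xv = x[x_off + i - 1]
        yv = y[y_off + k + i - 1]
        num += yv * xv
        sxx += xv * xv
        syy += yv * yv
    denom = math.sqrt(sxx * syy)
    return num / denom if denom > 0.0 else 0.0


def best_lag(x, y, L, k_max, x_off=0, y_off=0):
    """Outer loop: try every lag from 1 to k_max and keep the largest r(k)."""
    best_k, best_r = 0, -math.inf
    for k in range(1, k_max + 1):
        r = normalized_xcorr(x, y, k, L, x_off, y_off)
        if r > best_r:
            best_k, best_r = k, r
    return best_k, best_r


# Toy data (hypothetical): y is x delayed by 5 samples, so the peak of r(k)
# should fall at k = 5.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
y = [0.0] * 5 + x
k_star, r_star = best_lag(x, y, L=64, k_max=20)
```

Each call to `normalized_xcorr` costs on the order of L − k multiply-adds, and `best_lag` repeats it for every candidate lag.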
Because a high correlation indicates that the dominant frequencies present in the two frames are also well-correlated, this time-domain approach is both intuitive and technically persuasive. Subjective and objective studies have demonstrated that it produces good quality audio even at relatively high compression and expansion factors. However, it too is computationally intensive because, at high sampling rates, it requires the calculation of cross-correlation coefficients for many frames per second, with each frame containing hundreds of possible alignment points (shift-values) and, for each such point, the calculation of r(k) involving hundreds of additions, multiplications and divisions. At the standard CD sampling rate of 44.1 kHz, computing the values of r(k) alone requires tens of millions of arithmetic operations per second. This is a direct consequence of the definitions in equations (1) and (2).
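A rough operation count shows where the estimate of tens of millions of operations per second comes from. The sketch below rests on several assumptions that the text does not fix: a 20 ms frame, a search range of half the frame length, and a cost of one multiplication plus one addition for each term of the three sums in equation (2). The function name and parameters are hypothetical.

```python
def sola_ops_per_second(sample_rate_hz, frame_ms=20, lag_fraction=0.5):
    """Estimate arithmetic operations per second spent on r(k) alone."""
    L = sample_rate_hz * frame_ms // 1000      # samples per frame
    k_max = int(L * lag_fraction)              # assumed search range
    frames_per_second = 1000 // frame_ms
    # Three sums (numerator and two energies), each term costing one
    # multiplication and one addition: 6 operations per overlap sample.
    per_frame = sum(6 * (L - k) for k in range(1, k_max + 1))
    return frames_per_second * per_frame


ops_cd = sola_ops_per_second(44_100)
ops_double_rate = sola_ops_per_second(88_200)
```

Under these assumptions the count at 44.1 kHz lands in the tens of millions, and doubling the sampling rate roughly quadruples it, because both the overlap length and the number of lags grow with the rate.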
Significant improvements both in time and simplicity are described in [Wong et al.] and [Wilson et al., U.S. Pat. No. 6,173,255 B1]. In the approach given there, only the envelopes of the digital waveforms are used to calculate the modified cross-correlation coefficients. Since the computations involve only the signs of the signal values, the resulting formula for the modified r(k) is simplified, particularly with respect to the normalization factors (which reduce to a single division) and the option of replacing multiplications in the equations (1) and (2) by an XOR operation. The modified expressions for frame m+1 are shown as Equation (3) below. This technique is called “envelope matching” (EM) in [Wong et al.] or “1-bit correlation” [Wilson et al., U.S. Pat. No. 6,173,255 B1].
$$r(k) = \frac{\sum_{i=1}^{L-k} \operatorname{sign}\bigl(y(mS_y+k+i)\bigr)\,\operatorname{sign}\bigl(x(mS_x+i)\bigr)}{L-k} = \frac{\sum \operatorname{XOR}\bigl(\operatorname{sign}(y(mS_y+k+i)),\ \sim\operatorname{sign}(x(mS_x+i))\bigr)}{L-k}, \qquad k = 1, 2, \ldots, k_{\max} \qquad (3)$$
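The two forms in Equation (3) can be compared with a short sketch. It assumes a sign convention that the text does not specify: negative samples map to −1 (or bit 1) and all others to +1 (or bit 0). With ±1 signs the first form ranges over [−1, 1], while the XOR form counts agreeing positions and ranges over [0, 1]. The two are related by r_product = 2·r_xor − 1, so they rank the lags identically. Function names and toy data are illustrative.

```python
import random


def sign_pm(v):
    """Sign as +1 or -1 (zero treated as positive; an assumed convention)."""
    return -1 if v < 0 else 1


def sign_corr_product(x, y, k, L, x_off=0, y_off=0):
    """First form of Equation (3): mean product of the signs."""
    n = L - k
    total = sum(sign_pm(y[y_off + k + i - 1]) * sign_pm(x[x_off + i - 1])
                for i in range(1, n + 1))
    return total / n


def sign_corr_xor(x, y, k, L, x_off=0, y_off=0):
    """Second form: fraction of positions where XOR(by, NOT bx) is 1."""
    n = L - k
    agree = 0
    for i in range(1, n + 1):
        bx = 1 if x[x_off + i - 1] < 0 else 0
        by = 1 if y[y_off + k + i - 1] < 0 else 0
        agree += by ^ (bx ^ 1)           # 1 when the two sign bits match
    return agree / n


# Toy data (hypothetical): y is x delayed by 5 samples.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
y = [0.0] * 5 + x
lags = range(1, 21)
best_by_product = max(lags, key=lambda k: sign_corr_product(x, y, k, 64))
best_by_xor = max(lags, key=lambda k: sign_corr_xor(x, y, k, 64))
```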
In [Wong et al.] it was also pointed out that the zero-crossings of both the source and target signals were critical for achieving even greater computational savings.
In addition, [Wong et al.] provide formulas for the recursive calculation of r(k) and related results. These ideas, however, depend on first finding the zero-crossings of both the source and target files, merging and sorting them and determining the set of zero-crossing points that are not common to both. Then this set must be updated for each k. This task itself can be computationally complex. If, for example, the signal consists of two stereo channels that have been digitized at 44.1 kHz, and if even ⅕ of the Nyquist frequency is present (i.e., approximately 4400 Hz), the number of zero crossings per second per channel may number in the thousands. Since the target signal attempts to reproduce the same frequencies, it will have approximately the same number of zero-crossings per unit of time. Thus, to produce, say, one-half second of processed audio from one second of the source file would involve (by rough approximation) sorting sets with a total of 8800 × 4400 elements per second of source audio, prior to calculation of the correlation coefficients themselves. This places a significant burden on the processor, especially when operating in real-time in an inexpensive digital player.
In [Wilson et al., U.S. Pat. No. 6,173,255 B1] an innovation is taught wherein the signs of the signal values are packed as individual bits into machine words and the computation of r(k) is performed using the XOR operation on pairs of such words, one element of the pair from the source signal, the other from the target. This method avoids ordinary multiplication and has the advantage that a single word-wide operation replaces as many as 16, 32 or 64 logical operations that would otherwise be performed serially, depending on machine word size. However, the method still requires that the number of ones or zeros generated by each XOR operation be counted, and that the bits be packed appropriately. The method also teaches that all the r(k) be calculated in this manner for every k in order to determine the maximum, and the normalization factor must be part of the calculation for a correct comparison.
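The packing idea can be sketched as follows. This is an illustration only, not the patented implementation: a single arbitrary-precision Python integer stands in for an array of 16-, 32- or 64-bit machine words, and the function names and bit ordering are assumptions.

```python
import random


def pack_signs(samples):
    """Pack sign bits into one integer; bit j is 1 when samples[j] < 0."""
    word = 0
    for j, v in enumerate(samples):
        if v < 0:
            word |= 1 << j
    return word


def packed_sign_corr(x_bits, y_bits, k, L):
    """Fraction of agreeing signs over the overlap of length L - k."""
    n = L - k
    mask = (1 << n) - 1                       # keep only the overlap bits
    # Shifting y right by k aligns y(k + i) with x(i); XOR marks mismatches.
    mismatches = bin((x_bits ^ (y_bits >> k)) & mask).count("1")
    return (n - mismatches) / n               # normalization by L - k


# Toy data (hypothetical): y is x delayed by 5 samples.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
y = [0.0] * 5 + x
x_bits = pack_signs(x)
y_bits = pack_signs(y)
best_k = max(range(1, 21), key=lambda k: packed_sign_corr(x_bits, y_bits, k, 64))
```

Even in this form, the bit count and the division by L − k remain for every lag, which is the residual cost noted above.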
In [Bialick, U.S. Pat. No. 4,864,620], a method is described which uses the Average Magnitude Difference Function to calculate correlation coefficients for the SOLA method. The chief advantage of this method is that multiplications are not required. However, normalization in order to directly compare r(k) for different k is still needed, and so is the full calculation of r(k) for each k.
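For comparison, a generic Average Magnitude Difference Function can be sketched as below. This is the textbook form of the AMDF and is not claimed to match the exact formulation of the cited patent. The names and toy data are illustrative. Note that the best alignment is the lag with the smallest value, and that the division by the overlap length is still present.

```python
import random


def amdf(x, y, k, L, x_off=0, y_off=0):
    """Mean absolute difference over the overlap; no multiplications."""
    n = L - k
    total = 0.0
    for i in range(1, n + 1):
        total += abs(y[y_off + k + i - 1] - x[x_off + i - 1])
    return total / n                     # normalization so lags are comparable


def best_lag_amdf(x, y, L, k_max):
    """Full search: evaluate every lag and keep the minimum."""
    return min(range(1, k_max + 1), key=lambda k: amdf(x, y, k, L))


# Toy data (hypothetical): y is x delayed by 5 samples.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
y = [0.0] * 5 + x
k_amdf = best_lag_amdf(x, y, L=64, k_max=20)
```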
In [Patent Application 2005/0038534 A1 (Sakurai)], a method similar to that of [Wong et al.] is taught, with the additional feature that the interval over which the correlation coefficients are computed is independent of k and therefore no normalization is required. The claims involve in part an avoidance of normalization and an additional speed-up factor of approximately two because the interval of calculation of r(k) is only half the nominal length. (A practitioner in the field might observe that the reduction in computation due to this smaller “cross-correlation buffer” is in fact not as great as claimed, because the more usual approach uses a decreasing overlap as k increases, so the average overlap length, which is the determining factor here, is comparable in the two cases.) Here, too, r(k) is calculated for all the k in the range specified. This can vary from, say, 80 k's for 8 kHz sampling to as many as 800 or more for DVD quality sampling. The precise number depends on the implementation and audio considerations.
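The parenthetical remark about average overlap can be made concrete with a small calculation. The frame length of 882 samples and the two search ranges below are assumptions chosen for illustration; the function name is hypothetical.

```python
def mean_decreasing_overlap(L, k_max):
    """Average number of terms in r(k) when the overlap shrinks as L - k."""
    return sum(L - k for k in range(1, k_max + 1)) / k_max


L = 882                      # assumed frame length (20 ms at 44.1 kHz)
fixed_window = L // 2        # fixed interval of half the nominal length

# Average overlap when the search covers nearly the whole frame.
avg_full_search = mean_decreasing_overlap(L, k_max=L - 1)

# Average overlap when the search covers half the frame.
avg_half_search = mean_decreasing_overlap(L, k_max=L // 2)
```

With these numbers the fixed window has 441 terms, while the decreasing overlap averages 441 terms for the wide search and 661 for the narrower one, so the saving is between none and a factor of about 1.5 rather than two.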
In [Patent Application 2005/0038534 A1, W. Y. Choi], a method based on [Roucos, 1985] is described. The innovations taught are essentially two: the method skips some of the k's when computing the r(k), and for each r(k), the method uses a reduced subset of the sample values. No data are presented to justify the two modifications in terms of audio quality, although it is stated that the errors introduced are ignorable. Moreover, for those r(k) that are computed, full calculation and normalization is taught in the form of equation (2).
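The two shortcuts described above can be sketched in a few lines. The step sizes `k_step` and `sample_step`, the function names, and the toy data are hypothetical, and the sketch is not taken from the cited application. It simply visits every `k_step`-th lag and, for each visited lag, every `sample_step`-th sample of the overlap, while keeping the full normalization of equation (2).

```python
import math
import random


def decimated_xcorr(x, y, k, L, sample_step=2):
    """Equation (2) evaluated on a reduced subset of the overlap samples."""
    num = sxx = syy = 0.0
    for i in range(1, L - k + 1, sample_step):
        xv = x[i - 1]
        yv = y[k + i - 1]
        num += yv * xv
        sxx += xv * xv
        syy += yv * yv
    denom = math.sqrt(sxx * syy)
    return num / denom if denom > 0.0 else 0.0


def coarse_best_lag(x, y, L, k_max, k_step=2, sample_step=2):
    """Search only lags 1, 1 + k_step, 1 + 2*k_step, ... up to k_max."""
    candidates = range(1, k_max + 1, k_step)
    return max(candidates, key=lambda k: decimated_xcorr(x, y, k, L, sample_step))


# Toy data (hypothetical): delays of 5 and 6 samples.
rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(64)]
y_delay5 = [0.0] * 5 + x
y_delay6 = [0.0] * 6 + x
found5 = coarse_best_lag(x, y_delay5, L=64, k_max=20)
found6 = coarse_best_lag(x, y_delay6, L=64, k_max=20)
```

With `k_step=2` only odd lags are visited, so a true lag of 5 is found but a true lag of 6 cannot be returned; this is the kind of error whose audible effect the text notes is left unquantified.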
While these innovations have increased computational efficiency, the need for even faster methods has been driven by the rising standards for recordings on various media. For example, the standard for music CDs is 44.1 kHz per stereo channel and the standard for DVD recordings is 96 kHz per channel. Even monophonic speech is now routinely recorded at these rates, rather than at the much lower rates of twenty years ago. The equations (1), (2) and (3) above show that both of the two computational loops involved for each frame grow in rough proportion to the sampling rate, resulting in overall growth in computation as the square of the sampling rate. Thus, while innovation has been lively in the area of TSM for the past twenty-five years, the need for even more efficient methods remains. This is particularly true with the introduction of handheld digital audio and video players that run on small capacity batteries and therefore incorporate low-power processors without floating-point arithmetic units in hardware. Consequently, their performance does not approach that of desktop or laptop computers, yet their tasks typically have real-time performance requirements. What are needed are methods, computer readable media and computer systems for a faster and practical approach to time-scale modification of digital signals.