The present invention is related to audio coding and, in particular, to parameterized audio coding schemes, which are applied in vocoders.
One class of vocoders is phase vocoders. A tutorial on phase vocoders is the publication “The Phase Vocoder: A Tutorial”, Mark Dolson, Computer Music Journal, Volume 10, No. 4, pages 14 to 27, 1986. An additional publication is “New phase vocoder techniques for pitch-shifting, harmonizing and other exotic effects”, J. Laroche and M. Dolson, Proceedings of the 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y., Oct. 17 to 20, 1999, pages 91 to 94.
FIGS. 5 to 6 illustrate different implementations and applications for a phase vocoder. FIG. 5 illustrates a filter bank implementation of a phase vocoder, in which an audio signal is provided at an input 500, and where, at an output 510, a synthesized audio signal is obtained. Specifically, each channel of the filter bank illustrated in FIG. 5 comprises a band pass filter 501 and a subsequently connected oscillator 502. Output signals of all oscillators 502 from all channels are combined via a combiner 503, which is illustrated as an adder. At the output of the combiner 503, the output signal 510 is obtained.
Each filter 501 is implemented to provide, on the one hand, an amplitude signal A(t), and on the other hand, the frequency signal f(t). The amplitude signal and the frequency signal are time signals. The amplitude signal illustrates a development of the amplitude within a filter band over time and the frequency signal illustrates the development of the frequency of a filter output signal over time.
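The synthesis side of FIG. 5 can be sketched as a bank of oscillators whose outputs are summed. The following is a minimal sketch under stated assumptions, not the implementation of the figure: the function name `oscillator_bank`, the sampled control signals, and the cosine oscillator with phase accumulation are all chosen for illustration.

```python
import math

def oscillator_bank(amplitudes, frequencies, sample_rate):
    """Sum one oscillator per channel, driven by A(t) and f(t).

    amplitudes[k][n] and frequencies[k][n] hold the amplitude and the
    frequency in Hz of channel k at sample n (assumed data layout).
    """
    num_samples = len(amplitudes[0])
    output = [0.0] * num_samples
    for amp, freq in zip(amplitudes, frequencies):
        phase = 0.0
        for n in range(num_samples):
            output[n] += amp[n] * math.cos(phase)
            # Integrate the instantaneous frequency to advance the phase.
            phase += 2.0 * math.pi * freq[n] / sample_rate
    return output

# Two channels with constant controls: 100 Hz at amplitude 1.0, 200 Hz at 0.5.
fs = 8000
num = 80
out = oscillator_bank([[1.0] * num, [0.5] * num],
                      [[100.0] * num, [200.0] * num], fs)
```

With time-varying entries in `amplitudes` and `frequencies`, the same loop follows the per-channel amplitude and frequency trajectories delivered by the filters 501.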
A schematic implementation of a filter 501 is illustrated in FIG. 6. The incoming signal is routed into two parallel paths. In one path, the signal is multiplied by a sine wave with an amplitude of 1.0 and a frequency equal to the center frequency of the band pass filter, as illustrated at 551. In the other path, the signal is multiplied by a cosine wave of the same amplitude and frequency as illustrated at 551. Thus, the two parallel paths are identical except for the phase of the multiplying waveform. Then, in each path, the result of the multiplication is fed into a low pass filter 553. The multiplication operation itself is also known as simple ring modulation. Multiplying any signal by a sine (or cosine) wave of constant frequency has the effect of simultaneously shifting all the frequency components in the original signal by both plus and minus the frequency of the sine wave. If this result is now passed through an appropriate low pass filter, only the low frequency portion will remain. This sequence of operations is also known as heterodyning. Heterodyning is performed in each of the two parallel paths, but since one path heterodynes with a sine wave while the other path uses a cosine wave, the resulting heterodyned signals in the two paths are out of phase by 90°. The upper low pass filter 553, therefore, provides a quadrature signal 554 and the lower filter 553 provides an in-phase signal. These two signals, which are also known as I and Q signals, are forwarded into a coordinate transformer 556, which generates a magnitude/phase representation from the rectangular representation.
The amplitude signal is output at 557 and corresponds to A(t) from FIG. 5. The phase signal is input into a phase unwrapper 558. The phase value at the output of element 558 is no longer confined to the range between 0 and 360° but increases in a linear way. This “unwrapped” phase value is input into a phase/frequency converter 559 which may, for example, be implemented as a phase-difference device which subtracts the phase at a preceding time instant from the phase at a current time instant in order to obtain the frequency value for the current time instant.
This frequency value is added to a constant frequency value fi of the filter channel i, in order to obtain a time-varying frequency value at an output 560.
The frequency value at the output 560 has a DC portion fi and a changing portion, which is also known as the “frequency fluctuation”, by which a current frequency of the signal in the filter channel deviates from the center frequency fi.
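The analysis chain of FIG. 6 (heterodyning, low pass filtering, coordinate transformation, phase unwrapping, phase differencing, and addition of the center frequency) can be sketched as follows. This is an illustrative sketch only: the low pass filter is assumed to be a moving average applied twice, and the names `moving_average`, `unwrap` and `analyze_channel` are hypothetical, not taken from the source.

```python
import math

def moving_average(x, length):
    """Simple FIR low pass: mean over a sliding window (valid samples only)."""
    return [sum(x[n:n + length]) / length for n in range(len(x) - length + 1)]

def unwrap(phases):
    """Remove jumps of 2*pi so that the phase evolves continuously."""
    out = [phases[0]]
    for p in phases[1:]:
        step = p - out[-1]
        step -= 2.0 * math.pi * round(step / (2.0 * math.pi))
        out.append(out[-1] + step)
    return out

def analyze_channel(x, center_freq, sample_rate, lp_length):
    """Return (amplitude, frequency) trajectories for one filter channel."""
    w = 2.0 * math.pi * center_freq / sample_rate
    q_path = [s * math.sin(w * n) for n, s in enumerate(x)]  # sine path
    i_path = [s * math.cos(w * n) for n, s in enumerate(x)]  # cosine path
    for _ in range(2):  # assumed low pass: moving average applied twice
        q_path = moving_average(q_path, lp_length)
        i_path = moving_average(i_path, lp_length)
    # Rectangular (I, Q) to magnitude/phase; the factor 2 restores the
    # amplitude that is halved by the heterodyning product.
    amplitude = [2.0 * math.hypot(i, q) for i, q in zip(i_path, q_path)]
    phase = unwrap([math.atan2(-q, i) for i, q in zip(i_path, q_path)])
    # Phase difference between successive samples gives the deviation in Hz.
    deviation = [(b - a) * sample_rate / (2.0 * math.pi)
                 for a, b in zip(phase, phase[1:])]
    frequency = [center_freq + d for d in deviation]
    return amplitude, frequency

# A 1010 Hz tone analyzed in a channel centered at 1000 Hz.
fs = 8000
tone = [math.cos(2.0 * math.pi * 1010.0 * n / fs) for n in range(800)]
amplitude, frequency = analyze_channel(tone, 1000.0, fs, 80)
```

In this example the frequency trajectory stays close to 1010 Hz, i.e., the center frequency of 1000 Hz plus a frequency fluctuation of about 10 Hz, while the amplitude is slightly below 1.0 because the assumed low pass filter is not perfectly flat.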
Thus, the phase vocoder as illustrated in FIG. 5 and FIG. 6 provides a separation of spectral information and time information. The spectral information is contained in the location of the specific filter bank channel at frequency fi, and the time information is contained in the frequency fluctuation and in the magnitude over time.
Another description of the phase vocoder is the Fourier transform interpretation. It consists of a succession of overlapping Fourier transforms taken over finite-duration windows in time. In the Fourier transform interpretation, attention is focused on the magnitude and phase values for all of the different filter bands or frequency bins at a single point in time. While in the filter bank interpretation the re-synthesis can be seen as a classic example of additive synthesis with time-varying amplitude and frequency controls for each oscillator, the synthesis in the Fourier implementation is accomplished by converting back to real-and-imaginary form and overlap-adding the successive inverse Fourier transforms. In the Fourier interpretation, the number of filter bands in the phase vocoder is the number of frequency points in the Fourier transform. Similarly, the equal spacing in frequency of the individual filters can be recognized as the fundamental feature of the Fourier transform. On the other hand, the shape of the filter pass bands, i.e., the steepness of the cutoff at the band edges, is determined by the shape of the window function which is applied prior to calculating the transform. For a particular characteristic shape, e.g., a Hamming window, the steepness of the filter cutoff increases in direct proportion to the duration of the window.
It is useful to see that the two different interpretations of the phase vocoder analysis apply only to the implementation of the bank of band pass filters. The operation by which the outputs of these filters are expressed as time-varying amplitudes and frequencies is the same for both implementations. The basic goal of the phase vocoder is to separate temporal information from spectral information. The operative strategy is to divide the signal into a number of spectral bands and to characterize the time-varying signal in each band.
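The Fourier transform interpretation can be sketched as a windowed transform per frame, conversion to magnitude and phase, conversion back to real-and-imaginary form, and overlap-add of the inverse transforms. The sketch below makes several assumptions for illustration: a naive DFT instead of an FFT, a periodic Hann window with 50% overlap (so that the shifted windows sum to one), and the hypothetical name `stft_roundtrip`.

```python
import cmath
import math

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def stft_roundtrip(x, size, hop):
    """Analyze x frame by frame and resynthesize it by overlap-add."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * t / size)
              for t in range(size)]
    y = [0.0] * len(x)
    for start in range(0, len(x) - size + 1, hop):
        frame = [x[start + t] * window[t] for t in range(size)]
        # Magnitude and phase of every frequency bin at this point in time.
        mag_phase = [(abs(c), cmath.phase(c)) for c in dft(frame)]
        # A vocoder effect would modify mag_phase here; this sketch does not.
        rebuilt = [cmath.rect(m, p) for m, p in mag_phase]
        for t, v in enumerate(idft(rebuilt)):
            y[start + t] += v  # overlap-add of successive inverse transforms
    return y

x = [math.sin(0.3 * n) + 0.5 * math.cos(1.1 * n) for n in range(64)]
y = stft_roundtrip(x, 16, 8)
```

Without any modification between analysis and synthesis, the interior of `y` (where two frames overlap) reproduces `x`; only the first and last half frame are attenuated by the window.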
Two basic operations are particularly significant. These operations are time scaling and pitch transposition. It is possible to slow down a recorded sound simply by playing it back at a lower sample rate. This is analogous to playing a tape recording at a lower playback speed. However, this kind of simplistic time expansion simultaneously lowers the pitch by the same factor as the time expansion. Slowing down the temporal evolution of a sound without altering its pitch necessitates an explicit separation of temporal and spectral information. As noted above, this is precisely what the phase vocoder attempts to do. Stretching out the time-varying amplitude and frequency signals A(t) and f(t) of FIG. 5 does not change the frequency of the individual oscillators at all, but it does slow down the temporal evolution of the composite sound. The result is a time-expanded sound with the original pitch. In the Fourier transform view of time scaling, a sound is time-expanded by simply spacing the inverse FFTs further apart than the analysis FFTs. As a result, spectral changes occur more slowly in the synthesized sound than in the original, and the phase is rescaled by precisely the same factor by which the sound is being time-expanded.
The other application is pitch transposition. Since the phase vocoder can be used to change the temporal evolution of a sound without changing its pitch, it should also be possible to do the reverse, i.e., to change the pitch without changing the duration. This is done either by time-scaling with the desired pitch-change factor and then playing the resulting sound back at the wrong sample rate, or by down-sampling by a desired factor and playing back at an unchanged rate. For example, to raise the pitch by an octave, the sound is first time-expanded by a factor of 2 and the time-expansion is then played at twice the original sample rate.
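Time expansion in the filter bank view can be sketched by stretching the control signals A(t) and f(t) and feeding them to the oscillator. The sketch below is a minimal single-channel illustration; linear interpolation and the names `stretch` and `oscillator` are assumptions, not taken from the source.

```python
import math

def stretch(control, factor):
    """Linearly interpolate a control signal to `factor` times its length."""
    new_len = int(round(len(control) * factor))
    out = []
    for m in range(new_len):
        pos = min(m / factor, len(control) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(control) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * control[lo] + frac * control[hi])
    return out

def oscillator(amplitude, frequency, sample_rate):
    """One oscillator channel driven by sampled A(t) and f(t)."""
    out, phase = [], 0.0
    for a, f in zip(amplitude, frequency):
        out.append(a * math.cos(phase))
        phase += 2.0 * math.pi * f / sample_rate
    return out

fs = 8000
amp = [1.0 - n / 400 for n in range(400)]  # decaying amplitude A(t)
freq = [440.0] * 400                       # constant frequency f(t) in Hz
original = oscillator(amp, freq, fs)
slow_amp = stretch(amp, 2.0)
slow_freq = stretch(freq, 2.0)
slowed = oscillator(slow_amp, slow_freq, fs)
```

The stretched frequency control still reads 440 Hz everywhere, so the oscillator pitch is unchanged, while the amplitude decay now takes twice as many samples.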
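The octave example can be checked numerically. In the sketch below, a directly synthesized 220 Hz tone of doubled length stands in for the time-expanded vocoder output (an assumption made to keep the example short), and keeping every second sample corresponds to down-sampling by 2 with playback at the unchanged rate. No anti-aliasing filter is applied, which is acceptable here only because the tone lies far below half the sample rate.

```python
import math

def count_zero_crossings(x):
    """Number of sign changes between successive samples."""
    return sum(1 for a, b in zip(x, x[1:]) if (a < 0.0) != (b < 0.0))

fs = 8000
num = 4000  # original duration: 0.5 s

# Stand-in for a sound that was time-expanded by a factor of 2 (pitch 220 Hz).
expanded = [math.sin(2.0 * math.pi * 220.0 * t / fs + 0.1)
            for t in range(2 * num)]

# Keep every second sample and play back at the unchanged rate fs.
transposed = expanded[::2]

# Two zero crossings per period give a rough frequency estimate in Hz.
estimate = count_zero_crossings(transposed) * fs / (2.0 * len(transposed))
```

The result has the original duration of `num` samples, and `estimate` lies close to 440 Hz, one octave above the 220 Hz of the time-expanded signal.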
The vocoder (or ‘VODER’) was invented by Dudley as a manually operated synthesizer device for generating human speech [2]. Some considerable time later, the principle of its operation was extended towards the so-called phase vocoder [3] [4]. The phase vocoder operates on overlapping short-time DFT spectra and hence on a set of sub band filters with fixed center frequencies. The vocoder has found wide acceptance as an underlying principle for manipulating audio files. For instance, audio effects like time-stretching and pitch transposing are easily accomplished by a vocoder [5]. Since then, many modifications and improvements to this technology have been published. Specifically, the constraint of having fixed-frequency analysis filters was dropped by adding a fundamental frequency (‘f0’) derived mapping, for example in the ‘STRAIGHT’ vocoder [6]. Still, the prevalent use case remained speech coding/processing.
Another area of interest for the audio processing community has been the decomposition of speech signals into modulated components. Each component consists of a carrier, an amplitude modulation (AM) and a frequency modulation (FM) part of some sort. A signal adaptive way of such decomposition was published e.g. in [7] suggesting the use of a set of signal adaptive band pass filters. In [8] an approach that utilizes AM information in combination with a ‘sinusoids plus noise’ parametric coder was presented. Another decomposition method was published in [9] using the so-called ‘FAME’ strategy: here, speech signals have been decomposed into four bands using band pass filters in order to subsequently extract their AM and FM content. Most recent publications also aim at reproducing audio signals from AM information (sub band envelopes) alone and suggest iterative methods for recovery of the associated phase information which predominantly contains the FM [10].
Our approach presented herein targets the processing of general audio signals, hence also including music. It is similar to a phase vocoder but modified in order to perform a signal-dependent, perceptually motivated sub band decomposition into a set of sub band carrier frequencies, each with associated AM and FM signals. We would like to point out that this decomposition is perceptually meaningful and that its elements are interpretable in a straightforward way, so that all kinds of modulation processing on the components of the decomposition become feasible.
To achieve the goal stated above, we rely on the observation that perceptually similar signals exist. A sufficiently narrow-band tonal band pass signal is perceptually well represented by a sinusoidal carrier at its spectral ‘center of gravity’ (COG) position and its Hilbert envelope. This is rooted in the fact that both signals approximately evoke the same movement of the basilar membrane in the human ear [11]. A simple example to illustrate this is the two-tone complex (1) with frequencies $f_1$ and $f_2$ sufficiently close to each other so that they perceptually fuse into one (over-)modulated component:

$$ s_t(t) = \sin(2\pi f_1 t) + \sin(2\pi f_2 t) \qquad (1) $$
A signal consisting of a sinusoidal carrier at a frequency equal to the spectral COG of $s_t$ and having the same absolute amplitude envelope as $s_t$ is $s_m$ according to (2):

$$ s_m(t) = 2 \sin\left(2\pi \frac{f_1 + f_2}{2}\, t\right) \cdot \left|\cos\left(2\pi \frac{f_1 - f_2}{2}\, t\right)\right| \qquad (2) $$
In FIG. 9b (top and middle plot) the time signal and the Hilbert envelope of both signals are depicted. Note the phase jump of π in the first signal at zeros of the envelope as opposed to the second signal. FIG. 9a displays the power spectral density plots of the two signals (top and middle plot).
Although these signals differ considerably in their spectral content, their predominant perceptual cues, namely the ‘mean’ frequency represented by the COG and the amplitude envelope, are similar. This makes them perceptual substitutes for each other with respect to a band-limited spectral region centered at the COG, as depicted in FIG. 9a and FIG. 9b (bottom plots). The same principle still holds approximately for more complicated signals.
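The relationship between the two-tone complex of (1) and the modulated carrier of (2) can be verified numerically. The sketch below uses assumed example values (f1 = 1000 Hz, f2 = 1040 Hz, 16 kHz sampling) and checks that both signals share the same absolute amplitude at every sample while differing in sign wherever the cosine factor is negative, which corresponds to the phase jump of π noted for the first signal.

```python
import math

F1, F2, FS = 1000.0, 1040.0, 16000  # assumed example values

def s_t(t):
    """Two-tone complex, equation (1)."""
    return math.sin(2.0 * math.pi * F1 * t) + math.sin(2.0 * math.pi * F2 * t)

def s_m(t):
    """Carrier at the COG (F1 + F2) / 2 with the absolute envelope, equation (2)."""
    carrier = math.sin(2.0 * math.pi * 0.5 * (F1 + F2) * t)
    envelope = abs(math.cos(2.0 * math.pi * 0.5 * (F1 - F2) * t))
    return 2.0 * carrier * envelope

times = [n / FS for n in range(1600)]  # 100 ms
tones = [s_t(t) for t in times]
modulated = [s_m(t) for t in times]

# Largest difference between the absolute values of the two signals.
max_abs_mismatch = max(abs(abs(a) - abs(b)) for a, b in zip(tones, modulated))
```

Because sin(a) + sin(b) = 2 sin((a+b)/2) cos((a−b)/2), the two signals coincide while the cosine factor is positive and are sign-inverted while it is negative, so `max_abs_mismatch` is zero up to rounding.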
Generally, modulation analysis/synthesis systems that decompose a wide-band signal into a set of components each comprising carrier, amplitude modulation and frequency modulation information have many degrees of freedom since, in general, this task is an ill-posed problem. Methods that modify subband magnitude envelopes of complex audio spectra and subsequently recombine them with their unmodified phases for re-synthesis do result in artifacts, since these procedures do not pay attention to the final receiver of the sound, i.e., the human ear.
Furthermore, applying very long FFTs, i.e., very long windows, in order to obtain a fine frequency resolution concurrently reduces the time resolution. Transient signals, on the other hand, do not require a high frequency resolution but do necessitate a high time resolution, since, at a certain time instant, the band pass signals exhibit strong mutual correlation, which is also known as the “vertical coherence”. In this terminology, one imagines a time-frequency spectrogram plot in which the time variable is on the horizontal axis and the frequency variable is on the vertical axis. Processing transient signals with a very high frequency resolution will, therefore, result in a low time resolution, which at the same time means an almost complete loss of the vertical coherence. Again, the ultimate receiver of the sound, i.e., the human ear, is not considered in such a model.
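The trade-off between frequency resolution and time resolution follows directly from the window length. The sketch below uses assumed example values (48 kHz sampling, window lengths of 8192 and 256 samples) and the hypothetical helper `resolution`.

```python
def resolution(window_length, sample_rate):
    """Return (bin spacing in Hz, window duration in seconds)."""
    return sample_rate / window_length, window_length / sample_rate

fs = 48000
long_df, long_dt = resolution(8192, fs)   # fine in frequency, coarse in time
short_df, short_dt = resolution(256, fs)  # coarse in frequency, fine in time
```

The product of bin spacing and window duration is always one, so any gain in frequency resolution is paid for with a proportional loss in time resolution.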
The publication [22] discloses an analysis methodology for extracting accurate sinusoidal parameters from audio signals. The method combines modified vocoder parameter estimation with currently used peak detection algorithms in sinusoidal modeling. The system processes the input frame by frame, searches for peaks like a sinusoidal analysis model, but also dynamically selects vocoder channels through which smeared peaks in the FFT domain are processed. This way, frequency trajectories of sinusoids of changing frequency within a frame may be accurately parameterized. In a spectral parsing step, peaks and valleys in the magnitude FFT are identified. In a peak isolation step, the spectrum is set to zero outside the peak of interest and both the positive and negative frequency versions of the peak are retained. Then, the Hilbert transform of this spectrum is calculated and, subsequently, the IFFTs of the original and the Hilbert transformed spectra are calculated to obtain two time domain signals, which are 90° out of phase with each other. The signals are used to get the analytic signal used in vocoder analysis. Spurious peaks can be detected and will later be modeled as noise or will be excluded from the model.
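The peak isolation and Hilbert transform steps described above can be sketched as follows. This is not the implementation of [22]: the naive DFT, the fixed bin range passed to the hypothetical `isolate_peak`, and the test frame with two sinusoids on exact DFT bins are all assumptions made for illustration.

```python
import cmath
import math

def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                for t in range(n)) for k in range(n)]

def idft_real(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

def isolate_peak(spectrum, low_bin, high_bin):
    """Zero everything except bins low_bin..high_bin and their mirror images."""
    n = len(spectrum)
    keep = set(range(low_bin, high_bin + 1))
    keep |= {(n - k) % n for k in keep}  # negative frequency version
    return [c if k in keep else 0.0 for k, c in enumerate(spectrum)]

def hilbert_spectrum(spectrum):
    """Rotate positive frequencies by -90 degrees and negative ones by +90."""
    n = len(spectrum)
    out = []
    for k, c in enumerate(spectrum):
        if k == 0 or 2 * k == n:
            out.append(0.0)  # DC and Nyquist have no quadrature part
        elif 2 * k < n:
            out.append(-1j * c)
        else:
            out.append(1j * c)
    return out

size = 64
frame = [math.cos(2.0 * math.pi * 5 * t / size)
         + 0.5 * math.cos(2.0 * math.pi * 12 * t / size) for t in range(size)]
peak = isolate_peak(dft(frame), 4, 6)             # keep the peak at bin 5
in_phase = idft_real(peak)                        # isolated sinusoid
quadrature = idft_real(hilbert_spectrum(peak))    # 90 degrees out of phase
envelope = [math.hypot(a, b) for a, b in zip(in_phase, quadrature)]
```

The pair `in_phase` and `quadrature` forms the real and imaginary part of the analytic signal of the isolated peak; here its magnitude is the constant amplitude 1.0 of the sinusoid at bin 5, unaffected by the second sinusoid at bin 12.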
Again, perceptual criteria such as the varying band width of the human ear over the spectrum, i.e., a small band width in the lower part of the spectrum and a higher band width in the upper part of the spectrum, are not accounted for. Furthermore, a significant feature of the human ear is that, as discussed in connection with FIGS. 9a, 9b and 9c, the human ear combines sinusoidal tones within a band width corresponding to the critical band width of the human ear, so that a human being does not hear two stable tones having a small frequency difference but perceives one tone having a varying amplitude, where the frequency of this tone is positioned between the frequencies of the original tones. This effect becomes more pronounced as the critical band width of the human ear increases.
Furthermore, the positioning of the critical bands in the spectrum is not constant, but is signal-dependent. Psychoacoustic research has found that the human ear dynamically selects the center frequencies of the critical bands depending on the spectrum. When, for example, the human ear perceives a loud tone, a critical band is centered around this loud tone. When, later, a loud tone is perceived at a different frequency, the human ear positions a critical band around this different frequency. Human perception is therefore not only signal-adaptive over time but also has filters with a high spectral resolution in the low frequency portion and a low spectral resolution, i.e., a high band width, in the upper part of the spectrum.
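The growth of the critical band width with frequency can be illustrated with a published approximation. The source does not name a formula; the sketch below assumes the analytical expression of Zwicker and Terhardt (1980), and the helper `within_one_critical_band` is a hypothetical, simplified test of whether two tones lie closer together than the critical band width at their mean frequency.

```python
def critical_bandwidth_hz(freq_hz):
    """Critical band width in Hz (Zwicker and Terhardt approximation, assumed)."""
    khz = freq_hz / 1000.0
    return 25.0 + 75.0 * (1.0 + 1.4 * khz ** 2) ** 0.69

def within_one_critical_band(f1_hz, f2_hz):
    """True if the tone spacing is below the band width at the mean frequency."""
    center = 0.5 * (f1_hz + f2_hz)
    return abs(f1_hz - f2_hz) < critical_bandwidth_hz(center)

low_band = critical_bandwidth_hz(100.0)
mid_band = critical_bandwidth_hz(1000.0)
high_band = critical_bandwidth_hz(10000.0)
close_pair = within_one_critical_band(1000.0, 1040.0)
wide_pair = within_one_critical_band(1000.0, 1500.0)
```

Under this approximation the band width grows from roughly 100 Hz at low frequencies to more than 2 kHz at 10 kHz, so a fixed tone spacing that is resolved in the lower part of the spectrum can fall within a single band in the upper part.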