Embodiments according to the invention relate to an apparatus, a method and a computer program for manipulating an audio signal comprising a transient event.
In the following, typical application scenarios will be described, in which embodiments according to the invention may be applied.
In current audio signal processing systems, audio signals are often processed using digital techniques. Specific signal portions such as transients, for example, place special requirements upon digital signal processing.
Transient events (or “transients”) are events in a signal during which the energy of the signal, in the whole band or in a certain frequency range, changes rapidly, i.e., increases or decreases rapidly. Characteristic features of specific transients (transient events) can be found in the distribution of signal energy in the spectrum. Typically, the energy of the audio signal during a transient event is distributed over the whole frequency range, while in non-transient signal portions the energy is normally concentrated in a low frequency portion of the audio signal or in one or more specific bands. This means that a non-transient signal portion, which is also called a stationary or “tonal” signal portion, has a non-flat spectrum. In other words, the energy of such a signal portion is contained in a comparatively small number of spectral lines or spectral bands, which are strongly emphasized over a noise floor of the audio signal. In a transient portion, however, the energy of the audio signal is distributed over many different frequency bands and, specifically, extends into a high frequency portion, so that the spectrum of the transient portion is comparatively flat and is typically flatter than the spectrum of a tonal portion of the audio signal. Also, the spectrum of the transient signal portion is typically chaotic and “non-predictable” (for example, even when the spectrum of a signal portion preceding the transient signal portion is known). Nevertheless, it should be noted that other types of signals, for example noise-like signals, also have a flat spectrum without representing a transient. However, while the spectral bins of noise-like signals have uncorrelated or weakly correlated phase values, there is often a very significant phase correlation across spectral bins in the presence of a transient.
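By way of a non-limiting illustration, the difference in spectral flatness between a transient and a tonal signal portion may be sketched as follows. The measure used here (ratio of the geometric mean to the arithmetic mean of the power spectrum) is a common flatness measure chosen for this example and is not prescribed by the text above; the function names, the block length `N` and the small offset `eps` are likewise assumptions.

```python
import cmath
import math


def power_spectrum(x):
    """Power of the non-negative-frequency DFT bins of a real block x."""
    n = len(x)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum


def spectral_flatness(x, eps=1e-12):
    """Geometric mean divided by arithmetic mean of the power spectrum.

    Values near 1 indicate a flat spectrum, values near 0 a spectrum
    dominated by few spectral lines. eps (assumed) avoids log(0).
    """
    p = [v + eps for v in power_spectrum(x)]
    geometric = math.exp(sum(math.log(v) for v in p) / len(p))
    arithmetic = sum(p) / len(p)
    return geometric / arithmetic


N = 64
# Idealized transient: a single click in the middle of the block.
click = [1.0 if t == N // 2 else 0.0 for t in range(N)]
# Idealized tonal portion: a sinusoid centered on DFT bin 4.
tone = [math.sin(2 * math.pi * 4 * t / N) for t in range(N)]

flatness_click = spectral_flatness(click)  # close to 1
flatness_tone = spectral_flatness(tone)    # close to 0
```

As noted above, a flat spectrum alone does not identify a transient, since a noise-like signal would also yield a high flatness value; the phase relationship between the spectral bins has to be considered as well.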
Typically, a transient event is a strong change in a time domain representation of the audio signal, which means that the signal will include many higher frequency components when a Fourier decomposition is performed. An important feature of these many higher harmonics is that their phases are in a very specific mutual relationship, so that the superposition of all the harmonics results in a rapid change of signal energy (when considered in the time domain). In other words, there exists a strong correlation across the spectrum in the proximity of a transient event. The specific phase situation among all harmonics can also be termed “vertical coherence”. This “vertical coherence” is related to a time/frequency spectrogram representation of the signal, in which the horizontal direction corresponds to the evolution of the signal over time and the vertical direction describes the dependency of the spectral components of a short-time spectrum on frequency.
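The role of the phase relationship may be illustrated by the following sketch, which is given under simplifying assumptions: a complex-valued block is used for brevity, all bins have unit magnitude, and the names `idft`, `peak_energy_fraction`, `N` and `T0` are hypothetical. With phases that depend linearly on the bin index, all components add up at a single time instant; with the same magnitudes but random phases, the energy is spread over the block.

```python
import cmath
import math
import random


def idft(spectrum):
    """Inverse DFT of a complex spectrum (direct O(n^2) evaluation)."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def peak_energy_fraction(signal):
    """Share of the total energy carried by the strongest sample."""
    energy = [abs(v) ** 2 for v in signal]
    return max(energy) / sum(energy)


N = 64   # block length (assumed)
T0 = 20  # time index of the idealized transient (assumed)

# Vertically coherent spectrum: unit magnitudes, phase linear in the bin index.
coherent = [cmath.exp(-2j * math.pi * k * T0 / N) for k in range(N)]

# Same magnitudes, but phases drawn at random (seeded for repeatability).
rng = random.Random(0)
scrambled = [cmath.exp(1j * rng.uniform(-math.pi, math.pi)) for _ in range(N)]

pulse = idft(coherent)     # energy concentrated at t = T0
smeared = idft(scrambled)  # energy spread over the whole block
```

Both spectra are equally flat in magnitude, yet only the coherent one yields a sharp event in the time domain, which is consistent with the distinction drawn above between transients and noise-like signals.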
If, for example, a modification such as quantization is applied to a block covering a long time span, the modification influences the entire block. Since transients are characterized by a short-term increase in energy, this energy will probably be smeared across the entire region represented by the block when the block is modified.
The problem also becomes particularly evident when the reproduction speed of a signal is changed while the pitch is maintained, or when the signal is transposed while the original duration of the reproduction is maintained. Both may be accomplished using a phase vocoder or a method such as (P)SOLA (refer to references [A1] to [A4] regarding this issue). The latter, i.e. transposition, is achieved by reproducing the time-stretched signal accelerated by the stretch factor. With a time-discrete signal representation, this corresponds to downsampling the signal by the stretch factor while maintaining the sampling frequency. Methods of time stretching such as the phase vocoder are actually suited only for stationary or quasi-stationary signals, since transients are “smeared” in time by dispersion. The phase vocoder impairs the so-called vertical coherence properties (related to a time/frequency spectrogram representation) of the signal.
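The relationship between time stretching and transposition may be sketched as follows. For simplicity, the time-stretched signal is assumed to be ideal and is synthesized directly (a tone of unchanged pitch and doubled duration) rather than produced by a phase vocoder. The linear-interpolation resampler, without anti-aliasing filtering, and all names and parameter values are assumptions made for this example only.

```python
import math


def resample_linear(x, factor):
    """Read x at 'factor' times the original speed using linear interpolation.

    The output has len(x) / factor samples at the unchanged sampling rate.
    No anti-aliasing filter is applied in this sketch.
    """
    n_out = int(len(x) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        i0 = int(pos)
        i1 = min(i0 + 1, len(x) - 1)
        frac = pos - i0
        out.append((1.0 - frac) * x[i0] + frac * x[i1])
    return out


FS = 8000.0    # sampling frequency in Hz (assumed)
F0 = 200.0     # pitch of the original tone in Hz (assumed)
N = 400        # original duration in samples (assumed)
STRETCH = 2.0  # time-stretch factor (assumed)

# Idealized result of time stretching: same pitch, STRETCH times the duration.
stretched = [math.sin(2 * math.pi * F0 * t / FS) for t in range(int(N * STRETCH))]

# Downsampling by the stretch factor restores the original duration
# and raises the pitch by the same factor.
transposed = resample_linear(stretched, STRETCH)
```

In this example, `transposed` has the original length of `N` samples and corresponds to a tone at `STRETCH * F0`, i.e. 400 Hz.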
Time stretching of audio signals plays an important role in both entertainment and the arts. Common algorithms are based on overlap and add (OLA) techniques, such as the Phase Vocoder (PV), Synchronous Overlap Add (SOLA), Pitch Synchronous Overlap Add (PSOLA), and Waveform Similarity Overlap Add (WSOLA). While these algorithms are capable of changing the replay speed of audio signals while preserving their original pitch, transients are not well preserved. Time stretching of an audio signal without altering its pitch using OLA requires separate processing of the transients and the sustained signal portions in order to avoid transient dispersion [B1] and the time domain aliasing which often occurs with WSOLA and SOLA. A particularly challenging task is to stretch a combination of a very tonal signal, such as a pitch pipe, and a percussive signal, such as castanets.
In the following, reference will be made to some conventional approaches in order to provide the background of the present invention.
Some current methods stretch the time around the transients more intensely, so that no or only little time stretching has to be performed over the duration of the transient (see, for example, references [5] to [8]).
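The amount by which the surroundings have to be stretched more intensely follows from simple duration bookkeeping, sketched below under the assumptions that the transient portions are not stretched at all (stretch factor of one) and that the overall duration is to equal the overall stretch factor times the original duration. The function name and its parameters are hypothetical and are not taken from the cited references.

```python
def surrounding_stretch(total, transient, overall):
    """Stretch factor for the non-transient portions.

    total     -- original duration of the signal (e.g. in seconds)
    transient -- summed duration of the transient portions, left unstretched
    overall   -- desired overall stretch factor

    Solves: transient * 1 + (total - transient) * s == overall * total
    """
    steady = total - transient
    if steady <= 0:
        raise ValueError("no non-transient portion left to stretch")
    return (overall * total - transient) / steady


# Example: 10 s of audio containing 1 s of transients, overall stretch of 2.
# The remaining 9 s have to be stretched to 19 s, i.e. by a factor of 19/9.
factor = surrounding_stretch(10.0, 1.0, 2.0)
```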
The following articles and patents describe methods of time and/or pitch manipulation: [A1], [A2], [A3], [A4], [A5], [A6], [A7], [A8].
In [B2], a method is proposed that approximately preserves the envelope of a signal in the time-stretched version as well as its spectral characteristics. This approach expects a time-dilated percussive event to decay more slowly than the original.
Several widely known methods allow transients and stationary signal components to be processed separately, for instance by modelling a signal as a sum of sinusoids, transients, and noise (S+T+N) [B4, B5]. In order to preserve transients after time scale modification, all three parts are stretched separately. This technique is capable of perfectly preserving transient components of audio signals. The resulting sound is, however, often perceived as unnatural.
Further approaches vary the amount of time stretching, setting the stretch factor to one for the duration of the transient, or lock the phase at the transient event [B3, B6, B7].
The paper [B8] demonstrates how transients can be preserved in time and frequency stretching with the PV. In that approach, transients were cut out from the signal before it was stretched. The removal of the transient parts resulted in gaps within the signal which were stretched by the PV process. After the stretching, the transients were re-added to the signal with a surrounding that fitted the stretched gaps.
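The sequence of steps described for [B8] (cutting out, stretching, re-adding) may be sketched as follows. This is a bookkeeping illustration only and not the method of [B8]: the phase vocoder is replaced by a linear-interpolation stand-in (which, unlike a phase vocoder, does not preserve pitch), the gap is filled with zeros, the transient is placed at the beginning of the stretched gap, and the fitting of the surrounding is omitted. All function and parameter names are hypothetical.

```python
def stretch_linear(x, factor):
    """Stand-in for a phase vocoder (assumption): linear interpolation to
    round(len(x) * factor) samples. Used only to show the index bookkeeping."""
    n_out = int(round(len(x) * factor))
    out = []
    for i in range(n_out):
        pos = i / factor
        i0 = min(int(pos), len(x) - 1)
        i1 = min(i0 + 1, len(x) - 1)
        frac = pos - i0
        out.append((1.0 - frac) * x[i0] + frac * x[i1])
    return out


def stretch_preserving_transient(x, start, end, factor):
    """Cut out x[start:end], stretch the rest, put the transient back unstretched.

    Assumes factor >= 1 so that the transient fits into the stretched gap.
    """
    transient = x[start:end]
    # Removing the transient leaves a gap, modelled here as zeros.
    gapped = x[:start] + [0.0] * (end - start) + x[end:]
    stretched = stretch_linear(gapped, factor)
    # The gap now begins at start * factor; the transient is placed there.
    pos = int(round(start * factor))
    stretched[pos:pos + len(transient)] = transient
    return stretched


# Usage: a ramp stands in for the audio samples; samples 8 and 9 are
# treated as the transient portion.
x = [float(i) for i in range(20)]
y = stretch_preserving_transient(x, 8, 10, 2.0)
```

In this sketch, the two transient samples appear unmodified at positions 16 and 17 of the stretched output, while the remainder of the signal is stretched around them.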
In view of the above, there is a need for a concept for manipulating an audio signal comprising a transient event which provides an output signal of improved perceived quality.