Perceptual Audio Coding
The perceptual audio coding seen to date follows several common themes, including the use of time/frequency-domain processing, redundancy reduction (entropy coding), and irrelevancy removal through the pronounced exploitation of perceptual effects [1]. Typically, the input signal is analyzed by an analysis filter bank that converts the time domain signal into a spectral (time/frequency) representation. The conversion into spectral coefficients allows for selectively processing signal components depending on their frequency content (e.g. different instruments with their individual overtone structures).
In parallel, the input signal is analyzed with respect to its perceptual properties; specifically, the time- and frequency-dependent masking threshold is computed. This masking threshold is delivered to the quantization unit as a target coding threshold, in the form of an absolute energy value or a Mask-to-Signal-Ratio (MSR) for each frequency band and coding time frame.
The spectral coefficients delivered by the analysis filter bank are quantized to reduce the data rate needed for representing the signal. This step implies a loss of information and introduces a coding distortion (error, noise) into the signal. In order to minimize the audible impact of this coding noise, the quantizer step sizes are controlled according to the target coding thresholds for each frequency band and frame. Ideally, the coding noise injected into each frequency band is lower than the coding (masking) threshold, and thus no degradation in subjective audio quality is perceptible (removal of irrelevancy). This control of the quantization noise over frequency and time according to psychoacoustic requirements leads to a sophisticated noise shaping effect and is what makes the coder a perceptual audio coder.
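The relation between a per-band threshold and the quantizer step size can be sketched as follows. This is a minimal illustration, not the procedure of any particular codec: it assumes a uniform quantizer whose error is modeled as uniformly distributed noise with power step²/12 per coefficient, and the function names, band names, and numeric values are hypothetical.

```python
import math

def step_size_for_threshold(noise_energy_per_coeff):
    # Assumption: uniform quantization noise has power step**2 / 12,
    # so the step follows from the allowed noise energy per coefficient.
    return math.sqrt(12.0 * noise_energy_per_coeff)

def quantize_band(coeffs, step):
    # Map each spectral coefficient to the nearest integer multiple of step.
    return [round(c / step) for c in coeffs]

def dequantize_band(indices, step):
    # Decoder side: reconstruct coefficient values from the integer indices.
    return [i * step for i in indices]

# Hypothetical spectral coefficients and allowed noise energies per band.
bands = {"low": [0.9, -0.42, 0.31], "high": [0.05, -0.02, 0.07]}
thresholds = {"low": 1e-4, "high": 1e-3}

for name, coeffs in bands.items():
    step = step_size_for_threshold(thresholds[name])
    indices = quantize_band(coeffs, step)
    reconstructed = dequantize_band(indices, step)
    noise = sum((c - r) ** 2 for c, r in zip(coeffs, reconstructed)) / len(coeffs)
    print(name, "step:", step, "indices:", indices, "noise energy:", noise)
```

A band with a higher threshold receives a coarser step and therefore smaller integer indices, which is where the bit rate saving comes from. The step²/12 figure is a statistical average, so the measured noise in an individual band may lie above or below the target.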
Subsequently, modern audio coders perform entropy coding (e.g. Huffman coding, arithmetic coding) on the quantized spectral data. Entropy coding is a lossless coding step, which further saves on bit rate.
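As an illustration of the lossless entropy coding step, the following sketch builds a Huffman code for a short sequence of quantized indices. It is a textbook construction under simplifying assumptions (the code table is derived from the data itself, whereas practical coders typically rely on predefined codebooks); the function names and the sample data are hypothetical.

```python
import heapq
from collections import Counter

def build_huffman_table(symbols):
    counts = Counter(symbols)
    if len(counts) == 1:
        # Degenerate case: a single distinct symbol still needs one bit.
        return {next(iter(counts)): "0"}
    # Heap entries: (count, unique tie breaker, {symbol: code built so far}).
    heap = [(n, i, {s: ""}) for i, (s, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        n0, _, t0 = heapq.heappop(heap)
        n1, _, t1 = heapq.heappop(heap)
        # Merge the two least frequent subtrees, prefixing their codes.
        merged = {s: "0" + c for s, c in t0.items()}
        merged.update({s: "1" + c for s, c in t1.items()})
        heapq.heappush(heap, (n0 + n1, next_id, merged))
        next_id += 1
    return heap[0][2]

def huffman_encode(symbols, table):
    return "".join(table[s] for s in symbols)

def huffman_decode(bits, table):
    inverse = {code: s for s, code in table.items()}
    out, current = [], ""
    for b in bits:
        current += b
        if current in inverse:  # prefix-free codes allow greedy matching
            out.append(inverse[current])
            current = ""
    return out

# Hypothetical quantized indices; small values dominate, as is typical
# for coarsely quantized spectral data.
indices = [0, 0, 0, 0, 1, 0, -1, 0, 2, 0, 0, 1]
table = build_huffman_table(indices)
bits = huffman_encode(indices, table)
print(len(bits), "bits instead of", 2 * len(indices), "with a fixed 2-bit code")
```

Decoding with `huffman_decode(bits, table)` returns the original index sequence exactly, which is what distinguishes this step from the lossy quantization that precedes it.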
Finally, all coded spectral data and relevant additional parameters (side information, such as the quantizer settings for each frequency band) are packed together into a bitstream, which is the final coded representation intended for file storage or transmission.
Bandwidth Extension
In perceptual audio coding based on filter banks, the main part of the consumed bit rate is usually spent on the quantized spectral coefficients. Thus, at very low bit rates, not enough bits may be available to represent all coefficients with the precision needed for perceptually unimpaired reproduction. Low bit rate requirements therefore effectively limit the audio bandwidth that can be obtained by perceptual audio coding. Bandwidth extension [2] removes this longstanding fundamental limitation. The central idea of bandwidth extension is to complement a band-limited perceptual codec with an additional high-frequency processor that transmits and restores the missing high-frequency content in a compact parametric form.
The high-frequency content can be generated by single sideband modulation of the baseband signal, by copy-up techniques such as those used in Spectral Band Replication (SBR) [3], or by pitch shifting techniques such as the vocoder [4].
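The copy-up idea can be sketched in a few lines. This is a strongly simplified illustration and not SBR as described in [3]: it assumes real-valued spectral coefficients for a single frame, a patch formed by repeating the baseband, and one transmitted target energy per high band as the only parameter. The function name and all values are hypothetical.

```python
import math

def copy_up(baseband, target_energies, band_size):
    # Fill the missing high-frequency region by copying baseband coefficients.
    num_high = len(target_energies) * band_size
    patch = [baseband[k % len(baseband)] for k in range(num_high)]
    high = []
    for b, target in enumerate(target_energies):
        band = patch[b * band_size:(b + 1) * band_size]
        energy = sum(c * c for c in band)
        # Scale the copied band so that its energy matches the transmitted
        # target; an all-zero patch cannot be scaled and stays zero.
        gain = math.sqrt(target / energy) if energy > 0 else 0.0
        high.extend(gain * c for c in band)
    return list(baseband) + high

# Hypothetical decoded baseband and transmitted high-band energies.
baseband = [1.0, -0.5, 0.25, 0.8]
target_energies = [0.1, 0.02]
full_band = copy_up(baseband, target_energies, band_size=2)
print(full_band)
```

Only the energies are transmitted for the high bands, which is why the representation is compact. The phases of the regenerated coefficients are simply inherited from the copied baseband.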
Digital Audio Effects
Time-stretching or pitch shifting effects are usually obtained by applying time-domain techniques such as synchronized overlap-add (SOLA) or frequency-domain techniques (vocoder).
Hybrid systems that apply SOLA processing in subbands have also been proposed. Vocoders and hybrid systems usually suffer from an artifact called phasiness [8], which can be attributed to the loss of vertical phase coherence. Some publications report improvements in the sound quality of time stretching algorithms obtained by preserving vertical phase coherence where it is important [6][7].
State-of-the-art audio coders [1] usually compromise the perceptual quality of audio signals by neglecting important phase properties of the signal to be coded. A general proposal for correcting phase coherence in perceptual audio coders is given in [9].
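The effect of losing vertical phase coherence, that is, the phase relation between frequency components at one time instant, can be illustrated with a toy signal. The sketch below is an assumption-laden illustration rather than a model of any vocoder: it sums 16 equal-amplitude harmonics once with aligned phases and once with randomly scrambled phases. Both signals share the same magnitude spectrum, yet the waveform changes markedly. All names and parameter values are hypothetical.

```python
import math
import random

def harmonic_sum(phases, num_samples=256):
    # One period of a sum of equal-amplitude harmonics with given phases.
    return [
        sum(math.cos(2 * math.pi * (h + 1) * n / num_samples + p)
            for h, p in enumerate(phases))
        for n in range(num_samples)
    ]

def crest_factor(signal):
    # Peak-to-RMS ratio: high for pulse-like, low for noise-like waveforms.
    peak = max(abs(x) for x in signal)
    rms = math.sqrt(sum(x * x for x in signal) / len(signal))
    return peak / rms

num_harmonics = 16
rng = random.Random(0)  # fixed seed so the illustration is repeatable

# Aligned phases: all harmonics peak together and form a sharp pulse.
coherent = harmonic_sum([0.0] * num_harmonics)
# Scrambled phases: same magnitudes, but the pulse is smeared over time.
scrambled = harmonic_sum(
    [rng.uniform(-math.pi, math.pi) for _ in range(num_harmonics)]
)

print("crest factor, coherent: ", crest_factor(coherent))
print("crest factor, scrambled:", crest_factor(scrambled))
```

The coherent signal has the higher crest factor because its energy is concentrated in a short pulse, whereas the scrambled version spreads the same energy over the whole period.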
However, not all kinds of phase coherence errors can be corrected at the same time, and not all phase coherence errors are perceptually important. For example, in audio bandwidth extension it is not clear from the state of the art which phase coherence related errors should be corrected with highest priority and which errors can remain only partly corrected or, given their insignificant perceptual impact, be neglected entirely.
In particular, the application of audio bandwidth extension [2][3][4] often impairs the phase coherence over frequency and over time. The result is a dull sound that exhibits auditory roughness and may contain additionally perceived tones that separate from auditory objects in the original signal and are hence perceived as auditory objects of their own in addition to the original signal. Moreover, the sound may also appear to come from a far distance and be less “buzzy”, thus evoking little listener engagement [5].
Therefore, there is a need for an improved approach.