Embodiments according to the invention are related to an audio signal decoder. Further embodiments according to the invention are related to an audio signal encoder. Further embodiments according to the invention are related to a method for decoding an audio signal, to a method for encoding an audio signal and to a computer program.
Some embodiments according to the invention are related to a sampling frequency dependent pitch variation quantization.
In the following, a brief introduction will be given into the field of time-warped audio encoding, concepts of which can be applied in conjunction with some of the embodiments of the invention.
In the recent years, techniques have been developed to transform an audio signal to a frequency-domain representation, and to efficiently encode the frequency-domain representation, for example, by taking into account perceptual masking thresholds. This concept of audio signal encoding is particularly efficient if the block length, for which a set of encoded spectral coefficients are transmitted, is long, and if only a comparatively small number of spectral coefficients are well above the global masking threshold while a large number of spectral coefficients are nearby or below the global masking threshold and can thus be neglected (or coded with minimum code length). A spectrum in which said condition holds is sometimes called a sparse spectrum.
For example, cosine-based or sine-based modulated lapped transforms are often used in applications for source coding due to their energy compaction properties. That is, for harmonic tones with constant fundamental frequencies (pitch), they concentrate the signal energy to a low number of spectral components (sub-bands), which leads to an efficient signal representation.
Generally, the (fundamental) pitch of a signal shall be understood to be the lowest dominant frequency distinguishable from the spectrum of the signal. In the common speech model, the pitch is the frequency of the excitation signal modulated by the human throat. If only one single fundamental frequency would be present, the spectrum would be extremely simple, comprising the fundamental frequency and the overtones only. Such a spectrum could be encoded highly efficiently. For signals with varying pitch, however, the energy corresponding to each harmonic component is spread over several transform coefficients, thus leading to a reduction of coding efficiency.
In order to overcome the reduction of coding efficiency, the audio signal to be encoded is effectively resampled on a non-uniform temporal grid. In the subsequent processing, the sample positions obtained by the non-uniform resampling are processed as if they would represent values on a uniform temporal grid. This operation is commonly denoted by the phrase “time warping”. The sample times may be advantageously chosen in dependence on the temporal variation of the pitch, such that a pitch variation in the time warped version of the audio signal is smaller than a pitch variation in the original version of the audio signal (before time warping). After time warping of the audio signal, the time-warped version of the audio signal is converted into the frequency-domain. The pitch-dependent time warping has the effect that the frequency-domain representation of the time-warped audio signal typically exhibits an energy compaction into a much smaller number of spectral components than a frequency-domain representation of the original (non-time-warped audio signal).
At the decoder side the frequency-domain representation of the time-warped audio signal is converted to the time-domain, such that a time-domain representation of the time-warped audio signal is available at the decoder side. However, in the time-domain representation of the decoder-sided reconstructed time-warped audio signal, the original pitch variations of the encoder-sided input audio signal are not included. Accordingly, yet another time warping by resampling of the decoder-sided reconstructed time-domain representation of the time-warped audio signal is applied.
In order to obtain a good reconstruction of the encoder-sided input audio signal at the decoder, it is desirable that the decoder-skied time warping is at least approximately the inverse operation with respect to the encoder-sided time warping. In order to obtain an appropriate time warping, it is desirable to have an information available at the decoder, which allows for an adjustment of the decoder-sided time warping.
As it is typically necessitated to transfer such an information from the audio signal encoder to the audio signal decoder, it is desirable to keep the bitrate necessitated for this transmission small while still allowing for a reliable reconstruction of the necessitated time warp information at the decoder side.
In view of this situation, there is a desire to have a concept which allows for a reliable reconstruction of a time-warp information on the basis of an efficiently encoded representation of the time-warp information.