Several embodiments of the present invention relate to audio processors for generating a processed representation of a framed audio signal using pitch-dependent sampling and re-sampling of the signals.
Cosine- or sine-based modulated lapped transforms, which correspond to modulated filter banks, are often used in source coding applications because of their energy compaction properties. That is, for harmonic tones with constant fundamental frequencies (pitch), they concentrate the signal energy in a small number of spectral components (sub-bands), which leads to efficient signal representations. Generally, the pitch of a signal shall be understood to be the lowest dominant frequency distinguishable in the spectrum of the signal. In the common speech model, the pitch is the frequency of the excitation signal modulated by the human throat. If only a single fundamental frequency were present, the spectrum would be extremely simple, comprising the fundamental frequency and its overtones only. Such a spectrum could be encoded highly efficiently. For signals with varying pitch, however, the energy corresponding to each harmonic component is spread over several transform coefficients, thus leading to a reduction of coding efficiency.
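The energy compaction effect described above can be illustrated with a small numerical sketch. The example below is not taken from any embodiment; it assumes a naive single-frame MDCT with a sine window, a synthetic four-harmonic test tone, and a 90 % energy threshold, all chosen purely for illustration. The helper names `harmonic_tone`, `mdct` and `coeffs_for_energy` are hypothetical.

```python
import math

def harmonic_tone(num_samples, f_start, f_end, harmonics=4):
    """Harmonic test signal whose fundamental (in cycles per sample)
    glides linearly from f_start to f_end; f_start == f_end gives a
    constant pitch. Illustrative assumption, not a speech model."""
    out = []
    phase = 0.0
    for i in range(num_samples):
        f = f_start + (f_end - f_start) * i / (num_samples - 1)
        phase += 2.0 * math.pi * f
        out.append(sum(math.sin(h * phase) / h for h in range(1, harmonics + 1)))
    return out

def mdct(frame):
    """Naive O(N^2) MDCT of one sine-windowed frame of 2N samples,
    returning N coefficients."""
    two_n = len(frame)
    n = two_n // 2
    windowed = [x * math.sin(math.pi * (i + 0.5) / two_n)
                for i, x in enumerate(frame)]
    return [
        sum(windowed[i] * math.cos(math.pi / n * (i + 0.5 + n / 2.0) * (k + 0.5))
            for i in range(two_n))
        for k in range(n)
    ]

def coeffs_for_energy(coeffs, fraction=0.9):
    """Smallest number of coefficients that together hold the given
    fraction of the total energy."""
    energies = sorted((c * c for c in coeffs), reverse=True)
    total = sum(energies)
    acc = 0.0
    for count, e in enumerate(energies, 1):
        acc += e
        if acc >= fraction * total:
            return count
    return len(energies)

FRAME = 256  # frame length in samples (assumed)
n_constant = coeffs_for_energy(mdct(harmonic_tone(FRAME, 0.02, 0.02)))
n_gliding = coeffs_for_energy(mdct(harmonic_tone(FRAME, 0.02, 0.05)))
print("coefficients for 90% energy, constant pitch:", n_constant)
print("coefficients for 90% energy, gliding pitch: ", n_gliding)
```

Under these assumptions the constant-pitch frame needs fewer coefficients than the gliding-pitch frame to reach the same energy fraction, which is the spreading effect the paragraph refers to.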
One could try to improve coding efficiency for signals with varying pitch by first creating a time-discrete signal with a virtually constant pitch. To achieve this, the sampling rate could be varied proportionally to the pitch. That is, one could re-sample the whole signal prior to the application of the transform such that the pitch is as constant as possible over the whole signal duration. This could be achieved by non-equidistant sampling, wherein the sampling intervals are locally adaptive and chosen such that the re-sampled signal, when interpreted in terms of equidistant samples, has a pitch contour closer to a common mean pitch than the original signal. In this sense, the pitch contour shall be understood to be the local variation of the pitch. The local variation could, for example, be parameterized as a function of time or of sample number.
Equivalently, this operation could be seen as a rescaling of the time axis of a sampled or continuous signal prior to equidistant sampling. Such a transformation of time is also known as warping. Applying a frequency transform to a signal that has been preprocessed to have a nearly constant pitch could bring the coding efficiency close to that achievable for a signal with an inherently constant pitch.
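The re-sampling idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of any cited application: the pitch contour is assumed to be known per input sample (in cycles per sample), the sampling positions are chosen so that each output sample advances the pitch phase by the same amount, and fractional positions are read by plain linear interpolation. The names `warp_positions`, `resample` and `periods` are hypothetical.

```python
import math

def warp_positions(pitch, num_out):
    """Fractional input positions such that every output sample
    advances the accumulated pitch phase by an equal amount."""
    cum = [0.0]                      # accumulated phase in cycles at each input sample
    for f in pitch[:-1]:
        cum.append(cum[-1] + f)
    total = cum[-1]
    last = len(cum) - 1
    positions = []
    j = 0
    for m in range(num_out):
        target = total * m / (num_out - 1)
        while j < last - 1 and cum[j + 1] < target:
            j += 1
        span = cum[j + 1] - cum[j]
        frac = (target - cum[j]) / span if span > 0.0 else 0.0
        positions.append(min(j + frac, float(last)))
    return positions

def resample(signal, positions):
    """Read the signal at fractional positions by linear interpolation."""
    last = len(signal) - 1
    out = []
    for p in positions:
        i = min(int(p), last - 1)
        frac = p - i
        out.append((1.0 - frac) * signal[i] + frac * signal[i + 1])
    return out

def periods(signal):
    """Distances between successive upward zero crossings, in samples."""
    crossings = []
    for i in range(1, len(signal)):
        if signal[i - 1] < 0.0 <= signal[i]:
            frac = -signal[i - 1] / (signal[i] - signal[i - 1])
            crossings.append(i - 1 + frac)
    return [b - a for a, b in zip(crossings, crossings[1:])]

# Synthetic input: pitch rises linearly from 0.01 to 0.02 cycles per sample.
NUM = 2000
pitch = [0.01 + 0.01 * i / (NUM - 1) for i in range(NUM)]
phase = [0.0]
for f in pitch[:-1]:
    phase.append(phase[-1] + f)
original = [math.sin(2.0 * math.pi * c) for c in phase]

warped = resample(original, warp_positions(pitch, NUM))

orig_p = periods(original)
warp_p = periods(warped)
print("original period range:", min(orig_p), max(orig_p))
print("warped period range:  ", min(warp_p), max(warp_p))
```

When the warped samples are interpreted as equidistant, the measured period is nearly constant, whereas the original signal's period changes substantially over the same duration. A practical system would likely use a better interpolator than the linear one assumed here.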
The approach described above, however, has several drawbacks. First, a variation of the sampling rate over a large range, as necessitated by the processing of the complete signal, could lead to a strongly varying signal bandwidth due to the sampling theorem. Second, each block of transform coefficients representing a fixed number of input samples would then represent a time segment of varying duration in the original signal. This would make applications with limited coding delay nearly impossible and would, furthermore, result in difficulties in synchronization.
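The second drawback can be made concrete with a short calculation. The sketch below assumes a sampling interval inversely proportional to the local pitch, a linearly rising pitch contour, and a block size of 1024 re-sampled samples; these values and the name `block_durations` are illustrative assumptions only.

```python
def block_durations(pitch_at, total_len, block_size, target_pitch):
    """Duration, in original samples, covered by each block of
    block_size re-sampled samples when the local sampling interval
    is target_pitch / pitch_at(t)."""
    durations = []
    t = 0.0
    block_start = 0.0
    count = 0
    while True:
        step = target_pitch / pitch_at(t)   # longer steps where the pitch is low
        if t + step > total_len:
            break
        t += step
        count += 1
        if count == block_size:
            durations.append(t - block_start)
            block_start = t
            count = 0
    return durations

TOTAL = 8000  # length of the original signal in samples (assumed)

def rising_pitch(t):
    """Pitch in cycles per sample, rising linearly from 0.01 to 0.02."""
    return 0.01 + 0.01 * t / TOTAL

durations = block_durations(rising_pitch, TOTAL, block_size=1024, target_pitch=0.015)
for index, d in enumerate(durations):
    print("block", index, "covers", round(d, 1), "original samples")
```

In this setup the early blocks, where the pitch is low, span more of the original signal than the later ones, so a fixed block size no longer corresponds to a fixed coding delay.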
A further method is proposed by the applicants of international patent application 2007/051548, who perform the warping on a per-frame basis. However, this is achieved by imposing undesirable constraints on the applicable warp contours.
Therefore, a need exists for alternative approaches that increase the coding efficiency while maintaining a high quality of the encoded and decoded audio signals.