There are many techniques for compressing (with loss) an audio frequency signal such as speech or music. The coding can be performed directly at the sampling frequency of the input signal, as for example in the ITU-T recommendations G.711 or G.729 in which the input signal is sampled at 8 kHz and the coder and decoder operate at this same frequency.
However, some coding methods use a change of sampling frequency, for example to reduce the complexity of the coding, adapt the coding according to the different frequency subbands to be coded, or convert the input signal for it to correspond to a predefined internal sampling frequency of the coder.
In the subband coding defined in the ITU-T recommendation G.722, the input signal at 16 kHz is divided into 2 subbands (sampled at 8 kHz) which are coded separately by a coder of ADPCM (Adaptive Differential Pulse Code Modulation) type. This division into two subbands is carried out by a bank of quadrature mirror filters (QMF) with Finite Impulse Response (FIR), of order 23, which theoretically results in an analysis-synthesis delay (coder+decoder) of 23 samples at 16 kHz; this filter bank is implemented with a polyphase realization. The division into two subbands in G.722 makes it possible to allocate, in a predetermined manner, different bit rates to the two subbands as a function of their a priori perceptual importance and also to reduce the overall coding complexity by executing two coders of ADPCM type at a lower frequency. However, it induces an algorithmic delay compared to a direct ADPCM coding.
Various methods for changing the sampling frequency, also called resampling, of a digital signal are known, by using, for example and in a nonexhaustive manner, an FIR (Finite Impulse Response) filter, an IIR (Infinite Impulse Response) filter or a polynomial interpolation (including splines). A review of the conventional resampling methods can be found for example in the article by R. W. Schafer, L. R. Rabiner, A Digital Signal Processing Approach to Interpolation, Proceedings of the IEEE, vol. 61, no. 6, June 1973, pp. 692-702.
The advantage of the FIR filter (symmetrical) lies in its simplified implementation and, subject to certain conditions, in the possibility of ensuring a linear phase. A linear phase filtering makes it possible to preserve the waveform of the input signal, but it can also be accompanied by a temporal spreading (or ringing) that can create artifacts of pre-echo type on transients. This method results in a delay (which is dependent on the length of the impulse response), generally of the order of 1 to a few ms to ensure appropriate filtering characteristics (ripple in the passband, rejection level sufficient to eliminate the aliasing or spectral images, etc.).
The alternative of resampling by IIR filter generally leads to a non-linear phase, unless the phase is compensated by an additional all-pass filtering stage as described for example in the article by P. A. Regalia, S. K. Mitra, P. P. Vaidyanathan, The Digital All-Pass Filter: A Versatile Signal Processing Building Block, Proceedings of the IEEE, vol. 76, no. 1, January 1988, with an exemplary realization in the “iirgrpdelay” routine of the MATLAB software. An IIR filter is generally of a lower order but more complex to implement in fixed-point arithmetic, since the states (or memories) of the filter can reach high dynamic values for the recursive part; this problem is amplified if a phase compensation by all-pass filtering is used.
FIG. 1 illustrates an example of down-sampling by a ratio of 4/5 with an FIR filter with a length of 2*60+1=121 coefficients at 64 kHz to change from 16 kHz to 12.8 kHz. The x-axes represent the time (expressed in ms so that signals clocked at different frequencies share a common scale) and the y-axes the amplitudes. The squares at the top represent the temporal positions of the samples of the input signal at 16 kHz; it is assumed here that these samples correspond to the end of a 20 ms frame. The continuous vertical lines mark the corresponding sampling instants at 16 kHz. At the bottom of the figure, the dotted vertical lines mark the corresponding sampling instants at 12.8 kHz and the stars symbolize the output samples at 12.8 kHz. Also represented is the impulse response (symmetrical) of 121 coefficients of an FIR filter at 64 kHz; this response is positioned to calculate the last sample of the current frame at 12.8 kHz (the position of the impulse response maximum is aligned with this sample). The circles show the values used (corresponding to the input sampling instants) in a polyphase representation; to obtain the output sample, these values are multiplied by the corresponding input samples and the results are added together. It will be noted in this figure that 12 samples (up to the end of the input frame) at 12.8 kHz cannot be calculated exactly because the input samples after the end of the current frame (start of the next frame) are not yet known; the down-sampling delay in the conditions of FIG. 1 is 12 samples, i.e. 12/12.8=0.9375 ms.
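The delay quoted for FIG. 1 follows from the half-length of the symmetric impulse response. A minimal Python sketch of that arithmetic is given below; the function name `fir_delay` is illustrative and does not come from any standard or codec source.

```python
from fractions import Fraction

def fir_delay(num_taps, filter_rate_hz, output_rate_hz):
    """Delay of a symmetric FIR filter, in output samples and in ms.

    A symmetric filter with num_taps coefficients delays the signal by
    (num_taps - 1) / 2 samples at the rate where the filter runs.
    """
    delay_s = Fraction(num_taps - 1, 2) / filter_rate_hz
    return delay_s * output_rate_hz, float(delay_s * 1000)

# Conditions of FIG. 1: 121 coefficients at 64 kHz, output at 12.8 kHz.
samples, ms = fir_delay(121, 64000, 12800)
print(samples, ms)  # 12 0.9375
```

The same 60 samples at 64 kHz correspond to 15 samples at the 16 kHz input rate.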
There are techniques for reducing the delay introduced by the changes of sampling frequency of FIR type.
In the 3GPP AMR-WB standard (also defined as the ITU-T recommendation G.722.2), the input signal sampled at 16 kHz is down-sampled at an internal frequency of 12.8 kHz before applying a coding of CELP type; the signal decoded at 12.8 kHz is then resampled at 16 kHz and combined with a high-band signal.
The advantage of passing through an intermediate frequency of 12.8 kHz is that it makes it possible to reduce the complexity of the CELP coding and also to have a frame length that is a multiple of a power of 2, which facilitates the coding of certain CELP parameters. The method used is a conventional resampling by a factor 4/5 by FIR filter (of 121 coefficients at 64 kHz), with a polyphase realization to minimize the complexity.
In theory, this resampling on the coder and on the AMR-WB decoder should result in a delay in a manner similar to the processing represented in FIG. 1. In the case of the AMR-WB codec, with an FIR filter of 121 coefficients, the total delay should in theory be 2×60 samples at 64 kHz, i.e. 2×15 samples at 16 kHz or 1.875 ms; in fact, a specific technique is implemented on the AMR-WB coder to eliminate (compensate) the associated delay in the coder part only and therefore divide the effective delay by 2.
This compensation method is described in the 3GPP standard TS 26.190, Clause 5.1 and in the ITU-T recommendation G.722.2, Clause 5.1. The method for compensating the FIR filtering delay consists in adding, for each new frame sampled at 16 kHz to be converted to 12.8 kHz, a predetermined number of zeros at the end of the current frame. These zeros are defined at the input sampling frequency and their number corresponds to the delay of the resampling FIR filter at this frequency (i.e. 15 zeros added at 16 kHz). The resampling is implemented per 20 ms frame (320 samples). The resampling in the AMR-WB coder is therefore equivalent to complementing the input frame of 320 samples on the left (toward the past) with 30 samples from the end of the preceding frame (resampling memory) and on the right with 15 zeros to form a vector of 30+320+15=365 samples, which is then resampled with a factor 4/5. The FIR filter can thus be implemented with a zero phase, therefore without delay, since a null future signal is added. In theory, the FIR resampling by a factor 4/5 is performed according to the following steps:
up-sampling by 4 (from 16 kHz to 64 kHz) by addition of 3 samples at 0 after each input sample;
low-pass filtering of transfer function Hdecim(z) of symmetrical FIR type of order 120 at 64 kHz;
down-sampling by 5 (from 64 kHz to 12.8 kHz) by keeping only one sample out of five from the low-pass filtered signal.
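The three theoretical steps, applied to the 365-sample vector described above, can be sketched as follows. This is only an illustration under stated assumptions: the low-pass filter is a hypothetical Hann-windowed sinc of 121 coefficients, not the Hdecim(z) coefficients of the AMR-WB codec, and the names `design_lowpass` and `resample_frame` are invented for the example.

```python
import math

def design_lowpass(half_len=60, up=4, down=5):
    """Hypothetical Hann-windowed sinc low-pass with 2*half_len+1 taps.

    The cutoff is the Nyquist frequency of the lower of the two rates and
    the gain `up` compensates for the inserted zeros. These are NOT the
    AMR-WB coefficients.
    """
    cutoff = 1.0 / max(up, down)  # relative to the Nyquist of the 64 kHz grid
    taps = []
    for n in range(-half_len, half_len + 1):
        x = n * cutoff
        sinc = 1.0 if n == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 + 0.5 * math.cos(math.pi * n / (half_len + 1))
        taps.append(up * cutoff * sinc * window)
    return taps

def resample_frame(memory, frame, taps, up=4, down=5, n_zeros=15):
    """Resample memory + frame + zeros by up/down with a zero-phase FIR."""
    half = (len(taps) - 1) // 2
    x = list(memory) + list(frame) + [0.0] * n_zeros
    # Step 1: up-sample by inserting up-1 zeros after each input sample.
    hi = []
    for s in x:
        hi.append(s)
        hi.extend([0.0] * (up - 1))
    # Steps 2 and 3: evaluate the symmetric filter only at every `down`-th
    # high-rate position, starting where a full half-response of past
    # signal is available.
    n_out = (len(hi) - 2 * half) // down
    out = []
    for k in range(n_out):
        c = half + k * down
        out.append(sum(h * hi[c - half + j] for j, h in enumerate(taps)))
    return out

taps = design_lowpass()
memory = [1.0] * 30   # end of the preceding frame (resampling memory)
frame = [1.0] * 320   # current 20 ms frame at 16 kHz
out = resample_frame(memory, frame, taps)
print(len(out))  # 268
```

With these lengths, the first 256 outputs depend only on true input samples, while the last 12 each involve at least one of the added zeros.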
In practice, this resampling is implemented in an equivalent manner according to an optimized polyphase realization without calculating the intermediate signal at 64 kHz and without concatenating the signal to be converted with zeros (see the “decim54.c” file of the source code of the AMR-WB codec); the FIR filtering for each “phase” is equivalent to an FIR filter of order 24 at 12.8 kHz with a delay of 12 samples at 12.8 kHz, i.e. 0.9375 ms.
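A polyphase realization of this kind can be sketched as below. The sketch is not the code of “decim54.c”; the function name `polyphase_resample` is hypothetical, and the demonstration deliberately uses a short triangular (linear-interpolation) kernel instead of the 121 coefficients, so that the result can be checked by hand.

```python
def polyphase_resample(x, taps, up=4, down=5):
    """Zero-phase resampling by up/down without forming the high-rate signal.

    For output k, centred at high-rate position k*down, only the taps that
    coincide with real input samples (positions m*up) are used; the taps
    that would multiply the inserted zeros are skipped.
    """
    half = (len(taps) - 1) // 2
    n_out = (len(x) * up) // down
    out = []
    for k in range(n_out):
        c = k * down
        m_lo = max(0, -((half - c) // up))        # ceil((c - half) / up)
        m_hi = min(len(x) - 1, (c + half) // up)  # floor((c + half) / up)
        acc = 0.0
        for m in range(m_lo, m_hi + 1):
            acc += taps[half + c - m * up] * x[m]
        out.append(acc)
    return out

# Assumed demo kernel: linear interpolation for up=4 (7 taps), chosen only
# because a ramp input is then reproduced exactly.
triangle = [0.25, 0.5, 0.75, 1.0, 0.75, 0.5, 0.25]
ramp = [float(n) for n in range(20)]
y = polyphase_resample(ramp, triangle)
print(y[:4])  # [0.0, 1.25, 2.5, 3.75]
```

Each output costs only the multiplications that involve real input samples, which is where the complexity saving of the polyphase form comes from.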
The result of the FIR resampling of each 20 ms frame from 16 kHz to 12.8 kHz is identical to a resampling performed on the “complete” input signal (i.e. not cut up into frames), except for the last 12 samples of each resulting frame at 12.8 kHz which include an error due to the use of a block of zeros as future signal instead of the “true” future signal which is available only on the next frame. In fact, the zeros introduced simulate the case of a null input signal in the next frame.
This processing is illustrated at the end of a 20 ms frame in FIG. 2 which represents the last input samples at 16 kHz by the squares at the top; the vertical lines mark the corresponding sampling instants at 16 kHz. At the bottom of the figure, the stars symbolize the output samples at 12.8 kHz which can be obtained by conventional down-sampling with a delay of 12 samples. Then, the triangles at the bottom correspond to the 12 samples at 12.8 kHz obtained by using at least one sample of null value added at the end of the frame to be able to continue the filtering and reduce the delay. These samples are numbered from #1 to #12 according to their position relative to the end of the output obtained with a conventional filtering. Also represented is the impulse response of the filter at 64 kHz used in the position corresponding to the last output sample at 12.8 kHz (the impulse response maximum is aligned with this sample). The circles show the values used (corresponding to the input sampling instants) in the polyphase representation; to obtain the output sample, these values are multiplied by the corresponding input sample, or by 0 for the values after the end of the frame, and the results are added together. It can be seen here that, for this last sample, almost half of the samples used from the impulse response are multiplied by the added zeros, which therefore introduces a significant estimation error. It will also be understood that the error of the first samples generated after the conventional filtering (that is to say with only the true input signal) is small (the weight of the impulse response at its end is low) and the error becomes greater with increasing distance from the conventional filtering (the weight of the impulse response then being greater). This can be observed in the results of FIG. 7.
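The growing share of the impulse response that falls on added zeros can be counted directly from the geometry of FIG. 2. The sketch below assumes the layout described earlier (30 memory samples and 320 frame samples followed by 15 zeros, a 121-coefficient filter on the 64 kHz grid); the function name `taps_on_zeros` is illustrative. It counts coefficients only; the resulting error also depends on the actual coefficient values and on the signal.

```python
def taps_on_zeros(i, half_len=60, up=4, down=5, n_true=350):
    """For extra output sample #i (1 to 12), return a pair:
    (coefficients landing on added zeros, coefficients landing on any
    input position). Positions are indices on the 64 kHz grid.
    """
    first_zero = n_true * up  # position of the first added zero
    # Sample #i lies i output steps after the last exactly computed output.
    centre = first_zero - half_len - down + i * down
    positions = [p for p in range(centre - half_len, centre + half_len + 1)
                 if p % up == 0]
    on_zeros = sum(1 for p in positions if p >= first_zero)
    return on_zeros, len(positions)

counts = [taps_on_zeros(i)[0] for i in range(1, 13)]
print(counts)             # [1, 2, 3, 4, 6, 7, 8, 9, 11, 12, 13, 14]
print(taps_on_zeros(12))  # (14, 30)
```

For sample #12, 14 of the 30 coefficients in use meet a zero, which matches the observation that almost half of the impulse response is multiplied by the added zeros.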
The delay compensation method used in the AMR-WB codec, where zeros are added at the end of each 20 ms block (or frame) to be resampled, makes it possible to eliminate the resampling delay on the coder, but it is generally not satisfactory when the values generated at the end of the current frame (with zeros added at the input) are coded directly and are not replaced by the true values once the input signal of the next frame is known. In fact, these regular errors at the end of each frame generate periodic discontinuities in the transition to the true output signal at the start of the next frame. These discontinuities are often audible and very annoying. This is why the delay compensation is applied only on the coder and only in the future signal part, called lookahead, and not on the AMR-WB decoder.
In fact, in the AMR-WB coder, each new 20 ms input frame at 16 kHz corresponds to a time segment made up of the last 15 ms of the current frame to be coded by the ACELP model and 5 ms of future signal (or lookahead). The first 5 ms of the current frame to be coded have already been received and stored as “lookahead” of the preceding segment. The last 12 samples obtained after resampling from 16 to 12.8 kHz on the coder therefore correspond to the last samples of the 5 ms future signal at 12.8 kHz. Consequently, the current 20 ms frame at 12.8 kHz (i.e. 256 samples) and the 5 ms of future signal (i.e. 64 samples) are complemented with 5 ms of past original signal (lookback) to form the LPC analysis buffer of 384 samples (30 ms) which is weighted by an LPC analysis window of the same length.
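The buffer sizes quoted above can be checked with a few lines of Python. The variable names are illustrative, and the position of the 12 approximate samples simply reflects the statement that they are the last samples of the lookahead.

```python
FS_INTERNAL_HZ = 12800  # internal sampling frequency

def ms_to_samples(duration_ms, rate_hz=FS_INTERNAL_HZ):
    """Number of samples in duration_ms at rate_hz (exact for these values)."""
    return duration_ms * rate_hz // 1000

lookback = ms_to_samples(5)    # past original signal
frame = ms_to_samples(20)      # current frame to be coded
lookahead = ms_to_samples(5)   # future signal
buffer_len = lookback + frame + lookahead

# Indices, within the LPC analysis buffer, of the 12 samples that were
# computed with added zeros.
approx = range(buffer_len - 12, buffer_len)

print(lookback, frame, lookahead, buffer_len)  # 64 256 64 384
print(approx[0], approx[-1])                   # 372 383
```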
The last 12 samples of the “lookahead” at 12.8 kHz, which contain a resampling error, have a very low relative weight in the window used for the linear prediction (LPC) and, a fortiori, they affect only the estimated LPC envelope, where their impact is negligible. It is important to note that the 12 erroneous samples are replaced by the “exact” resampling values on the next frame; the error is therefore present only temporarily in the current frame for the future signal (lookahead) and affects only the LPC analysis. Thus, the delay compensation technique of the AMR-WB coder does not affect the coding of the waveform of the signal in the current frame in the AMR-WB codec. This mode will hereinafter be referred to as: “use by frame with future signal”. The samples that are thus generated are only used temporarily for intermediate calculations (LPC analysis) and are replaced by the correctly resampled samples when the signal of the next frame is known. It will be noted that, in this configuration, for each output frame of length lg_out, lg_out+12 samples are generated by the resampling.
This delay compensation technique used on the AMR-WB coder is not applied to the AMR-WB decoder.
Thus, the codec (coder+decoder) has a total algorithmic delay of 25.9375 ms due to the coder (20 ms frame+5 ms lookahead) and to the resampling on the decoder (0.9375 ms).
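The delay budget can be reproduced with exact arithmetic. In the sketch below the constant names are illustrative, and the uncompensated figure is derived from the theoretical resampling delay of 2×60 samples at 64 kHz mentioned earlier; it is not a value quoted from the standard.

```python
from fractions import Fraction

FRAME_MS = Fraction(20)
LOOKAHEAD_MS = Fraction(5)
# One resampling stage: 60 samples at 64 kHz.
RESAMPLING_MS = Fraction(60, 64000) * 1000

# Coder-side resampling delay compensated, decoder-side not compensated.
with_compensation = FRAME_MS + LOOKAHEAD_MS + RESAMPLING_MS
# If the coder-side resampling delay were not compensated either.
without_compensation = with_compensation + RESAMPLING_MS

print(float(with_compensation), float(without_compensation))  # 25.9375 26.875
```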
The delay compensation technique of the AMR-WB coder could not be used to reduce the QMF filtering delay of the G.722 codec, because it would greatly degrade the quality of the coded signal. In effect, in the G.722 codec, the samples resulting from the filtering (the low-band and high-band signals) directly constitute the input signals of the two ADPCM sub-codecs which operate without “lookahead” and which do not make it possible to correct these values from one frame to another. This mode will hereinafter be referred to as: “continuous frame-by-frame use”.