Embodiments according to the invention create audio decoders for providing a decoded audio information on the basis of an encoded audio information.
Some embodiments according to the invention create methods for providing a decoded audio information on the basis of an encoded audio information.
Some embodiments according to the invention create computer programs for performing one of said methods.
Some embodiments according to the invention are related to a time domain concealment for a transform domain codec.
In recent years, there has been an increasing demand for digital transmission and storage of audio content. However, audio content is often transmitted over unreliable channels, which brings along the risk that data units (for example, packets) comprising one or more audio frames (for example, in the form of an encoded representation, like, for example, an encoded frequency domain representation or an encoded time domain representation) are lost. In some situations, it would be possible to request a repetition (resending) of lost audio frames (or of data units, like packets, comprising one or more lost audio frames). However, this would typically bring a substantial delay, and would therefore necessitate an extensive buffering of audio frames. In other cases, it is hardly possible to request a repetition of lost audio frames.
In order to obtain a good, or at least acceptable, audio quality when audio frames are lost, without providing extensive buffering (which would consume a large amount of memory and would also substantially degrade the real-time capabilities of the audio coding), it is desirable to have concepts to deal with a loss of one or more audio frames. In particular, it is desirable to have concepts which bring along a good audio quality, or at least an acceptable audio quality, even in the case that audio frames are lost.
In the past, some error concealment concepts have been developed, which can be employed in different audio coding concepts.
In the following, a conventional audio coding concept will be described.
In the 3GPP standard TS 26.290, a transform-coded-excitation decoding (TCX decoding) with error concealment is explained. In the following, some explanations will be provided, which are based on the section “TCX mode decoding and signal synthesis” in reference [1].
A TCX decoder according to the International Standard 3GPP TS 26.290 is shown in FIGS. 7 and 8, both of which show block diagrams of the TCX decoder. FIG. 7 shows those functional blocks which are relevant for the TCX decoding in a normal operation or in the case of a partial packet loss. In contrast, FIG. 8 shows the relevant processing of the TCX decoding in the case of TCX-256 packet erasure concealment.
Worded differently, FIGS. 7 and 8 show a block diagram of the TCX decoder including the following cases:
Case 1 (FIG. 8): Packet-erasure concealment in TCX-256 when the TCX frame length is 256 samples and the related packet is lost, i.e. BFI_TCX=(1); and
Case 2 (FIG. 7): Normal TCX decoding, possibly with partial packet losses.
In the following, some explanations will be provided regarding FIGS. 7 and 8.
As mentioned, FIG. 7 (indicated on drawings FIG. 7A and FIG. 7B) shows a block diagram of a TCX decoder performing a TCX decoding in normal operation or in the case of partial packet loss. The TCX decoder 700 according to FIG. 7 receives TCX specific parameters 710 and provides, on the basis thereof, decoded audio information 712, 714.
The audio decoder 700 comprises a demultiplexer “DEMUX TCX 720”, which is configured to receive the TCX-specific parameters 710 and the information “BFI_TCX”. The demultiplexer 720 separates the TCX-specific parameters 710 and provides an encoded excitation information 722, an encoded noise fill-in information 724 and an encoded global gain information 726. The audio decoder 700 comprises an excitation decoder 730, which is configured to receive the encoded excitation information 722, the encoded noise fill-in information 724 and the encoded global gain information 726, as well as some additional information (like, for example, a bitrate flag “bit_rate_flag”, an information “BFI_TCX” and a TCX frame length information). The excitation decoder 730 provides, on the basis thereof, a time domain excitation signal 728 (also designated with “x”). The excitation decoder 730 comprises an excitation information processor 732, which demultiplexes the encoded excitation information 722 and decodes algebraic vector quantization parameters. The excitation information processor 732 provides an intermediate excitation signal 734, which is typically in a frequency domain representation, and which is designated with Y. The excitation decoder 730 also comprises a noise injector 736, which is configured to inject noise in unquantized subbands, to derive a noise filled excitation signal 738 from the intermediate excitation signal 734. The noise filled excitation signal 738 is typically in the frequency domain, and is designated with Z. The noise injector 736 receives a noise intensity information 742 from a noise fill-in level decoder 740. The excitation decoder also comprises an adaptive low frequency de-emphasis 744, which is configured to perform a low-frequency de-emphasis operation on the basis of the noise filled excitation signal 738, to thereby obtain a processed excitation signal 746, which is still in the frequency domain, and which is designated with X′.
The excitation decoder 730 also comprises a frequency domain-to-time domain transformer 748, which is configured to receive the processed excitation signal 746 and to provide, on the basis thereof, a time domain excitation signal 750, which is associated with a certain time portion represented by a set of frequency domain excitation parameters (for example, of the processed excitation signal 746). The excitation decoder 730 also comprises a scaler 752, which is configured to scale the time domain excitation signal 750 to thereby obtain a scaled time domain excitation signal 754. The scaler 752 receives a global gain information 756 from a global gain decoder 758, wherein, in return, the global gain decoder 758 receives the encoded global gain information 726. The excitation decoder 730 also comprises an overlap-add synthesis 760, which receives scaled time domain excitation signals 754 associated with a plurality of time portions. The overlap-add synthesis 760 performs an overlap-and-add operation (which may include a windowing operation) on the basis of the scaled time domain excitation signals 754, to obtain a temporally combined time domain excitation signal 728 for a longer period in time (longer than the periods in time for which the individual time domain excitation signals 750, 754 are provided).
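The overlap-and-add operation performed by the overlap-add synthesis 760 may be illustrated by the following minimal Python sketch. The function names, the hop size and the squared-sine window are assumptions chosen for illustration only; the window shape and the overlap length actually used by the standard are not given in the above description.

```python
import math

def overlap_add(frames, hop, window=None):
    """Sum equal-length frames placed `hop` samples apart (illustrative only)."""
    if not frames:
        return []
    frame_len = len(frames[0])
    out = [0.0] * (hop * (len(frames) - 1) + frame_len)
    for k, frame in enumerate(frames):
        for n, sample in enumerate(frame):
            # Optional windowing before the frames are summed.
            w = 1.0 if window is None else window[n]
            out[k * hop + n] += w * sample
    return out

def squared_sine_window(length):
    """Hypothetical window; copies shifted by half its length sum to one."""
    return [math.sin(math.pi * (n + 0.5) / length) ** 2 for n in range(length)]

# Three constant frames of 4 samples, overlapping by 2 samples each.
frames = [[1.0] * 4 for _ in range(3)]
combined = overlap_add(frames, hop=2, window=squared_sine_window(4))
```

In this example, the fully overlapped middle portion of `combined` is reconstructed with unit gain, whereas the first and last two samples carry only the fade-in and fade-out of a single frame.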
The audio decoder 700 also comprises an LPC synthesis 770, which receives the time domain excitation signal 728 provided by the overlap-add synthesis 760 and one or more LPC coefficients defining an LPC synthesis filter function 772. The LPC synthesis 770 may, for example, comprise a first filter 774, which may, for example, synthesis-filter the time domain excitation signal 728, to thereby obtain the decoded audio signal 712. Optionally, the LPC synthesis 770 may also comprise a second synthesis filter 772 which is configured to synthesis-filter the output signal of the first filter 774 using another synthesis filter function, to thereby obtain the decoded audio signal 714.
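A synthesis filtering of the kind performed by the first filter 774 may be sketched as follows. The sketch assumes an all-pole filter 1/A(z) with A(z) = 1 + a1·z^-1 + a2·z^-2 + …, and it assumes that the filter starts from a zero state; the function name `lpc_synthesis`, the sign convention and the handling of the filter memory are illustrative assumptions and are not taken from the standard.

```python
def lpc_synthesis(excitation, lpc):
    """All-pole filtering 1/A(z), assuming A(z) = 1 + lpc[0]*z^-1 + lpc[1]*z^-2 + ..."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        # Subtract the contributions of previously synthesized samples.
        for k, a_k in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a_k * out[n - k]
        out.append(acc)
    return out

# Impulse response of 1 / (1 - 0.5 z^-1)
impulse_response = lpc_synthesis([1.0, 0.0, 0.0, 0.0], [-0.5])
```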
In the following, the TCX decoding will be described in the case of a TCX-256 packet erasure concealment. FIG. 8 shows a block diagram of the TCX decoder in this case.
The packet erasure concealment 800 receives a pitch information 810, which is also designated with “pitch_tcx”, and which is obtained from a previous decoded TCX frame. For example, the pitch information 810 may be obtained using a dominant pitch estimator 747 from the processed excitation signal 746 in the excitation decoder 730 (during the “normal” decoding). Moreover, the packet erasure concealment 800 receives LPC parameters 812, which may represent an LPC synthesis filter function. The LPC parameters 812 may, for example, be identical to the LPC parameters 772. Accordingly, the packet erasure concealment 800 may be configured to provide, on the basis of the pitch information 810 and the LPC parameters 812, an error concealment signal 814, which may be considered as an error concealment audio information. The packet erasure concealment 800 comprises an excitation buffer 820, which may, for example, buffer a previous excitation. The excitation buffer 820 may, for example, make use of the adaptive codebook of ACELP, and may provide an excitation signal 822. The packet erasure concealment 800 may further comprise a first filter 824, a filter function of which may be defined as shown in FIG. 8. Thus, the first filter 824 may filter the excitation signal 822 on the basis of the LPC parameters 812, to obtain a filtered version 826 of the excitation signal 822. The packet erasure concealment also comprises an amplitude limiter 828, which may limit an amplitude of the filtered excitation signal 826 on the basis of target information or level information rmswsyn. Moreover, the packet erasure concealment 800 may comprise a second filter 832, which may be configured to receive the amplitude limited filtered excitation signal 830 from the amplitude limiter 828 and to provide, on the basis thereof, the error concealment signal 814. A filter function of the second filter 832 may, for example, be defined as shown in FIG. 8.
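The signal flow of FIG. 8 (excitation buffer, first filter, amplitude limiter, second filter) may be illustrated by the following minimal Python sketch. It assumes that the first filter 824 corresponds to Â(z/γ)/Â(z)·1/(1−αz^-1) and that the second filter 832 corresponds to (1−αz^-1)/Â(z/γ). The function names, the default values of γ and α (placeholders, not given in this description), the zero initial filter states and the periodic repetition of the last pitch_tcx buffered samples are illustrative assumptions.

```python
def iir(b, a, x):
    """Direct-form filter B(z)/A(z) with a[0] == 1 and zero initial state."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc)
    return y

def weighted(a_hat, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient is scaled by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(a_hat)]

def conceal_tcx256(past_exc, pitch_tcx, a_hat, rms_wsyn,
                   gamma=0.92, alpha=0.68, frame_len=256):
    """a_hat = [1, a1, a2, ...]; gamma and alpha defaults are placeholders."""
    if not 0 < pitch_tcx <= len(past_exc):
        raise ValueError("pitch_tcx must lie within the excitation buffer")
    # Assumption: the "past excitation delayed by T" is realised by
    # repeating the last T buffered samples periodically.
    last_period = past_exc[-pitch_tcx:]
    exc = [last_period[n % pitch_tcx] for n in range(frame_len)]
    a_w = weighted(a_hat, gamma)
    # First filter: A(z/gamma)/A(z), followed by 1/(1 - alpha z^-1).
    target = iir(a_w, a_hat, exc)
    target = iir([1.0], [1.0, -alpha], target)
    # Amplitude limiter: clip the magnitude to +/- rms_wsyn.
    limited = [max(-rms_wsyn, min(rms_wsyn, v)) for v in target]
    # Second filter: (1 - alpha z^-1)/A(z/gamma).
    return iir([1.0, -alpha], a_w, limited)

# Example with a hypothetical excitation buffer and a first-order LPC filter.
past = [((7 * i) % 11 - 5) / 5.0 for i in range(256)]
concealed = conceal_tcx256(past, 64, [1.0, -0.9], rms_wsyn=0.5)
```

If the limiter never becomes active, the two filters cancel except for the factor 1/Â(z), so that the sketch reduces to a plain synthesis filtering of the periodically repeated excitation.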
In the following, some details regarding the decoding and error concealment will be described.
In Case 1 (packet erasure concealment in TCX-256), no information is available to decode the 256-sample TCX frame. The TCX synthesis is found by processing the past excitation delayed by T, where T=pitch_tcx is a pitch lag estimated in the previously decoded TCX frame, by a non-linear filter roughly equivalent to $1/\hat{A}(z)$. A non-linear filter is used instead of $1/\hat{A}(z)$ to avoid clicks in the synthesis. This filter is decomposed in 3 steps:
Step 1: filtering by
$$\frac{\hat{A}(z/\gamma)}{\hat{A}(z)} \cdot \frac{1}{1 - \alpha z^{-1}}$$
to map the excitation delayed by T into the TCX target domain;
Step 2: applying a limiter (the magnitude is limited to ±rmswsyn);
Step 3: filtering by
$$\frac{1 - \alpha z^{-1}}{\hat{A}(z/\gamma)}$$
to find the synthesis. Note that the buffer OVLP_TCX is set to zero in this case.
Decoding of the Algebraic VQ Parameters
In Case 2, TCX decoding involves decoding the algebraic VQ parameters describing each quantized block $\hat{B}'_k$ of the scaled spectrum X′, where X′ is as described in Step 2 of Section 5.3.5.7 of 3GPP TS 26.290. Recall that X′ has dimension N, where N=288, 576 and 1152 for TCX-256, 512 and 1024 respectively, and that each block $B'_k$ has dimension 8. The number K of blocks $B'_k$ is thus 36, 72 and 144 for TCX-256, 512 and 1024 respectively. The algebraic VQ parameters for each block $B'_k$ are described in Step 5 of Section 5.3.5.7. For each block $B'_k$, three sets of binary indices are sent by the encoder:
a) the codebook index $n_k$, transmitted in unary code as described in Step 5 of Section 5.3.5.7;
b) the rank $I_k$ of a selected lattice point c in a so-called base codebook, which indicates what permutation has to be applied to a specific leader (see Step 5 of Section 5.3.5.7) to obtain a lattice point c;
c) and, if the quantized block $\hat{B}'_k$ (a lattice point) was not in the base codebook, the 8 indices of the Voronoi extension index vector k calculated in sub-step V1 of Step 5 in Section; from the Voronoi extension indices, an extension vector z can be computed as in reference [1] of 3GPP TS 26.290. The number of bits in each component of index vector k is given by the extension order r, which can be obtained from the unary code value of index $n_k$. The scaling factor M of the Voronoi extension is given by $M = 2^r$.
Then, from the scaling factor M, the Voronoi extension vector z (a lattice point in $RE_8$) and the lattice point c in the base codebook (also a lattice point in $RE_8$), each quantized scaled block $\hat{B}'_k$ can be computed as
$$\hat{B}'_k = M c + z$$
When there is no Voronoi extension (i.e. $n_k < 5$, M=1 and z=0), the base codebook is either codebook $Q_0$, $Q_2$, $Q_3$ or $Q_4$ from reference [1] of 3GPP TS 26.290. No bits are then needed to transmit vector k. Otherwise, when Voronoi extension is used because $\hat{B}'_k$ is large enough, then only $Q_3$ or $Q_4$ from reference [1] is used as a base codebook. The selection of $Q_3$ or $Q_4$ is implicit in the codebook index value $n_k$, as described in Step 5 of Section 5.3.5.7.
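The reconstruction of a quantized block from the base codebook point c, the Voronoi extension vector z and the extension order r may be illustrated by the following minimal Python sketch. It assumes that c and z have already been decoded from the transmitted indices (the index decoding itself is not shown), and it uses the commonly stated definition of the RE8 lattice (all components even or all components odd, with a component sum that is a multiple of 4) merely as a sanity check. The function names are hypothetical.

```python
def in_re8(v):
    """True if v lies in RE8, i.e. in 2*D8 or in 2*D8 + (1, ..., 1)."""
    if len(v) != 8 or sum(v) % 4 != 0:
        return False
    parities = {x % 2 for x in v}
    return len(parities) == 1  # all components even, or all odd

def reconstruct_block(c, z, r):
    """Compute M*c + z with M = 2**r (r = 0 means no Voronoi extension)."""
    if len(c) != 8 or len(z) != 8:
        raise ValueError("blocks have dimension 8")
    m = 1 << r
    return [m * ci + zi for ci, zi in zip(c, z)]

# Illustrative values only, not taken from an actual bitstream.
c = [2, 0, 0, 0, 0, 0, 2, 0]
z = [1, 1, 1, 1, 1, 1, 1, 1]
block = reconstruct_block(c, z, r=2)
```

With r=0 and z equal to the zero vector, the function simply returns the base codebook point c, which corresponds to the case without Voronoi extension.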
Estimation of the Dominant Pitch Value
The estimation of the dominant pitch is performed so that the next frame to be decoded can be properly extrapolated if it corresponds to TCX-256 and if the related packet is lost. This estimation is based on the assumption that the peak of maximal magnitude in the spectrum of the TCX target corresponds to the dominant pitch. The search for the maximum M is restricted to a frequency below Fs/64 kHz
$$M = \max_{i=1,\ldots,N/32} \left[ (X'_{2i})^2 + (X'_{2i+1})^2 \right]$$
and the minimal index $1 \le i_{max} \le N/32$ such that $(X'_{2i})^2 + (X'_{2i+1})^2 = M$ is also found. Then the dominant pitch is estimated in number of samples as $T_{est} = N/i_{max}$ (this value may not be an integer). Recall that the dominant pitch is calculated for packet-erasure concealment in TCX-256. To avoid buffering problems (the excitation buffer being limited to 256 samples), if $T_{est} > 256$ samples, pitch_tcx is set to 256; otherwise, if $T_{est} \le 256$, multiple pitch periods in 256 samples are avoided by setting pitch_tcx to
$$pitch\_tcx = \max\{\, \lfloor n\,T_{est} \rfloor \;\mid\; n \text{ integer} > 0 \text{ and } n\,T_{est} \le 256 \,\}$$
where $\lfloor\cdot\rfloor$ denotes the rounding to the nearest integer towards −∞.
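The dominant pitch estimation described above may be illustrated by the following minimal Python sketch. It assumes that the spectrum X′ is available as a flat list of N values indexed as in the formula for M; the function name `estimate_pitch_tcx` and the use of integer arithmetic (which avoids rounding issues when evaluating the floor operation) are illustrative choices and are not taken from the standard.

```python
def estimate_pitch_tcx(x_prime, buffer_len=256):
    """Return pitch_tcx from the spectrum X' (flat list of N values)."""
    n_dim = len(x_prime)
    i_max, peak = 1, -1.0
    for i in range(1, n_dim // 32 + 1):
        energy = x_prime[2 * i] ** 2 + x_prime[2 * i + 1] ** 2
        if energy > peak:  # strict comparison keeps the minimal index
            i_max, peak = i, energy
    # T_est = N / i_max. The largest n with n * T_est <= buffer_len is
    # floor(buffer_len * i_max / N); it is zero exactly if T_est > buffer_len.
    n = (buffer_len * i_max) // n_dim
    if n == 0:
        return buffer_len
    return (n * n_dim) // i_max  # floor(n * T_est)

# TCX-256 example (N = 288) with a single spectral peak at i = 7.
spectrum = [0.0] * 288
spectrum[14], spectrum[15] = 3.0, 4.0
pitch_tcx = estimate_pitch_tcx(spectrum)
```

In this example, T_est = 288/7 ≈ 41.14 samples, the largest admissible multiple is n = 6, and pitch_tcx becomes ⌊6·288/7⌋ = 246.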
In the following, some further conventional concepts will be briefly discussed.
In ISO/IEC DIS 23003-3 (reference [3]), a TCX decoding employing MDCT is explained in the context of the Unified Speech and Audio Codec.
In the AAC state of the art (confer, for example, reference [4]), only an interpolation mode is described. According to reference [4], the AAC core decoder includes a concealment function that increases the delay of the decoder by one frame.
European Patent EP 1207519 B1 (reference [5]) describes a speech decoder and an error compensation method capable of achieving further improvement for decoded speech in a frame in which an error is detected. According to the patent, a speech coding parameter includes mode information which expresses features of each short segment (frame) of speech. The speech coder adaptively calculates lag parameters and gain parameters used for speech decoding according to the mode information. Moreover, the speech decoder adaptively controls the ratio of adaptive excitation gain and fixed excitation gain according to the mode information. Moreover, the concept according to the patent comprises adaptively controlling adaptive excitation gain parameters and fixed excitation gain parameters used for speech decoding according to values of decoded gain parameters in a normal decoding unit in which no error is detected, immediately after a decoding unit whose coded data is detected to contain an error.
In view of the known technology, there is a need for an additional improvement of the error concealment, which provides for a better hearing impression.