The present invention relates to audio signal processing, in particular to speech processing, and, more particularly, to an apparatus and a method for improved concealment of the adaptive codebook in ACELP-like concealment (ACELP=Algebraic Code Excited Linear Prediction).
Audio signal processing is becoming increasingly important. In the field of audio signal processing, concealment techniques play an important role. When a frame is lost or corrupted, the information from that frame has to be replaced. In speech signal processing, in particular when considering ACELP or ACELP-like speech codecs, pitch information is very important. Pitch prediction techniques and pulse resynchronization techniques are needed.
Regarding pitch reconstruction, different pitch extrapolation techniques exist in conventional technology.
One of these techniques is a repetition based technique. Most state of the art codecs apply a simple repetition based concealment approach, which means that the last correctly received pitch period before the packet loss is repeated until a good frame arrives and new pitch information can be decoded from the bitstream. Alternatively, a pitch stability logic is applied, according to which a pitch value is chosen that was received some time earlier before the packet loss. Codecs following the repetition based approach are, for example, G.719 (see G.719: Low-complexity, full-band audio coding for high-quality, conversational applications, Recommendation ITU-T G.719, Telecommunication Standardization Sector of ITU, June 2008, 8.6), G.729 (see G.729: Coding of speech at 8 kbit/s using conjugate-structure algebraic-code-excited linear prediction (cs-acelp), Recommendation ITU-T G.729, Telecommunication Standardization Sector of ITU, June 2012, 4.4), AMR (see [Adaptive multi-rate (AMR) speech codec; error concealment of lost frames (release 11), 3GPP TS 26.091, 3rd Generation Partnership Project, September 2012, 6.2.3.1], [ITU-T, Wideband coding of speech at around 16 kbit/s using adaptive multi-rate wideband (amr-wb), Recommendation ITU-T G.722.2, Telecommunication Standardization Sector of ITU, July 2003]), AMR-WB (see [Speech codec speech processing functions; adaptive multi-rate-wideband (AMRWB) speech codec; error concealment of erroneous or lost frames, 3GPP TS 26.191, 3rd Generation Partnership Project, September 2012, 6.2.3.4.2]) and AMR-WB+ (ACELP and TCX20 (ACELP like) concealment) (see 3GPP; Technical Specification Group Services and System Aspects, Extended adaptive multi-rate-wideband (AMR-WB+) codec, 3GPP TS 26.290, 3rd Generation Partnership Project, 2009); (AMR=Adaptive Multi-Rate; AMR-WB=Adaptive Multi-Rate-Wideband).
Another pitch reconstruction technique of conventional technology is pitch derivation from the time domain. For some codecs, the pitch is needed for concealment but is not embedded in the bitstream. Therefore, the pitch period is calculated from the time domain signal of the previous frame and is then kept constant during concealment. A codec following this approach is, for example, G.722, see, in particular, G.722 Appendix 3 (see [G.722 Appendix III: A high-complexity algorithm for packet loss concealment for G.722, ITU-T Recommendation, ITU-T, November 2006, III.6.6 and III.6.7]) and G.722 Appendix 4 (see G.722 Appendix IV: A low-complexity algorithm for packet loss concealment with G.722, ITU-T Recommendation, ITU-T, August 2007, IV.6.1.2.5).
A further pitch reconstruction technique of conventional technology is extrapolation based. Some state of the art codecs apply pitch extrapolation approaches and execute specific algorithms to change the pitch according to the extrapolated pitch estimates during the packet loss. These approaches are described in more detail in the following with reference to G.718 and G.729.1.
First, G.718 is considered (see G.718: Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s, Recommendation ITU-T G.718, Telecommunication Standardization Sector of ITU, June 2008). An estimation of the future pitch is conducted by extrapolation to support the glottal pulse resynchronization module. This information on the possible future pitch value is used to synchronize the glottal pulses of the concealed excitation.
The pitch extrapolation is conducted only if the last good frame was not UNVOICED. The pitch extrapolation of G.718 is based on the assumption that the encoder has a smooth pitch contour. Said extrapolation is conducted based on the pitch lags dfr[i] of the last seven subframes before the erasure.
In G.718, a history update of the floating pitch values is conducted after every correctly received frame. For this purpose, the pitch values are updated only if the core mode is other than UNVOICED. In the case of a lost frame, the difference Δdfr[i] between the floating pitch lags is computed according to the formula
$$\Delta_{dfr}[i] = d_{fr}[i] - d_{fr}[i-1] \quad \text{for } i = -1, \ldots, -6 \tag{1}$$
In formula (1), dfr[−1] denotes the pitch lag of the last (i.e. 4th) subframe of the previous frame; dfr[−2] denotes the pitch lag of the 3rd subframe of the previous frame; etc.
According to G.718, the sum of the differences Δdfr[i] is computed as
$$s_{\Delta} = \sum_{i=-1}^{-6} \Delta_{dfr}[i] \tag{2}$$
As the values Δdfr[i] can be positive or negative, the number of sign inversions of Δdfr[i] is summed and the position of the first inversion is indicated by a parameter being kept in memory.
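Formulas (1) and (2) and the sign-inversion bookkeeping can be sketched as follows. This is only an illustrative sketch, not the G.718 reference code: the function names are hypothetical, a sign inversion is assumed to mean a strictly negative product of two adjacent differences, and the position of the first inversion is assumed to be counted from the most recent pair of differences backwards (0 for the pair Δdfr[−1], Δdfr[−2]).

```python
def pitch_lag_differences(d_fr):
    """Formula (1): differences between the last seven floating pitch lags.

    d_fr holds seven pitch lags, oldest first, so that d_fr[-1] is the lag
    of the last subframe before the erasure (Python's negative indexing
    then matches the indices used in the text).
    Returns a dict delta[i] for i = -1, ..., -6.
    """
    if len(d_fr) != 7:
        raise ValueError("expected the pitch lags of seven subframes")
    return {i: d_fr[i] - d_fr[i - 1] for i in range(-1, -7, -1)}


def sum_and_sign_inversions(delta):
    """Formula (2) plus the sign-inversion count described in the text.

    Returns (s_delta, number_of_inversions, first_inversion_position).
    The position is None if no inversion occurs (assumed convention).
    """
    s_delta = sum(delta[i] for i in range(-1, -7, -1))
    inversions = 0
    first_position = None
    # Compare adjacent differences, starting with the most recent pair.
    for k, i in enumerate(range(-1, -6, -1)):
        if delta[i] * delta[i - 1] < 0:  # assumption: strict sign change
            inversions += 1
            if first_position is None:
                first_position = k
    return s_delta, inversions, first_position
```

For the lags `[50.0, 51.0, 52.5, 52.0, 53.0, 54.0, 56.0]`, the differences sum to the overall change of the pitch lag across the seven subframes, which is a quick sanity check on formula (2).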
The parameter fcorr is found by
$$f_{corr} = 1 - \frac{\sum_{i=-1}^{-6} \left( \Delta_{dfr}[-i] - s_{\Delta} \right)^{2}}{6 \cdot d_{max}} \tag{3}$$
wherein dmax=231 is the maximum considered pitch lag.
In G.718, a position imax, indicating the maximum absolute difference, is found according to the definition
$$i_{max} = \left\{ \max_{i=-1}^{-6} \left( \mathrm{abs}\left( \Delta_{dfr}[i] \right) \right) \right\}$$
and a ratio for this maximum difference is computed as follows:
$$r_{max} = \frac{5 \cdot \Delta_{dfr}[i_{max}]}{\left( s_{\Delta} - \Delta_{dfr}[i_{max}] \right)} \tag{4}$$
If this ratio is greater than or equal to 5, the algorithm is not sure enough to extrapolate the pitch. In this case, the pitch of the 4th subframe of the last correctly received frame is used for all subframes to be concealed, and the glottal pulse resynchronization is not done.
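The search for imax and the ratio test of formula (4) can be sketched as follows. The sketch is illustrative only: the function names are hypothetical, the differences are assumed to be given as a dict `delta[i]` for i = −1, ..., −6, ties in the maximum search are assumed to be resolved in favour of the most recent subframe, and a zero denominator is assumed to be treated as "do not extrapolate". None of these conventions is taken from the text.

```python
def max_difference_ratio(delta):
    """Return (i_max, r_max) following the definition of imax and formula (4).

    delta maps i = -1, ..., -6 to the pitch lag differences of formula (1).
    """
    indices = range(-1, -7, -1)
    s_delta = sum(delta[i] for i in indices)
    # Position of the maximum absolute difference; max() keeps the first
    # maximum it meets, i.e. the most recent one (assumed tie-break).
    i_max = max(indices, key=lambda i: abs(delta[i]))
    denominator = s_delta - delta[i_max]
    if denominator == 0:
        # Assumption: treat a vanishing denominator as an unreliable contour.
        return i_max, float("inf")
    return i_max, 5.0 * delta[i_max] / denominator


def can_extrapolate(r_max):
    """Extrapolation is attempted only if the ratio is below 5."""
    return r_max < 5
```

When `can_extrapolate` returns `False`, the text prescribes holding the pitch of the 4th subframe of the last good frame for all concealed subframes.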
If rmax is less than 5, then additional processing is conducted to achieve the best possible extrapolation. Three different methods are used to extrapolate the future pitch. To choose between the possible pitch extrapolation algorithms, a deviation parameter fcorr2 is computed, which depends on the factor fcorr and on the position of the maximum pitch variation imax. However, at first, the mean floating pitch difference is modified to remove too large pitch differences from the mean.
If fcorr<0.98 and if imax=3, then the mean fractional pitch difference Δ̄dfr is determined according to the formula:
$$\bar{\Delta}_{dfr} = \frac{s_{\Delta} - \Delta_{dfr}[-4] - \Delta_{dfr}[-5]}{3} \tag{5}$$
to remove the pitch differences related to the transition between two frames.
If fcorr≥0.98 or if imax≠3, the mean fractional pitch difference Δ̄dfr is computed as
$$\bar{\Delta}_{dfr} = \frac{s_{\Delta} - \Delta_{dfr}[i_{max}]}{6} \tag{6}$$
and the maximum floating pitch difference is replaced with this new mean value
$$\Delta_{dfr}[i_{max}] = \bar{\Delta}_{dfr} \tag{7}$$
With this new mean of the floating pitch differences, the normalized deviation fcorr2 is computed as:
$$f_{corr2} = 1 - \frac{\sum_{i=-1}^{I_{sf}} \left( \Delta_{dfr}[i] - \bar{\Delta}_{dfr} \right)^{2}}{I_{sf} \cdot d_{max}} \tag{8}$$
wherein Isf is equal to 4 in the first case and is equal to 6 in the second case.
Depending on this new parameter, a choice is made between the three methods of extrapolating the future pitch:
1. If Δdfr[i] changes sign more than twice (this indicates a high pitch variation), the first sign inversion is in the last good frame (for i<3), and fcorr2>0.945, the extrapolated pitch dext (also denoted as T_ext) is computed as follows:
$$s_{y} = \sum_{i=-1}^{-4} \Delta_{dfr}[i]$$
$$s_{xy} = \Delta_{dfr}[-2] + 2 \cdot \Delta_{dfr}[-3] + 3 \cdot \Delta_{dfr}[-4]$$
$$d_{ext} = \mathrm{round}\left[ d_{fr}[-1] + \left( \frac{7 \cdot s_{y} - 3 \cdot s_{xy}}{10} \right) \right]$$
2. If 0.945<fcorr2<0.99 and Δdfr[i] changes sign at least once, the weighted mean of the fractional pitch differences is employed to extrapolate the pitch. The weighting, fw, of the mean difference is related to the normalized deviation, fcorr2, and to the position of the first sign inversion, and is defined as follows:
$$f_{w} = f_{corr2} \cdot \left( \frac{i_{mem}}{7} \right)$$
The parameter imem of the formula depends on the position of the first sign inversion of Δdfr[i], such that imem=0 if the first sign inversion occurred between the last two subframes of the past frame, imem=1 if the first sign inversion occurred between the 2nd and 3rd subframes of the past frame, and so on. If the first sign inversion is close to the end of the last frame, this means that the pitch variation was less stable just before the lost frame. Thus, the weighting factor applied to the mean will be close to 0 and the extrapolated pitch dext will be close to the pitch of the 4th subframe of the last good frame:
$$d_{ext} = \mathrm{round}\left[ d_{fr}[-1] + 4 \cdot \bar{\Delta}_{dfr} \cdot f_{w} \right]$$
3. Otherwise, the pitch evolution is considered stable and the extrapolated pitch dext is determined as follows:
$$d_{ext} = \mathrm{round}\left[ d_{fr}[-1] + 4 \cdot \bar{\Delta}_{dfr} \right]$$
After this processing, the pitch lag is limited between 34 and 231 (values denote the minimum and the maximum allowed pitch lags).
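The choice between the three extrapolation methods and the final limitation of the pitch lag can be sketched as follows. This is a hedged illustration rather than the G.718 reference code: the function name and all parameter names are hypothetical, fcorr2, the mean difference and the sign-inversion statistics are assumed to be computed beforehand and passed in, each formula is assumed to start from the last pitch lag dfr[−1], and `round` is assumed to mean rounding half up.

```python
import math

D_MIN, D_MAX = 34, 231  # minimum and maximum allowed pitch lags (from the text)


def extrapolate_pitch(d_last, delta, mean_delta, f_corr2,
                      n_inversions, first_inversion_in_last_frame, i_mem):
    """Select one of the three extrapolation methods and limit the result.

    d_last      -- pitch lag of the last subframe before the erasure
    delta       -- dict of differences delta[i], i = -1, ..., -6
    mean_delta  -- (modified) mean pitch difference
    i_mem       -- position of the first sign inversion (0 = most recent)
    """
    if n_inversions > 2 and first_inversion_in_last_frame and f_corr2 > 0.945:
        # Method 1: high pitch variation, fit over the last four differences.
        s_y = sum(delta[i] for i in (-1, -2, -3, -4))
        s_xy = delta[-2] + 2 * delta[-3] + 3 * delta[-4]
        value = d_last + (7 * s_y - 3 * s_xy) / 10
    elif 0.945 < f_corr2 < 0.99 and n_inversions >= 1:
        # Method 2: weighted mean of the pitch differences.
        f_w = f_corr2 * (i_mem / 7)
        value = d_last + 4 * mean_delta * f_w
    else:
        # Method 3: stable pitch evolution.
        value = d_last + 4 * mean_delta
    d_ext = math.floor(value + 0.5)  # assumption: round half up
    return min(max(d_ext, D_MIN), D_MAX)
```

With `i_mem=0` the second method returns the last pitch lag unchanged, which matches the remark that an inversion close to the frame end pulls the extrapolated pitch towards the pitch of the 4th subframe of the last good frame.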
Now, to illustrate another example of extrapolation based pitch reconstruction techniques, G.729.1 is considered (see G.729.1: G.729-based embedded variable bit-rate coder: An 8-32 kbit/s scalable wideband coder bitstream interoperable with g.729, Recommendation ITU-T G.729.1, Telecommunication Standardization Sector of ITU, May 2006).
G.729.1 features a pitch extrapolation approach (see Yang Gao, Pitch prediction for packet loss concealment, European Patent 2 002 427 B1) for the case that no forward error concealment information (e.g., phase information) is decodable. This happens, for example, if two consecutive frames get lost (one superframe consists of four frames which can be either ACELP or TCX20). TCX40 or TCX80 frames are also possible, as are almost all combinations of them.
When one or more frames are lost in a voiced region, previous pitch information is used to reconstruct the current lost frame. The precision of the current estimated pitch may directly influence the phase alignment to the original signal, and it is critical for the reconstruction quality of the current lost frame and the received frame after the lost frame. Using several past pitch lags instead of just copying the previous pitch lag would result in statistically better pitch estimation. In the G.729.1 coder, pitch extrapolation for FEC (FEC=forward error correction) consists of linear extrapolation based on the past five pitch values. The past five pitch values are P(i), for i=0, 1, 2, 3, 4, wherein P(4) is the latest pitch value. The extrapolation model is defined according to:
$$P'(i) = a + i \cdot b \tag{9}$$
The extrapolated pitch value for the first subframe in a lost frame is then defined as:
$$P'(5) = a + 5 \cdot b \tag{10}$$
In order to determine the coefficients a and b, an error E is minimized, wherein the error E is defined according to:
$$E = \sum_{i=0}^{4} \left[ P'(i) - P(i) \right]^{2} = \sum_{i=0}^{4} \left[ (a + b \cdot i) - P(i) \right]^{2} \tag{11}$$
By setting
$$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0 \tag{12}$$
a and b are obtained as:
$$a = \frac{3 \sum_{i=0}^{4} P(i) - \sum_{i=0}^{4} i \cdot P(i)}{5} \quad \text{and} \quad b = \frac{\sum_{i=0}^{4} i \cdot P(i) - 2 \sum_{i=0}^{4} P(i)}{10} \tag{13}$$
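The closed-form coefficients of formula (13) and the extrapolated value P′(5) can be sketched as follows. The function name is hypothetical and the sketch is not taken from the G.729.1 reference code; it only evaluates the least-squares line through the five past pitch values at i=5.

```python
def extrapolate_pitch_linear(p):
    """Least-squares line through P(0..4), evaluated at i = 5.

    p holds the past five pitch values, oldest first, so p[4] is the
    latest pitch value. Returns (a, b, P'(5)).
    """
    if len(p) != 5:
        raise ValueError("expected five past pitch values")
    s0 = sum(p)                                # sum of P(i)
    s1 = sum(i * v for i, v in enumerate(p))   # sum of i * P(i)
    a = (3 * s0 - s1) / 5                      # formula (13)
    b = (s1 - 2 * s0) / 10                     # formula (13)
    return a, b, a + 5 * b                     # P'(5) = a + 5 * b
```

For a perfectly linear contour such as `[50, 52, 54, 56, 58]`, the fit reproduces the line exactly and continues it by one step; for a constant contour the slope b is zero and the last pitch value is simply held.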
In the following, a frame erasure concealment concept of conventional technology for the AMR-WB codec as presented in Xinwen Mu, Hexin Chen, and Yan Zhao, A frame erasure concealment method based on pitch and gain linear prediction for AMR-WB codec, Consumer Electronics (ICCE), 2011 IEEE International Conference on, January 2011, pp. 815-816, is described. This frame erasure concealment concept is based on pitch and gain linear prediction. Said paper proposes a linear pitch inter/extrapolation approach in case of a frame loss, based on a Minimum Mean Square Error Criterion.
According to this frame erasure concealment concept, at the decoder, when the type of the last valid frame before the erased frame (the past frame) is the same as that of the earliest one after the erased frame (the future frame), the pitch P(i) is defined, where i=−N, −N+1, . . . , 0, 1, . . . , N+4, N+5, and where N is the number of past and future subframes of the erased frame. P(1), P(2), P(3), P(4) are the four pitches of the four subframes in the erased frame, P(0), P(−1), . . . , P(−N) are the pitches of the past subframes, and P(5), P(6), . . . , P(N+5) are the pitches of the future subframes. A linear prediction model P′(i)=a+b·i is employed. For i=1, 2, 3, 4, the values P′(1), P′(2), P′(3), P′(4) are the predicted pitches for the erased frame. The MMS Criterion (MMS=Minimum Mean Square) is taken into account to derive the values of the two prediction coefficients a and b according to an interpolation approach. According to this approach, the error E is defined as:
$$E = \sum_{i=-N}^{0} \left[ P'(i) - P(i) \right]^{2} + \sum_{i=5}^{N+5} \left[ P'(i) - P(i) \right]^{2} = \sum_{i=-N}^{0} \left[ a + b \cdot i - P(i) \right]^{2} + \sum_{i=5}^{N+5} \left[ a + b \cdot i - P(i) \right]^{2} \tag{14a}$$
Then, the coefficients a and b can be obtained by calculating
$$\frac{\partial E}{\partial a} = 0 \quad \text{and} \quad \frac{\partial E}{\partial b} = 0 \tag{14b}$$
$$a = \frac{2 \left[ \sum_{i=-N}^{0} P(i) + \sum_{i=5}^{N+5} P(i) \right] \cdot \left( N^{3} + 9N^{2} + 38N + 1 \right)}{(N+1) \cdot \left( 4N^{3} + 36N^{2} + 107N - 1 \right)} \tag{14c}$$
$$b = \frac{9 \left[ \sum_{i=-N}^{0} P(i) + \sum_{i=5}^{N+5} P(i) \right]}{1 - 107N - 36N^{2} - 4N^{3}} \tag{14d}$$
The pitch lags for the last four subframes of the erased frame can be calculated according to:
P′(1) = a + b·1; P′(2) = a + b·2; P′(3) = a + b·3; P′(4) = a + b·4  (14e)
It is found that N=4 provides the best result. N=4 means that five past subframes and five future subframes are used for the interpolation.
However, when the type of the past frames is different from the type of the future frames, for example, when the past frame is voiced but the future frame is unvoiced, just the voiced pitches of the past or the future frames are used to predict the pitches of the erased frame using the above extrapolation approach.
Now, pulse resynchronization in conventional technology is considered, in particular with reference to G.718 and G.729.1. An approach for pulse resynchronization is described in Tommy Vaillancourt, Milan Jelinek, Philippe Gournay, and Redwan Salami, Method and device for efficient frame erasure concealment in speech codecs, U.S. Pat. No. 8,255,207 B2, 2012.
At first, constructing the periodic part of the excitation is described.
For a concealment of erased frames following a correctly received frame other than UNVOICED, the periodic part of the excitation is constructed by repeating the low pass filtered last pitch period of the previous frame.
The construction of the periodic part is done using a simple copy of a low pass filtered segment of the excitation signal from the end of the previous frame.
The pitch period length is rounded to the closest integer:
Tc = round(last_pitch)  (15a)
Considering that the last pitch period length is Tp, then the length of the segment that is copied, Tr, may, e.g., be defined according to:
Tr = ⌊Tp + 0.5⌋  (15b)
The periodic part is constructed for one frame and one additional subframe.
For example, with M subframes in a frame, the subframe length is
$$L\_subfr = \frac{L}{M},$$
wherein L is the frame length, also denoted as L_frame: L = L_frame.
FIG. 3 illustrates a constructed periodic part of a speech signal.
T[0] is the location of the first maximum pulse in the constructed periodic part of the excitation. The positions of the other pulses are given by:
T[i] = T[0] + i Tc  (16a)
corresponding to
T[i] = T[0] + i Tr  (16b)
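The rounding of (15b) and the pulse positions of (16a) may, for example, be sketched in C as follows. This is a minimal illustration and is not taken from G.718; the function names `round_pitch` and `pulse_positions` and the array-based interface are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: Tr = floor(Tp + 0.5), cf. (15b).
   Truncation equals floor here because pitch lags are positive. */
static int round_pitch(double last_pitch)
{
    assert(last_pitch > 0.0);
    return (int)(last_pitch + 0.5);
}

/* Hypothetical helper: fills T[i] = T[0] + i*Tc, cf. (16a), for 0 <= i < count.
   t0 is the location of the first maximum pulse. */
static void pulse_positions(int t0, int tc, int *T, size_t count)
{
    assert(T != NULL && tc > 0);
    for (size_t i = 0; i < count; i++) {
        T[i] = t0 + (int)i * tc;
    }
}
```

With a first pulse at position 10 and Tc = 58, the sketch places the fourth pulse (i = 3) at position 184.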
After the construction of the periodic part of the excitation, the glottal pulse resynchronization is performed to correct the difference between the estimated target position of the last pulse in the lost frame (P), and its actual position in the constructed periodic part of the excitation (T[k]).
The pitch lag evolution is extrapolated based on the pitch lags of the last seven subframes before the lost frame. The evolving pitch lags in each subframe are:
p[i] = round(Tc + (i+1)δ), 0 ≤ i < M  (17a)
where
$$\delta = \frac{T_{ext} - T_c}{M} \qquad (17b)$$
and Text (also denoted as dext) is the extrapolated pitch as described above for dext.
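The evolving pitch lags of (17a) and (17b) may, for example, be computed as in the following C sketch. The function name `evolving_pitch` is hypothetical, and rounding half up is an assumption, since the exact behavior of round() is not specified above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: rounds a positive value to the nearest integer
   (half up; the tie-breaking rule is an assumption). */
static int round_pos(double x)
{
    return (int)(x + 0.5);
}

/* Fills p[i] = round(Tc + (i+1)*delta) for 0 <= i < m, cf. (17a),
   with delta = (Text - Tc) / M, cf. (17b). */
static void evolving_pitch(int tc, double t_ext, int *p, size_t m)
{
    assert(p != NULL && m > 0 && tc > 0 && t_ext > 0.0);
    double delta = (t_ext - (double)tc) / (double)m;
    for (size_t i = 0; i < m; i++) {
        p[i] = round_pos((double)tc + (double)(i + 1) * delta);
    }
}
```

With Tc = 50, Text = 54 and M = 4, δ equals 1 and the sketch yields the lags 51, 52, 53 and 54, so the last subframe reaches the extrapolated pitch.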
The difference, denoted as d, between the sum of the total number of samples within pitch cycles with the constant pitch (Tc) and the sum of the total number of samples within pitch cycles with the evolving pitch, p[i], is found within a frame length. The documentation does not describe how to find d.
In the source code of G.718 (see G.718: Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s, Recommendation ITU-T G.718, Telecommunication Standardization Sector of ITU, June 2008), d is found using the following algorithm (where M is the number of subframes in a frame):
ftmp = p[0];
i = 1;                               /* one pitch cycle is already counted in ftmp */
while (ftmp < L_frame - pit_min) {
  sect = (short)(ftmp*M/L_frame);    /* subframe containing position ftmp */
  ftmp += p[sect];
  i++;
}
d = (short)(i*Tc - ftmp);
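For illustration, the above loop may be wrapped into a self-contained C function as sketched below. The function name `pitch_cycle_difference`, the parameter types and the precondition checks are assumptions and are not part of the G.718 source; the cycle counter is assumed to start at 1, since p[0] already accounts for the first pitch cycle. The numeric values in the example are arbitrary.

```c
#include <assert.h>
#include <stddef.h>

/* Accumulates pitch cycles with the evolving pitch p[] until the sum comes
   within pit_min samples of the frame end, then returns the difference to
   the same number of cycles with the constant pitch Tc. */
static short pitch_cycle_difference(const int *p, int M, int L_frame,
                                    int pit_min, int Tc)
{
    assert(p != NULL && M > 0 && L_frame > pit_min && p[0] > 0);
    float ftmp = (float)p[0];
    int i = 1;                                /* cycles accumulated so far */
    while (ftmp < (float)(L_frame - pit_min)) {
        int sect = (int)(ftmp * (float)M / (float)L_frame); /* subframe index, < M */
        assert(p[sect] > 0);                  /* guards against an endless loop */
        ftmp += (float)p[sect];
        i++;
    }
    return (short)((float)(i * Tc) - ftmp);
}
```

For example, with L_frame = 256, M = 4, pit_min = 34, Tc = 50 and the evolving lags 51, 52, 53, 54, five cycles are accumulated (261 samples in total) and the function returns d = 250 − 261 = −11. With a constant pitch of 50 in every subframe it returns 0.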
The number of pulses in the constructed periodic part within a frame length plus the first pulse in the future frame is N. The documentation does not describe how to find N.
In the source code of G.718 (see G.718: Frame error robust narrow-band and wideband embedded variable bit-rate coding of speech and audio from 8-32 kbit/s, Recommendation ITU-T G.718, Telecommunication Standardization Sector of ITU, June 2008), N is found according to:
$$N = 1 + \left\lfloor \frac{L\_frame}{T_c} \right\rfloor \qquad (18a)$$
The position of the last pulse T[n] in the constructed periodic part of the excitation that belongs to the lost frame is determined by:
$$n = \begin{cases} N-1, & T[N-1] < L\_frame \\ N-2, & T[N-1] \ge L\_frame \end{cases} \qquad (18b)$$
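Equations (18a) and (18b) may, for example, be implemented as in the following C sketch. The function names `pulse_count` and `last_pulse_index` are hypothetical, and the array T is assumed to hold at least N pulse positions.

```c
#include <assert.h>
#include <stddef.h>

/* N = 1 + floor(L_frame / Tc), cf. (18a).
   Integer division floors for positive operands. */
static int pulse_count(int l_frame, int tc)
{
    assert(l_frame > 0 && tc > 0);
    return 1 + l_frame / tc;
}

/* Index n of the last pulse that still belongs to the lost frame, cf. (18b). */
static int last_pulse_index(const int *T, int n_pulses, int l_frame)
{
    assert(T != NULL && n_pulses >= 1);
    return (T[n_pulses - 1] < l_frame) ? n_pulses - 1 : n_pulses - 2;
}
```

For example, with L_frame = 256 and Tc = 50, N equals 6. If T[0] = 10, the sixth pulse lies at position 260, outside the frame, so n = 4; if T[0] = 3, the sixth pulse lies at position 253, inside the frame, so n = 5.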
The estimated last pulse position P is:
P = T[n] + d  (19a)
The actual position of the last pulse, T[k], is the position of the pulse in the constructed periodic part of the excitation (including in the search the first pulse after the current frame) that is closest to the estimated target position P:
∀i: |T[k] − P| ≤ |T[i] − P|, 0 ≤ i < N  (19b)
The glottal pulse resynchronization is conducted by adding or removing samples in the minimum energy regions of the full pitch cycles. The number of samples to be added or removed is determined by the difference:
diff = P − T[k]  (19c)
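The search for the closest pulse and the resulting difference may, for example, be sketched in C as follows. The function names `closest_pulse` and `resync_diff` are hypothetical, and choosing the first pulse in case of a tie is an assumption, since the condition above does not specify a tie-breaking rule.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Returns the index k of the pulse closest to the target position,
   searching 0 <= i < n_pulses; the first pulse wins on ties (assumption). */
static int closest_pulse(const int *T, int n_pulses, int target)
{
    assert(T != NULL && n_pulses >= 1);
    int k = 0;
    for (int i = 1; i < n_pulses; i++) {
        if (abs(T[i] - target) < abs(T[k] - target)) {
            k = i;
        }
    }
    return k;
}

/* Returns P - T[k], where P = T[n] + d is the estimated last pulse position. */
static int resync_diff(const int *T, int n_pulses, int n, int d)
{
    assert(n >= 0 && n < n_pulses);
    int target = T[n] + d;
    int k = closest_pulse(T, n_pulses, target);
    return target - T[k];
}
```

With pulses at 10, 60, 110, 160, 210 and 260, n = 4 and d = −11, the target position is 199, the closest pulse is T[4] = 210, and the difference is −11. With d = 30, the target position is 240, the closest pulse is T[5] = 260, and the difference is −20.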
The minimum energy regions are determined using a sliding 5-sample window. The minimum energy position is set at the middle of the window at which the energy is at a minimum. The search is performed between two pitch pulses from T[i]+Tc/8 to T[i+1]−Tc/4. There are Nmin=n−1 minimum energy regions.
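The sliding-window search may, for example, look like the following C sketch. The function name `min_energy_position` is hypothetical. It is assumed here that the search bounds refer to the window center and that the earliest position is kept when several windows have the same energy; neither detail is stated above.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the center of the 5-sample window with minimum energy.
   Window centers are searched in [from, to]; for the region between two
   pulses the caller would pass from = T[i] + Tc/8 and to = T[i+1] - Tc/4. */
static int min_energy_position(const float *exc, int len, int from, int to)
{
    assert(exc != NULL && from >= 2 && to + 2 < len && from <= to);
    int best = from;
    float best_energy = -1.0f;           /* negative marks "not yet set" */
    for (int c = from; c <= to; c++) {
        float e = 0.0f;
        for (int j = c - 2; j <= c + 2; j++) {
            e += exc[j] * exc[j];        /* energy of samples c-2 .. c+2 */
        }
        if (best_energy < 0.0f || e < best_energy) {
            best_energy = e;
            best = c;
        }
    }
    return best;
}
```

For a signal that is 1.0 everywhere except for zeros at samples 8 to 12, a search over the centers 3 to 16 returns position 10, the middle of the silent stretch.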
If Nmin=1, then there is only one minimum energy region and diff samples are inserted or deleted at that position.
For Nmin>1, fewer samples are added or removed at the beginning of the frame and more towards its end. The number of samples to be removed or added between pulses T[i] and T[i+1] is found using the following recursive relation:
$$R[i] = \operatorname{round}\left(\frac{(i+1)^2}{2}\, f - \sum_{k=0}^{i-1} R[k]\right) \quad \text{with} \quad f = \frac{2\,|diff|}{N_{min}^2} \qquad (19d)$$
If R[i] < R[i−1], then the values of R[i] and R[i−1] are interchanged.
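The recursive relation (19d) may, for example, be evaluated as in the following C sketch. The function name `distribute_samples` is hypothetical. The sketch returns sample counts only (whether samples are added or removed follows from the sign of the difference), and it applies the interchange immediately after each R[i] is computed, which is an assumption; the interchange does not alter the running sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Fills R[0..n_min-1] with the number of samples to add or remove in each
   minimum energy region; the counts sum to |diff|. */
static void distribute_samples(int diff, int n_min, int *R)
{
    assert(R != NULL && n_min >= 1);
    double f = 2.0 * (double)abs(diff) / ((double)n_min * (double)n_min);
    int sum = 0;                                   /* sum of R[0..i-1] */
    for (int i = 0; i < n_min; i++) {
        double x = 0.5 * (double)(i + 1) * (double)(i + 1) * f - (double)sum;
        R[i] = (int)(x + 0.5);                     /* round; x > -0.5 here */
        sum += R[i];
        if (i > 0 && R[i] < R[i - 1]) {            /* keep counts non-decreasing */
            int tmp = R[i];
            R[i] = R[i - 1];
            R[i - 1] = tmp;
        }
    }
}
```

For a difference of −11 and Nmin = 3, f equals 22/9 and the sketch yields R = {1, 4, 6}, which sums to 11 and places most of the correction towards the end of the frame. For Nmin = 1 it yields R[0] = |diff|, in line with the single-region case.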