The International Telecommunication Union (ITU) Recommendation G.729 Annex E describes coding of analogue signals by methods other than PCM. This higher bit-rate extension of G.729 is designed to accommodate a wide range of input signals, such as speech with background noise and music. Annex E introduces a backward LP analysis and two new algebraic excitation codebooks to extend the bit rate. One codebook is used in forward mode and the other in backward mode. Two LP analyses are performed at the same frame rate, one backward on the synthesis signal and one forward on the input signal. An adaptive decision procedure chooses the best filter and switches between filters if needed. The backward/forward decision criterion effectively discriminates between speech (mainly coded in forward mode) and music (mainly coded in backward mode).
The general operation of the G.729 codec is illustrated in FIG. 1, a simplified functional block diagram of the encoding of an audio signal; FIG. 2, a simplified functional block diagram of the decoding of an audio signal; and FIG. 3, a simplified block diagram of the fixed codebook search. First, as illustrated by block 12 in FIG. 1, an audio signal is received in analogue form by a device such as a telephone. The analogue signal is converted to a digital signal and pre-processed 14. The digital signal S has a fixed sample rate, for example 80 samples per 10 ms. The signal S is then encoded as defined by the codec. The signal is passed through an L/P filter 16, which processes the signal both backwards and forwards as detailed below. The L/P filter 16 generates that portion of the codec corresponding to the short-term characteristics of the original audio signal. The signal is then processed to generate the remaining portions of the codec corresponding to the characteristics of the original audio signal.
In accordance with the specifications of the G.729 Annex E codec, the residual portion of the signal is used to generate a series of pulses from which the residual signal is re-created by the decoder. The residual filter relies upon a codebook, FIG. 5, to select the samples to be used for encoding and decoding. In the example above, the signal is divided into 5 ms sub-frames, each consisting of forty samples. Based on the residual signal, the fixed codebook search 20 selects a subset of these samples and generates a series of pulses, each having either a positive or a negative value, corresponding to the selected samples. The decoder relies on these samples to recreate the residual signal. The fixed codebook search algorithm evaluates a number of different groups of selected samples to determine the sample selection that will best recreate the original signal when regenerated by the decoder. The fixed codebook algorithm implements a search procedure that minimizes the mean squared error between the weighted input speech and the reconstructed speech.
The samples can be designated as samples one through forty, as illustrated in FIG. 2. The fixed codebook search algorithm selects the samples to be used based upon the codebook of G.729 Annex E. The fixed codebook search algorithm selects a set of samples, for example samples 0, 5, 10, 15, 20, 25, 30, 35 from track one of the codebook, FIG. 5. The search algorithm processes the input speech based upon these selected samples and creates the code vectors that would be transmitted to the decoder as part of the packetized transmission, FIG. 1.
As illustrated in FIG. 3, the code vectors are also processed within the encoder to reconstruct the signal, and the reconstructed signal is compared to the input speech. The difference between the reconstructed speech and the input speech is measured, quantified and stored in a register 22. This process is repeated for other sample sets from tracks 1 through 5. Once all of the sample sets have been processed and the deviation from the original speech quantified, the register is checked to determine which set of samples produced the minimum difference from the original input speech 23. The set of samples with the minimum difference is encoded into the bit stream.
The structure of the codec and code vectors is illustrated in FIG. 4. Since the LP coefficients are not transmitted in backward mode, the spare bit rate is used to increase the size of the algebraic excitation codebooks. One information bit is needed to indicate the LP mode and is protected by a parity bit. In the extension, all the additional bit rate from 8 kbit/s to 11.8 kbit/s, except two bits (LP indication mode+parity bit), is used to increase the size of the algebraic codebooks. The bit allocation of the coder parameters is shown in the table of FIG. 4.
The backward/forward procedure of G.729 Annex E has also been designed to reduce the number of switches and to perform, when necessary, smooth switching between filters with no artefacts. The LP mode and the related information are used to better adapt postfiltering and perceptual weighting to either music or speech. They are also used for error concealment.
In order to obtain this high quality with music while maintaining robust resistance to transmission errors and avoiding degradation of less stationary signals, especially speech, Annex E of G.729 introduced a new technique called the mixed backward/forward LP structure. A criterion selects the most suitable LP analysis given the stationarity of the input signal and the prediction gains of the backward and forward filters.
For music signals, which are generally very stationary, the LP backward mode is mainly used: the LP analysis is performed on the synthesis signal and the coefficients are not transmitted. This has two benefits. First, the LP order is increased to 30 coefficients, which is far better suited to the complex spectrum of music signals (the 10-coefficient LP filter of forward LP codecs like G.729 is not sufficient for music). Second, the bit rate is better allocated: no bit rate is wasted on successive, very similar LP filters. All of the spare bit rate is used to extend the size of the excitation codebook. An algebraic codebook with 44 bits is used for the fixed codebook excitation. The weak points of pure backward LP analysis mainly concern non-stationary signals with sharp spectrum transitions and the sensitivity to transmission errors. With the mixed LP backward/forward structure, if a spectrum transition occurs, the forward mode is selected and the 10 LP coefficients are coded and transmitted. Even if backward mode is dominant, the transmission of forward LP filters clearly improves the robustness when compared with a pure backward structure.
In forward mode, the encoder is almost identical to G.729 with more bits allocated to the excitation codebooks. An algebraic codebook with thirty five bits is used for the fixed codebook excitation.
When decoding, FIG. 1, the fixed codebook 32 and adaptive codebook 34 are decoded and the signal is processed by the short-term filter 36. Decoding obtains the coder parameters corresponding to a 10 ms speech frame. The first parameter decoded is the LP mode information and its parity bit. According to this information, the frame is classified as forward, backward or erased. In forward mode, the parameters are the LSP coefficients, the two fractional pitch delays, the two forward fixed-codebook vectors, and the two sets of adaptive- and fixed-codebook gains. In backward mode, the parameters are the two fractional pitch delays, the two backward fixed-codebook vectors, and the two sets of adaptive- and fixed-codebook gains. First the LP backward analysis is performed. Then, if the frame is in forward mode, the LSP coefficients are interpolated and converted to LP filter coefficients for each sub-frame. Except for the construction of the fixed-codebook excitation, the decoding procedure is very similar to the G.729 decoding procedure.
Then, for each 5 ms sub-frame, the following steps are performed. First, the excitation is constructed by adding the adaptive- and fixed-codebook vectors scaled by their respective gains. Next, the speech is reconstructed by filtering the excitation through the LP synthesis filter (either forward or backward). Then, the reconstructed speech signal is passed through a post-processing stage 37, which can include an adaptive postfilter based on the long-term and short-term synthesis filters, followed by a high-pass filter and scaling operation. Compared with G.729, the weighting factors of the postfilter have been made adaptive. The speech coding algorithms are bit-exact, fixed-point mathematical operations.
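The per-sub-frame reconstruction described above can be sketched in floating point as follows. This is only an illustration, not the bit-exact fixed-point procedure: the function name, the argument layout and the sign convention A(z) = 1 + Σ a_i z^-i for the LP coefficients are assumptions made for this example.

```python
def reconstruct_subframe(v, c, g_p, g_c, a, memory):
    """Rebuild one sub-frame of speech (illustrative, floating point).

    v, c     : adaptive- and fixed-codebook vectors for the sub-frame
    g_p, g_c : their respective gains
    a        : LP coefficients a_1..a_M, assuming A(z) = 1 + sum_i a_i z^-i
    memory   : last M synthesized samples of the previous sub-frame, oldest first
    """
    # Excitation: u(n) = g_p * v(n) + g_c * c(n)
    excitation = [g_p * vn + g_c * cn for vn, cn in zip(v, c)]
    history = list(memory)  # past output samples of the synthesis filter
    speech = []
    for u in excitation:
        # Synthesis filter 1/A(z): s(n) = u(n) - sum_{i=1..M} a_i * s(n - i)
        s = u - sum(a_i * history[-i] for i, a_i in enumerate(a, start=1))
        history.append(s)
        speech.append(s)
    return excitation, speech
```

The same routine serves both modes; only the length of `a` changes (10 coefficients in forward mode, 30 in backward mode).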
The encoder has several different functions, including:
        Pre-processing.
        Linear prediction analysis and quantization.
        Windowing and autocorrelation computation.
        Levinson Durbin algorithm implementation.
        LP to LSP conversion.
        Quantization of LSP coefficients.
        Interpolation of LP coefficients.
        LSP to LP conversion.
        Backward/forward decision and switching.
        Determination of the global stationarity indicator and high stationarity indicator.
        Perceptual weighting.
        Open-loop pitch analysis.
        Computation of the impulse response.
        Computation of the target signals.
The encoder also implements the adaptive-codebook search, in which the generation of the adaptive-codebook vector, the codeword computation for the delay indices P1 and P2, and the computation of the adaptive-codebook gain are identical to the procedure in G.729. The parity bit P0 is computed on the seven (instead of six in G.729) most significant bits of the delay index P1 of the first sub-frame.
Annex E introduces a fixed codebook structure and search. In the forward LP mode, an algebraic codebook with 35 bits is used as the fixed codebook. In this codebook, each excitation vector contains 10 non-zero pulses. The pulse amplitudes are either −1 or +1. The 40 positions in each sub-frame are divided into 5 tracks where each track contains two pulses. In the design, the two pulses for each track may overlap resulting in a single pulse with amplitude +2 or −2. The allowed positions for pulses are illustrated in FIG. 5.
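As a rough illustration of this track structure, the sketch below lists the positions of each track. It assumes the interleaved layout suggested by the earlier example (track one holding samples 0, 5, ..., 35), so that track t holds positions t, t+5, ..., t+35; the helper name is made up for this example and FIG. 5 remains the authoritative layout.

```python
SUBFRAME = 40  # samples per 5 ms sub-frame
TRACKS = 5     # number of position tracks

def track_positions(t):
    """Positions available to a pulse in track t (assumed layout: t, t+5, ..., t+35)."""
    return list(range(t, SUBFRAME, TRACKS))

for t in range(TRACKS):
    print(t, track_positions(t))

# Each track offers 8 positions (3 bits per pulse position).
positions_per_track = len(track_positions(0))
```

With 8 positions per track, two pulses per track need 2 × 3 position bits plus one sign bit, which gives 5 × 7 = 35 bits for the whole codebook.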
Similar to G.729, the selected codebook vector is filtered through the pre-filter to enhance the harmonic components. The codebook is searched to determine the optimal pulse positions within the sub-frame.
The fixed codebook is searched by minimizing the mean-squared error between the weighted input speech and the weighted reconstructed speech. If ck(n) is the algebraic codevector at index k, h(n) is the impulse response of the weighted synthesis filter, and d(n) is the correlation between the target vector and h(n), then the algebraic codebook is searched by maximizing the criterion:
$$T_k = \frac{(C_k)^2}{E_k}$$

where C is the correlation between ck(n) and d(n) and E is the energy of the filtered codevector (ck(n)*h(n)). Since the algebraic codevector contains few non-zero pulses, the correlation can be written as:

$$C = \sum_{i=0}^{N_p-1} s_i \, d(m_i)$$

where m_i is the position of the ith pulse, s_i is its amplitude, and Np is the number of pulses (Np=10), and the energy in the denominator is given by:

$$E = \sum_{i=0}^{N_p-1} \phi(m_i, m_i) + 2 \sum_{i=0}^{N_p-2} \sum_{j=i+1}^{N_p-1} s_i s_j \, \phi(m_i, m_j)$$

where φ(i,j) contains the correlations between h(n−i) and h(n−j). The signal d(n) and the correlations φ(i,j) are computed before the codebook search.
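The following sketch evaluates the correlation, the energy and the resulting criterion for a given set of pulses. The explicit summation limits used for d(n) and φ(i,j) follow the usual G.729 definitions and are an assumption here, as are the function names; the code is floating point and purely illustrative.

```python
N = 40  # samples per sub-frame

def target_correlation(x, h):
    """d(n) = sum_{i=n}^{N-1} x(i) * h(i - n), for target x and impulse response h."""
    return [sum(x[i] * h[i - n] for i in range(n, N)) for n in range(N)]

def impulse_correlations(h):
    """phi(i, j) = sum_{n=max(i,j)}^{N-1} h(n - i) * h(n - j)."""
    return [[sum(h[n - i] * h[n - j] for n in range(max(i, j), N))
             for j in range(N)] for i in range(N)]

def criterion(pulses, d, phi):
    """Return (C, E, C^2/E) for pulses given as (position m_i, amplitude s_i) pairs."""
    C = sum(s * d[m] for m, s in pulses)
    # Diagonal terms: each pulse has unit magnitude, so s_i^2 = 1.
    E = sum(phi[m][m] for m, _ in pulses)
    # Cross terms between every pair of distinct pulses.
    for i in range(len(pulses) - 1):
        for j in range(i + 1, len(pulses)):
            (mi, si), (mj, sj) = pulses[i], pulses[j]
            E += 2 * si * sj * phi[mi][mj]
    return C, E, C * C / E
```

Because d(n) and φ(i,j) are precomputed, each candidate costs only a handful of table look-ups instead of a full filtering operation.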
Similar to G.729, in order to speed up the search procedure, the pulse amplitudes are pre-set outside the closed-loop search using the so-called signal-selected pulse amplitude approach. In this approach, the most likely amplitude of a pulse occurring at a certain position is estimated using a certain side information signal. In G.729, the signal d(n) is used for pre-selecting the pulse amplitudes. In this bit rate extension, a signal b(n), which is a weighted sum of the normalized d(n) vector and the normalized long-term prediction residual, is used.
The signal b(n) is given by:

b(n) = d(n)/σd + e(n)/σe

where e(n) is the long-term prediction residual and σd and σe are the r.m.s. values of d(n) and e(n), respectively. The sign of a pulse at a certain position is set a priori equal to the sign of b(n) at that position. The sign information is incorporated into the signals d(n) and φ(i,j) before starting the search for the best pulse positions, similar to G.729.
The optimal pulse positions are determined using a non-exhaustive analysis-by-synthesis search procedure. The procedure used is a special case of a general depth-first tree search method, which is efficient for searching huge codebooks with reasonable complexity. In this approach, the Np excitation pulses are partitioned into M subsets of Nm pulses. The search begins with subset 1 and proceeds with subsequent subsets according to a tree structure whereby subset m is searched at the mth level of the tree. The search is repeated by changing the order in which the pulses are assigned to the position tracks. In this particular codebook structure, the pulses are partitioned into 5 subsets of 2 pulses (the tree has 5 levels).
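A minimal sketch of this sign pre-selection is shown below. The way the signs are folded into d(n) and φ(i,j), namely d'(n) = sign(n)·d(n) and φ'(i,j) = sign(i)·sign(j)·φ(i,j), is the usual G.729-style convention and is assumed here; the function names are hypothetical and both d(n) and e(n) are assumed to be non-zero signals.

```python
import math

def rms(x):
    """Root-mean-square value of a sequence."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def preset_signs(d, e):
    """Sign (+1 or -1) of b(n) = d(n)/sigma_d + e(n)/sigma_e at every position."""
    sigma_d, sigma_e = rms(d), rms(e)
    b = [dn / sigma_d + en / sigma_e for dn, en in zip(d, e)]
    return [1 if bn >= 0 else -1 for bn in b]

def fold_signs(d, phi, sign):
    """Incorporate the pre-set signs into d(n) and phi(i, j) (assumed convention)."""
    n = len(d)
    d_signed = [sign[i] * d[i] for i in range(n)]
    phi_signed = [[sign[i] * sign[j] * phi[i][j] for j in range(n)] for i in range(n)]
    return d_signed, phi_signed
```

Once the signs are folded in, the position search can treat every pulse as having amplitude +1, which removes the signs from the inner loop.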
The pulse positions are determined as follows:
For each of the five tracks, the pulse positions with maximum absolute values of d(n) are found. From these, the two successive tracks, Tk0 and T(k0+1) mod 5 with the largest combined maxima are determined. This index k0 is used for the initial assignment of pulses to tracks. Then the two successive tracks, Tk1 and T(k1+1) mod 5 with the second largest combined maxima and the two successive tracks, Tk2 and T(k2+1) mod 5 with the third largest combined maxima are also determined.
In the first iteration, the pulses are assigned to the tracks as follows: the pulses i_n, n=0, . . . , 9, are assigned to tracks T(k0+n) mod 5, n=0, . . . , 9, respectively.
The pulses are searched in subsets of two pulses. The process begins by setting pulse i0 to the maximum of track Tk0 and pulse i1 to the maximum of track T(k0+1) mod 5. The search then proceeds with the pulse pair (i2, i3) by testing all the 8×8 possible position combinations in tracks T(k0+2) mod 5 and T(k0+3) mod 5 (given that pulses i0 and i1 are known). The same procedure is repeated for the rest of the pulse pairs (i4, i5), (i6, i7), and (i8, i9), by testing the 8×8 possible position combinations in their respective tracks. At each level of the tree, the test criterion is computed based only on the available pulses at that level. This results in a total of 4×8×8=256 position combinations tested (since the first pulse pair is set to its track maxima).
Two further iterations are carried out by changing the pulse assignment to tracks (replacing k0 by k1 for the second iteration and k0 by k2 for the third iteration). All 10 initial pulse positions are assigned to tracks T(k1+n) mod 5 in the second iteration and to tracks T(k2+n) mod 5 in the third iteration. The same search procedure described above is repeated for these other two iterations. For the three iterations, the total number of tested position combinations is 3×4×8×8=768.
In order to compute the codeword of the 35-bit fixed codebook, the two pulse positions in each track are encoded with 6 bits and the sign of the first pulse in each track is encoded with one bit. The second pulse sign is implicitly determined based on the order of pulse positions.
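The three-iteration depth-first search can be sketched as below. This is a simplified floating-point illustration, not the reference implementation: it assumes the interleaved track layout (track t holds positions t, t+5, ..., t+35), assumes the pulse signs are already folded into d(n) and φ(i,j), and recomputes the criterion from scratch for each candidate instead of updating it incrementally. The demo data at the end is synthetic and, as a simplification, takes the signs from d(n) rather than from b(n).

```python
import random

N, TRACKS = 40, 5

def track(t):
    """Positions of track t mod 5 (assumed interleaved layout)."""
    return range(t % TRACKS, N, TRACKS)

def score(pos, d, phi):
    """Criterion C^2/E for unit pulses at `pos` (signs already folded into d, phi)."""
    c = sum(d[m] for m in pos)
    e = sum(phi[a][b] for a in pos for b in pos)
    return c * c / e

def search_35bit(d, phi):
    """Return (best positions, number of tested position combinations)."""
    peak = [max(track(t), key=lambda m: abs(d[m])) for t in range(TRACKS)]
    combined = [abs(d[peak[k]]) + abs(d[peak[(k + 1) % TRACKS]]) for k in range(TRACKS)]
    starts = sorted(range(TRACKS), key=lambda k: combined[k], reverse=True)[:3]  # k0, k1, k2
    best_score, best_pos, tested = -1.0, None, 0
    for k in starts:
        # Pulses i0 and i1 are fixed at the maxima of their tracks.
        pos = [peak[k], peak[(k + 1) % TRACKS]]
        for level in range(2, 10, 2):  # pairs (i2,i3), (i4,i5), (i6,i7), (i8,i9)
            candidates = [(a, b) for a in track(k + level) for b in track(k + level + 1)]
            tested += len(candidates)
            pair = max(candidates, key=lambda ab: score(pos + list(ab), d, phi))
            pos.extend(pair)
        final = score(pos, d, phi)
        if final > best_score:
            best_score, best_pos = final, pos
    return best_pos, tested

# Synthetic demo data (not taken from the Recommendation).
random.seed(0)
h = [1.0] + [random.uniform(-0.5, 0.5) * 0.8 ** n for n in range(1, N)]
x = [random.uniform(-1, 1) for _ in range(N)]
d_raw = [sum(x[i] * h[i - n] for i in range(n, N)) for n in range(N)]
phi_raw = [[sum(h[n - i] * h[n - j] for n in range(max(i, j), N)) for j in range(N)] for i in range(N)]
sign = [1 if v >= 0 else -1 for v in d_raw]
d = [sign[n] * d_raw[n] for n in range(N)]
phi = [[sign[i] * sign[j] * phi_raw[i][j] for j in range(N)] for i in range(N)]
positions, tested = search_35bit(d, phi)
print(tested)  # 768
```

The counter confirms the figure in the text: three iterations of four 8×8 pair searches give 768 tested combinations, compared with the roughly 2^30 position combinations an exhaustive search of ten 3-bit positions would require.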
The two pulses in each track (2 positions and 2 signs) are encoded in 7 bits. Each pulse position needs 3 bits (8 possible positions) and each sign needs 1 bit. That is a total of 8 bits for each pair of pulses. However, 1 bit can be reduced considering the fact that about half the position combinations are redundant. For example, placing pulse 1 at position a and pulse 2 at position b is equivalent to placing pulse 1 at position b and pulse 2 at position a (when the signs are not considered). A simple approach of implementing the pulse encoding is to use only 1 bit for the sign information and 6 bits for the two positions, while ordering the positions in a way such that the other sign information can be easily deduced.
To better explain this, assume that the two pulses in a track are located at positions p1 and p2 with sign indices s1 and s2, respectively (s=0 if the sign is positive and s=1 if the sign is negative). The index of the two pulses is given by:

I = (p1/5) + s1×8 + (p2/5)×16
If p1≦p2 then s2=s1; otherwise, s2 is different from s1. Thus, when constructing the codeword, if the two signs are equal, then the smaller position is assigned to p1 and the larger position to p2; otherwise, the larger position is assigned to p1 and the smaller position to p2. This procedure is repeated for each track to obtain five 7-bit indices.
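The pair encoding and its inverse can be sketched as follows. The function names are hypothetical, p/5 is taken as integer division, and track t is assumed to hold positions t, t+5, ..., t+35. Because pulse signs are pre-set per position, two pulses at the same position always share a sign, and the sketch relies on that.

```python
def encode_pair(pa, sa, pb, sb):
    """Pack two pulses of one track into a 7-bit index.

    Positions are absolute (0..39); sign indices are 0 for + and 1 for -.
    """
    low, high = sorted([(pa, sa), (pb, sb)])
    # Equal signs: smaller position first. Different signs: larger position first.
    (p1, s1), (p2, _) = (low, high) if sa == sb else (high, low)
    return (p1 // 5) + s1 * 8 + (p2 // 5) * 16

def decode_pair(index, track):
    """Recover ((p1, s1), (p2, s2)) from a 7-bit index for the given track."""
    p1 = (index & 7) * 5 + track
    s1 = (index >> 3) & 1
    p2 = ((index >> 4) & 7) * 5 + track
    # The second sign is implied by the order of the two positions.
    s2 = s1 if p1 <= p2 else 1 - s1
    return (p1, s1), (p2, s2)
```

For example, a positive pulse at position 5 and a negative pulse at position 30 (both in track 0) have different signs, so the larger position is stored first and the index is 6 + 1×8 + 1×16 = 30.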
The fixed codebook in backward LP mode differs from the forward mode. In the backward LP mode, the 18 bits needed for LP model are not transmitted. Thus, 9 bits are saved every sub-frame, which are used to increase the size of the fixed codebook from 35 to 44 bits. In this 44-bit codebook, each codebook vector contains 12 pulses. The positions in a sub-frame are divided into the same track structure described in Table E.2. However, two more pulses are placed, such that two consecutive tracks can contain three pulses instead of two. The two consecutive tracks containing three pulses will be called triple-pulse tracks and the other three tracks containing two pulses will be called double-pulse tracks.
The pulses in each double-pulse track are encoded with 7 bits (as in the 35-bit codebook) and those in each triple-pulse track are encoded with 10 bits. The index of the first triple-pulse track can have 5 different values (5 tracks). This index needs 3 extra bits. This results in a total of 44 bits (3×7+2×10+3).
The search procedure of the 44-bit codebook is similar to that of the 35-bit codebook, with the exception that the tree now has 6 levels of pulse pairs. The same search procedure described above is followed:
        The same procedure is used for pre-setting the pulse signs.
        The initial tracks Tk and T(k+1) mod 5 are determined in the same manner.
        The 12 pulses i_n, n=0, . . . , 11, are assigned to tracks T(k+n) mod 5, n=0, . . . , 11, respectively.
The pulses are searched in subsets of two pulses, by initially setting pulse i0 to the maximum of track Tk and pulse i1 to the maximum of track T(k+1) mod 5. The search then proceeds with the pulse pair (i2, i3) by testing all the 8×8 possible position combinations in tracks T(k+2) mod 5 and T(k+3) mod 5, and the procedure is repeated for the rest of the pulse pairs (i4, i5), (i6, i7), (i8, i9), and (i10, i11). This now results in a total of 5×8×8=320 position combinations tested.
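The bit counts quoted above can be cross-checked against the 11.8 kbit/s rate with a few lines of arithmetic. In this sketch the field sizes for the LSP coefficients (18 bits), pitch delays (8 and 5 bits), pitch parity (1 bit) and gains (7 bits per sub-frame) are those of base G.729 and are assumed to carry over unchanged; the dictionary keys are labels chosen for this example, and the table of FIG. 4 remains the authoritative allocation.

```python
FRAME_MS = 10  # one frame holds two 5 ms sub-frames

# Fixed-codebook bits per sub-frame, built from the per-track figures in the text.
forward_cb = 5 * 7                 # five double-pulse tracks
backward_cb = 3 * 7 + 2 * 10 + 3   # three double-pulse, two triple-pulse, track index

forward = {
    "LP mode + parity": 2,
    "LSP coefficients": 18,
    "pitch delays + parity": 8 + 1 + 5,
    "fixed codebook": 2 * forward_cb,
    "gains": 2 * 7,
}
backward = {
    "LP mode + parity": 2,
    "pitch delays + parity": 8 + 1 + 5,
    "fixed codebook": 2 * backward_cb,
    "gains": 2 * 7,
}

for name, fields in (("forward", forward), ("backward", backward)):
    bits = sum(fields.values())
    print(name, bits, "bits per frame =", bits / FRAME_MS, "kbit/s")
```

Under these assumptions both modes come to 118 bits per 10 ms frame, so the 18 LSP bits that are not sent in backward mode are exactly absorbed by the two larger 44-bit codebooks.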
Two more iterations are carried out similar to the 35-bit codebook resulting in a total of 3×5×8×8=960 tested positions.
Similar to G.729 and to the 35-bit forward codebook, the selected codebook vector is filtered through the pre-filter P(z) = 1/(1 − βz^(−1)) to enhance the harmonic components.
In computation of the codeword of the 44-bit fixed codebook, the two pulses in each of the three double-pulse tracks are encoded using the same approach described above.
The three pulses in a triple-pulse track are encoded using the same philosophy by adding three bits for the position of the third pulse. The three positions are encoded with 3 bits each and the sign of the first pulse is encoded with 1 bit. The signs of the other two pulses are deduced from the pulse order, similar to the double-pulse tracks. Again, this is best explained with an example. Assume that the three pulses in a triple-pulse track are located at positions p1, p2, and p3 with sign indices s1, s2, and s3, respectively. The index of the three pulses is given by:

I = (p1/5) + s1×8 + (p2/5)×16 + (p3/5)×128
If p1≦p2 then s2=s1; otherwise, s2 is different from s1. Similarly, if p2≦p3 then s3=s2; otherwise, s3 is different from s2. When constructing the codeword, the pulse positions in a track are assigned to p1, p2, and p3 taking this sign relationship into consideration.
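One way to realise this ordering is sketched below: the encoder simply tries the six possible orderings of the three pulses and keeps the first one that satisfies both sign relationships. This brute-force choice, the function names and the interleaved track layout are assumptions made for illustration; the text does not prescribe how the ordering is found. As with the double-pulse tracks, pulses sharing a position are assumed to share a sign.

```python
from itertools import permutations

def encode_triple(pulses):
    """Pack three (position, sign index) pulses of one track into a 10-bit index."""
    for (p1, s1), (p2, s2), (p3, s3) in permutations(pulses):
        # Equal signs must go with non-decreasing positions, different signs
        # with decreasing positions, for both consecutive pairs.
        if (s2 == s1) == (p1 <= p2) and (s3 == s2) == (p2 <= p3):
            return (p1 // 5) + s1 * 8 + (p2 // 5) * 16 + (p3 // 5) * 128
    raise ValueError("no valid ordering (same position with different signs?)")

def decode_triple(index, track):
    """Recover the three (position, sign index) pulses for the given track."""
    p1 = (index & 7) * 5 + track
    s1 = (index >> 3) & 1
    p2 = ((index >> 4) & 7) * 5 + track
    p3 = ((index >> 7) & 7) * 5 + track
    s2 = s1 if p1 <= p2 else 1 - s1
    s3 = s2 if p2 <= p3 else 1 - s2
    return [(p1, s1), (p2, s2), (p3, s3)]
```

A valid ordering always exists under the stated assumption: if all signs agree the positions are sorted in ascending order, and if one pulse differs it is placed either first or last depending on where it lies relative to the other two.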
In total, 5 indices are returned, one for each track. The first index is that of the first triple-pulse track. This index is encoded with 13 bits; 10 for the positions and signs, as explained above, and 3 for the track index (0 to 4). The second index is that of the second triple-pulse track and is encoded with 10 bits. The last three indices are those of the three double-pulse tracks and are encoded with 7 bits each.
The encoder, FIG. 1, then performs the quantization of the gains in accordance with G.729 and performs a memory update.
The decoder, FIG. 1, functions to decode the signal. First the parameters are decoded. The transmitted parameters are listed in FIGS. 6 and 7. FIG. 6 illustrates the transmitted parameter indices in forward mode and FIG. 7 illustrates the transmitted parameter indices in backward mode. The first parameter decoded is the LP mode information and its parity bit. According to this information, the frame is classified as forward, backward or erased. In forward mode, the decoded parameters are the LSP coefficients, the two fractional pitch delays, the two forward fixed-codebook vectors, and the two sets of adaptive- and fixed-codebook gains. In backward mode, the decoded parameters are the two fractional pitch delays, the two backward fixed-codebook vectors, and the two sets of adaptive- and fixed-codebook gains. Then, the LP backward analysis is performed on the past synthesized signal and the decoded parameters are used to compute the reconstructed speech signal as will be described below. This reconstructed signal is enhanced by a post-processing operation consisting of a postfilter, a high-pass filter and an upscaling (see E.4.2). Subclause E.4.4 describes the error concealment procedure used when either a parity error has occurred or the frame erasure flag has been set.
The parameter decoding procedure is similar to G.729, but the number of parameters is greater (more excitation codebook parameters and one LP mode indication parameter). The decoding process is done in the following order.
First, the backward/forward decoding procedure is performed. One bit is used to indicate the LP mode to the decoder: backward or forward. Then, the mode parity bit is compared with this LP mode bit. If these bits are not identical, the frame is considered erased and the procedure described below is applied. Otherwise, according to this LP mode indication, the same switching procedure as described above is performed at the decoder to obtain the LP filter that will be used for the synthesis.
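The frame classification step can be sketched as follows. The mapping of bit values to modes (1 for backward, 0 for forward) and the function name are assumptions made for this illustration; the text only states that the two bits must be identical.

```python
def classify_frame(mode_bit, parity_bit, erasure_flag=False):
    """Classify a frame as 'forward', 'backward' or 'erased'.

    Assumes mode_bit == 1 means backward LP mode (hypothetical convention).
    """
    # A mismatch between the mode bit and its parity copy is treated as an erasure.
    if erasure_flag or mode_bit != parity_bit:
        return "erased"
    return "backward" if mode_bit == 1 else "forward"
```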
Next the high stationarity indicator High_Stat(n) is computed once per frame as described above.
Then another high stationarity indicator High_Stat2 that will be used by the gain attenuation procedure in case of erased frame is computed each sub-frame (see E.4.4.3). If the current sub-frame is at least the 30th of consecutive backward subframes, High_Stat2 is set to 1, else it is set to zero.
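The High_Stat2 rule amounts to a run-length counter over sub-frames, as in the sketch below. The class name is hypothetical, and how erased sub-frames affect the count is not covered here.

```python
class HighStat2Tracker:
    """Tracks the run of consecutive backward sub-frames (illustrative only)."""

    def __init__(self):
        self.run = 0  # consecutive backward sub-frames seen so far

    def update(self, is_backward):
        """Call once per sub-frame; returns the High_Stat2 value (0 or 1)."""
        self.run = self.run + 1 if is_backward else 0
        return 1 if self.run >= 30 else 0
```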
Next the LP parameters are decoded. In any LP mode (backward or forward), and even if the frame is erased, one backward LP analysis per frame is performed, using the same procedures as those performed in the encoder above to obtain the encoder LP backward filter (windowing and autocorrelation computation, Levinson Durbin algorithm).
In forward mode, the same decoding procedure of the LP parameters is applied as in G.729. The interpolation procedure of the LP coefficients is the same as described above.
If one of the previous frames has been erased, the currently computed backward filter Abwd(current) is not used directly but is linearly interpolated with the last “correct” backward filter prior to the interpolation procedure of the LP coefficients.
Before the excitation is reconstructed, the parity bit is recomputed from the adaptive-codebook delay index P1. If this bit is not identical to the transmitted parity bit P0, it is likely that bit errors occurred during transmission. If a parity error occurs on P1, the delay value T1 is replaced by the delay value calculated in the previous sub-frame.
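The parity check on P1 can be sketched as below, using the seven most significant bits of the 8-bit delay index. The polarity of the parity bit (plain XOR rather than its complement) and the function names are assumptions for this example; the bit-exact reference code defines the actual convention.

```python
def parity_msb7(p1):
    """XOR of the seven most significant bits of the 8-bit delay index P1."""
    bits = (p1 >> 1) & 0x7F  # drop the least significant bit
    parity = 0
    while bits:
        parity ^= bits & 1
        bits >>= 1
    return parity

def checked_delay(p1, p0, decoded_delay, previous_delay):
    """Use the decoded delay T1 unless the recomputed parity disagrees with P0."""
    return decoded_delay if parity_msb7(p1) == p0 else previous_delay
```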
The adaptive-codebook vector is decoded the same as G.729. However, the fixed-codebook vector is decoded using the codebook indices. The received codebook indices are used to extract the positions and signs of the pulses. This is done by reversing the process described above for the 35-bit and/or 44-bit codebooks, respectively. Once the pulse positions and signs are decoded, the fixed codebook vector c(n) is constructed by:
$$c(n) = \sum_{i=0}^{N_p-1} s_i \, \delta(n - p_i)$$

where s_i are the pulse signs, p_i are the pulse positions, and Np is the number of pulses (10 or 12). If the integer part of the pitch delay is less than the sub-frame size 40, c(n) is modified similar to equation (48) in G.729.
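Building c(n) from the decoded pulses is a direct transcription of the sum above, as the sketch below shows. The function name is hypothetical, sign indices follow the 0 for positive and 1 for negative convention used for the codeword, and the subsequent pitch-dependent modification is not included.

```python
def build_codevector(pulses, subframe=40):
    """Construct c(n) from decoded (position, sign index) pulses.

    Two pulses at the same position add up to an amplitude of +2 or -2.
    """
    c = [0] * subframe
    for position, sign_index in pulses:
        c[position] += 1 if sign_index == 0 else -1
    return c
```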
The adaptive- and fixed-codebook gains are decoded as described above, the same as G.729. The reconstructed speech is also computed in the same manner. However, the order of the LP filter could be 30 instead of 10.
As in G.729, the post-processing consists of three functions: adaptive postfiltering, high-pass filtering and signal upscaling. The adaptive postfiltering is similar to G.729 postfiltering except for the parameters γp, γn and γd, which have been made adaptive according to the high stationarity indicator High_Stat and the current frame LP mode. After twenty consecutive high stationarity backward frames, postfiltering is no longer applied. The tilt compensation filtering is the same as in G.729, except for the computation of the first parcor, where the length of the impulse response is thirty-two instead of twenty. Adaptive gain control, high-pass filtering and up-scaling are also the same as in G.729.