The present invention relates to encoding and decoding apparatuses for transmitting a speech signal at a low bit rate and, more particularly, to a speech signal decoding method and apparatus for improving the quality of unvoiced speech.
A popular method of encoding a speech signal at low and middle bit rates with high efficiency divides the speech signal into a linear prediction filter and a sound source signal that drives the filter. One typical method is CELP (Code Excited Linear Prediction). CELP obtains a synthesized speech signal (reconstructed signal) by driving a linear prediction filter, whose linear prediction coefficients represent the frequency characteristics of the input speech, with an excitation signal given by the sum of a pitch signal representing the pitch period of the speech and a sound source signal made up of a random number and a pulse. CELP is described in M. Schroeder et al., “Code-excited linear prediction: High-quality speech at very low bit rates”, Proc. of IEEE Int. Conf. on Acoust., Speech and Signal Processing, pp. 937–940, 1985 (reference 1).
Mobile communications such as portable phones require high speech communication quality in noisy environments such as a crowded city street or a moving automobile. Speech coding based on the above-mentioned CELP suffers degraded quality for speech on which noise is superposed (background noise speech). To improve the encoding quality of background noise speech, the gain of the sound source signal is smoothed in the decoder.
A method of smoothing the gain of a sound source signal is described in “Digital Cellular Telecommunication System; Adaptive Multi-Rate Speech Transcoding”, ETSI Technical Report, GSM 06.90 version 2.0.0, January 1999 (reference 2).
FIG. 4 shows an example of a conventional speech signal decoding apparatus for improving the coding quality of background noise speech by smoothing the gain of a sound source signal. A bit stream is input at a period (frame) of Tfr msec (e.g., 20 msec), and a reconstructed vector is calculated at a period (subframe) of Tfr/Nsfr msec (e.g., 5 msec) for an integer Nsfr (e.g., 4). The frame length is given by Lfr samples (e.g., 320 samples), and the subframe length is given by Lsfr samples (e.g., 80 samples). These numbers of samples are determined by the sampling frequency (e.g., 16 kHz) of an input signal. Each block will be described.
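The relation between the frame period, the number of subframes, and the sample counts can be checked with a short calculation. The following is a minimal sketch using the example values given above; the function name and its parameters are illustrative and do not appear in the reference.

```python
def frame_layout(frame_ms=20.0, n_subframes=4, sample_rate_hz=16000):
    """Return (Lfr, Lsfr, subframe period in msec) for the given frame setup.

    The defaults are the example values from the text (Tfr = 20 msec,
    Nsfr = 4, 16 kHz sampling); the function itself is illustrative.
    """
    frame_len = int(round(frame_ms * sample_rate_hz / 1000.0))
    if frame_len % n_subframes != 0:
        raise ValueError("frame length must be a multiple of the subframe count")
    subframe_len = frame_len // n_subframes
    return frame_len, subframe_len, frame_ms / n_subframes


# Example values from the text: 320 samples per frame, 80 per subframe, 5 msec.
lfr, lsfr, tsfr = frame_layout()
```

With the default values this gives Lfr = 320, Lsfr = 80, and a subframe period of 5 msec, matching the figures quoted in the text.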
The code of a bit stream is input from an input terminal 10. A code input circuit 1010 segments the code of the bit stream input from the input terminal 10 into several segments, and converts them into indices corresponding to a plurality of decoding parameters. The code input circuit 1010 outputs an index corresponding to an LSP (Line Spectrum Pair) representing the frequency characteristics of the input signal to an LSP decoding circuit 1020. The circuit 1010 outputs an index corresponding to a delay Lpd representing the pitch period of the input signal to a pitch signal decoding circuit 1210, and an index corresponding to a sound source vector made up of a random number and a pulse to a sound source signal decoding circuit 1110. The circuit 1010 outputs an index corresponding to the first gain to a first gain decoding circuit 1220, and an index corresponding to the second gain to a second gain decoding circuit 1120.
The LSP decoding circuit 1020 has a table which stores a plurality of sets of LSPs. The LSP decoding circuit 1020 receives the index output from the code input circuit 1010, reads an LSP corresponding to the index from the table, and sets the LSP as the LSP $\hat{q}_j^{(N_{sfr})}(n)$, $j=1,\ldots,N_p$ in the $N_{sfr}$th subframe of the current frame (nth frame). $N_p$ is the linear prediction order. The LSPs of the first to $(N_{sfr}-1)$th subframes are obtained by linearly interpolating $\hat{q}_j^{(N_{sfr})}(n)$ and $\hat{q}_j^{(N_{sfr})}(n-1)$. The LSPs $\hat{q}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$ are output to a linear prediction coefficient conversion circuit 1030 and a smoothing coefficient calculation circuit 1310.
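The per-subframe interpolation can be sketched as follows. The text only states that the LSPs are linearly interpolated between the last subframe of the previous frame and the last subframe of the current frame; the uniform weights m/Nsfr used here are an assumption, and the function name is illustrative.

```python
def interpolate_lsp(prev_lsp, curr_lsp, n_subframes=4):
    """Return one LSP vector per subframe m = 1..n_subframes.

    prev_lsp: decoded LSP of the last subframe of frame n-1.
    curr_lsp: decoded LSP of the last subframe of frame n.
    ASSUMPTION: the interpolation weight is m / n_subframes; the actual
    weights are not given in the text.
    """
    if len(prev_lsp) != len(curr_lsp):
        raise ValueError("LSP vectors must have the same order")
    per_subframe = []
    for m in range(1, n_subframes + 1):
        w = m / n_subframes  # weight of the current frame's LSP
        per_subframe.append(
            [(1.0 - w) * p + w * c for p, c in zip(prev_lsp, curr_lsp)]
        )
    return per_subframe


# The last subframe reproduces the decoded LSP of the current frame.
lsps = interpolate_lsp([0.0, 1.0], [1.0, 2.0])
```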
The linear prediction coefficient conversion circuit 1030 receives the LSPs $\hat{q}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$ output from the LSP decoding circuit 1020. The linear prediction coefficient conversion circuit 1030 converts the received $\hat{q}_j^{(m)}(n)$ into linear prediction coefficients $\hat{\alpha}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$, and outputs $\hat{\alpha}_j^{(m)}(n)$ to a synthesis filter 1040. Conversion of the LSP into the linear prediction coefficient can adopt a known method, e.g., a method described in Section 5.2.4 of reference 2.
The sound source signal decoding circuit 1110 has a table which stores a plurality of sound source vectors. The sound source signal decoding circuit 1110 receives the index output from the code input circuit 1010, reads a sound source vector corresponding to the index from the table, and outputs the vector to a second gain circuit 1130.
The second gain decoding circuit 1120 has a table which stores a plurality of gains. The second gain decoding circuit 1120 receives the index output from the code input circuit 1010, reads a second gain corresponding to the index from the table, and outputs the second gain to a smoothing circuit 1320.
The second gain circuit 1130 receives the first sound source vector output from the sound source signal decoding circuit 1110 and the second gain output from the smoothing circuit 1320, multiplies the first sound source vector and the second gain to decode a second sound source vector, and outputs the decoded second sound source vector to an adder 1050.
A storage circuit 1240 receives and holds an excitation vector from the adder 1050. The storage circuit 1240 outputs an excitation vector which was input and has been held to the pitch signal decoding circuit 1210.
The pitch signal decoding circuit 1210 receives the past excitation vector held by the storage circuit 1240 and the index output from the code input circuit 1010. The index designates the delay Lpd. From the past excitation vector, the pitch signal decoding circuit 1210 extracts a vector of Lsfr samples, corresponding to the vector length, beginning at the point Lpd samples before the start point of the current frame, and thereby decodes a first pitch signal (vector). For Lpd<Lsfr, the circuit 1210 extracts a vector of Lpd samples and repeatedly concatenates the extracted Lpd samples to decode the first pitch vector having a vector length of Lsfr samples. The pitch signal decoding circuit 1210 outputs the first pitch vector to a first gain circuit 1230.
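The extraction and repetition described above can be sketched as follows. This is a minimal sketch that assumes the held excitation buffer ends exactly at the start of the vector being decoded and that the delay is an integer number of samples; the function and parameter names are illustrative.

```python
def decode_pitch_vector(past_excitation, delay, subframe_len):
    """Build the first pitch vector from the past excitation.

    past_excitation: list of past samples, newest sample last
                     (ASSUMPTION: the buffer ends where the new vector starts).
    delay:           the delay Lpd in samples (integer delay assumed).
    subframe_len:    the vector length Lsfr in samples.
    """
    if delay <= 0 or delay > len(past_excitation):
        raise ValueError("delay must lie within the stored excitation")
    start = len(past_excitation) - delay
    if delay >= subframe_len:
        # Enough past samples: copy Lsfr samples starting Lpd samples back.
        return list(past_excitation[start:start + subframe_len])
    # Lpd < Lsfr: take the last Lpd samples and repeat them to fill Lsfr samples.
    segment = past_excitation[start:]
    return [segment[i % delay] for i in range(subframe_len)]


history = [1, 2, 3, 4, 5, 6]
long_delay = decode_pitch_vector(history, delay=5, subframe_len=3)   # [2, 3, 4]
short_delay = decode_pitch_vector(history, delay=2, subframe_len=5)  # [5, 6, 5, 6, 5]
```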
The first gain decoding circuit 1220 has a table which stores a plurality of gains. The first gain decoding circuit 1220 receives the index output from the code input circuit 1010, reads a first gain corresponding to the index, and outputs the first gain to the first gain circuit 1230.
The first gain circuit 1230 receives the first pitch vector output from the pitch signal decoding circuit 1210 and the first gain output from the first gain decoding circuit 1220, multiplies the first pitch vector and the first gain to generate a second pitch vector, and outputs the generated second pitch vector to the adder 1050.
The adder 1050 receives the second pitch vector output from the first gain circuit 1230 and the second sound source vector output from the second gain circuit 1130, adds them, and outputs the sum as an excitation vector to the synthesis filter 1040.
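The operations of the two gain circuits, the adder 1050, and the storage circuit 1240 amount to a scaled sum followed by a buffer update. The following is a minimal sketch; the function names and the buffer size `max_len` are hypothetical and are not taken from the reference.

```python
def build_excitation(pitch_vec, first_gain, source_vec, second_gain):
    """Excitation = first_gain * pitch vector + second_gain * sound source vector."""
    if len(pitch_vec) != len(source_vec):
        raise ValueError("vectors must have the same length")
    return [first_gain * p + second_gain * c
            for p, c in zip(pitch_vec, source_vec)]


def store_excitation(history, excitation, max_len):
    """Append the new excitation and keep only the newest max_len samples.

    max_len is a hypothetical buffer size; it must cover the largest delay Lpd.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    combined = list(history) + list(excitation)
    return combined[-max_len:]


excitation = build_excitation([1.0, 2.0], 0.5, [4.0, -4.0], 0.25)  # [1.5, 0.0]
history = store_excitation([0.0, 0.0, 0.0], excitation, max_len=4)
```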
The smoothing coefficient calculation circuit 1310 receives the LSP $\hat{q}_j^{(m)}(n)$ output from the LSP decoding circuit 1020, and calculates an average LSP $\bar{q}_{0j}(n)$:
$$\bar{q}_{0j}(n) = 0.84\cdot\bar{q}_{0j}(n-1) + 0.16\cdot\hat{q}_j^{(N_{sfr})}(n)$$
The smoothing coefficient calculation circuit 1310 calculates an LSP variation amount $d_0(m)$ for each subframe m:
$$d_0(m) = \sum_{j=1}^{N_p} \frac{\left|\bar{q}_{0j}(n) - \hat{q}_j^{(m)}(n)\right|}{\bar{q}_{0j}(n)}$$
The smoothing coefficient calculation circuit 1310 calculates a smoothing coefficient $k_0(m)$ of the subframe m:
$$k_0(m) = \min(0.25,\ \max(0,\ d_0(m) - 0.4))/0.25$$
where min(x,y) is a function returning the smaller of x and y, and max(x,y) is a function returning the larger of x and y. The smoothing coefficient calculation circuit 1310 outputs the smoothing coefficient $k_0(m)$ to the smoothing circuit 1320.
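The three steps performed by the smoothing coefficient calculation circuit 1310 can be sketched as follows. The sketch assumes that the variation amount is the sum over j of the absolute difference between the average LSP and the subframe LSP, normalized by the average LSP; the function names are illustrative.

```python
def update_average_lsp(avg_prev, lsp_last_subframe):
    """Average LSP of frame n from the average of frame n-1 and the
    decoded LSP of the last subframe of frame n (weights 0.84 and 0.16)."""
    return [0.84 * a + 0.16 * q for a, q in zip(avg_prev, lsp_last_subframe)]


def lsp_variation(avg_lsp, lsp_m):
    """LSP variation amount d0(m) of subframe m.

    ASSUMPTION: normalized absolute difference summed over the LSP order.
    """
    return sum(abs(a - q) / a for a, q in zip(avg_lsp, lsp_m))


def smoothing_coefficient(d0):
    """k0(m) = min(0.25, max(0, d0(m) - 0.4)) / 0.25, which lies in [0, 1]."""
    return min(0.25, max(0.0, d0 - 0.4)) / 0.25


avg = update_average_lsp([0.5, 1.0], [0.5, 1.0])
d0 = lsp_variation([0.5, 1.0], [0.25, 1.5])  # 0.25/0.5 + 0.5/1.0
k0 = smoothing_coefficient(d0)
```

Under this formula k0(m) is 0 while d0(m) is at most 0.4 and saturates at 1 once d0(m) reaches 0.65.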
The smoothing circuit 1320 receives the smoothing coefficient $k_0(m)$ output from the smoothing coefficient calculation circuit 1310 and the second gain output from the second gain decoding circuit 1120. The smoothing circuit 1320 calculates an average gain $\bar{g}_0(m)$ from a second gain $\hat{g}_0(m)$ of the subframe m by
$$\bar{g}_0(m) = \frac{1}{5}\sum_{i=0}^{4} \hat{g}_0(m-i)$$
The second gain $\hat{g}_0(m)$ is replaced by
$$\hat{g}_0(m) = \hat{g}_0(m)\cdot k_0(m) + \bar{g}_0(m)\cdot(1-k_0(m))$$
The smoothing circuit 1320 outputs the second gain ĝ0(m) to the second gain circuit 1130.
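The processing of the smoothing circuit 1320 can be sketched as a five-tap moving average followed by a weighted mix. The text does not say whether the average is taken over the decoded gains or over the already replaced gains, nor how the history is initialized; this sketch assumes decoded gains and a history initialized to a caller-supplied value. The class name is illustrative.

```python
from collections import deque


class GainSmoother:
    """Smooths the second gain over the current and four preceding subframes."""

    def __init__(self, history_len=5, initial_gain=0.0):
        # ASSUMPTION: history starts filled with initial_gain.
        self._history = deque([initial_gain] * history_len, maxlen=history_len)

    def smooth(self, decoded_gain, k0):
        """Return decoded_gain * k0 + average_gain * (1 - k0)."""
        # ASSUMPTION: the average uses decoded (unsmoothed) gains.
        self._history.append(decoded_gain)
        average_gain = sum(self._history) / len(self._history)
        return decoded_gain * k0 + average_gain * (1.0 - k0)


smoother = GainSmoother()
for gain in [1.0, 1.0, 1.0, 1.0]:
    smoother.smooth(gain, k0=1.0)              # k0 = 1 passes the gain through
fully_smoothed = smoother.smooth(6.0, k0=0.0)  # k0 = 0 returns the average
```

With k0(m) = 1 the decoded gain passes through unchanged, and with k0(m) = 0 it is replaced entirely by the five-subframe average.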
The synthesis filter 1040 receives the excitation vector output from the adder 1050 and the linear prediction coefficients $\alpha_i$, $i=1,\ldots,N_p$ output from the linear prediction coefficient conversion circuit 1030. The synthesis filter 1040 calculates a reconstructed vector by driving the synthesis filter 1/A(z), in which the linear prediction coefficients are set, with the excitation vector. Then, the synthesis filter 1040 outputs the reconstructed vector from an output terminal 20. Letting $\alpha_i$, $i=1,\ldots,N_p$ be the linear prediction coefficients, the transfer function 1/A(z) of the synthesis filter is given by
$$1/A(z) = 1\Big/\left(1 - \sum_{i=1}^{N_p} \alpha_i z^{-i}\right)$$
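Driving the synthesis filter with the excitation vector corresponds to a simple recursion in the time domain. The following is a minimal sketch assuming the usual causal all-pole convention, in which each output sample equals the excitation sample plus a weighted sum of the Np preceding output samples; the function name and the explicit filter state are illustrative.

```python
def synthesize(excitation, lpc, state=None):
    """All-pole synthesis: s(n) = e(n) + sum_{i=1..Np} lpc[i-1] * s(n-i).

    lpc:   linear prediction coefficients alpha_1..alpha_Np.
    state: previous outputs, newest first (s(n-1), s(n-2), ...), so that
           filtering can continue across subframes; zeros if omitted.
    Returns the reconstructed vector and the updated state.
    """
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order
    reconstructed = []
    for e in excitation:
        s = e + sum(a * h for a, h in zip(lpc, history))
        reconstructed.append(s)
        history = ([s] + history)[:order]  # shift the newest output in
    return reconstructed, history


# First-order example: the impulse response decays by 0.5 per sample.
out, state = synthesize([1.0, 0.0, 0.0], [0.5])
```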
FIG. 5 shows the arrangement of a speech signal encoding apparatus in a conventional speech signal encoding/decoding apparatus. A first gain circuit 1230, second gain circuit 1130, adder 1050, and storage circuit 1240 are the same as the blocks described in the conventional speech signal decoding apparatus in FIG. 4, and a description thereof will be omitted.
An input signal (input vector) generated by sampling a speech signal and combining a plurality of samples as one frame into one vector is input from an input terminal 30. A linear prediction coefficient calculation circuit 5510 receives the input vector from the input terminal 30. The linear prediction coefficient calculation circuit 5510 performs linear prediction analysis for the input vector to obtain a linear prediction coefficient. Linear prediction analysis is described in Chapter 8 “Linear Predictive Coding of Speech” of reference 4.
The linear prediction coefficient calculation circuit 5510 outputs the linear prediction coefficient to an LSP conversion/quantization circuit 5520.
The LSP conversion/quantization circuit 5520 receives the linear prediction coefficient output from the linear prediction coefficient calculation circuit 5510, converts the linear prediction coefficient into LSP, and quantizes the LSP to attain the quantized LSP. Conversion of the linear prediction coefficient into the LSP can adopt a known method, e.g., a method described in Section 5.2.4 of reference 2.
Quantization of the LSP can adopt a method described in Section 5.2.5 of reference 2. As described for the LSP decoding circuit of FIG. 4 (prior art), the quantized LSP is the quantized LSP $\hat{q}_j^{(N_{sfr})}(n)$, $j=1,\ldots,N_p$ in the $N_{sfr}$th subframe of the current frame (nth frame). The quantized LSPs of the first to $(N_{sfr}-1)$th subframes are obtained by linearly interpolating $\hat{q}_j^{(N_{sfr})}(n)$ and $\hat{q}_j^{(N_{sfr})}(n-1)$. The LSP is the LSP $q_j^{(N_{sfr})}(n)$, $j=1,\ldots,N_p$ in the $N_{sfr}$th subframe of the current frame (nth frame). The LSPs of the first to $(N_{sfr}-1)$th subframes are obtained by linearly interpolating $q_j^{(N_{sfr})}(n)$ and $q_j^{(N_{sfr})}(n-1)$.
The LSP conversion/quantization circuit 5520 outputs the LSP $q_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$, and the quantized LSP $\hat{q}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$ to a linear prediction coefficient conversion circuit 5030, and an index corresponding to the quantized LSP $\hat{q}_j^{(N_{sfr})}(n)$, $j=1,\ldots,N_p$ to a code output circuit 6010.
The linear prediction coefficient conversion circuit 5030 receives the LSP $q_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$, and the quantized LSP $\hat{q}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$ output from the LSP conversion/quantization circuit 5520. The circuit 5030 converts $q_j^{(m)}(n)$ into a linear prediction coefficient $\alpha_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$, and $\hat{q}_j^{(m)}(n)$ into a quantized linear prediction coefficient $\hat{\alpha}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$. The linear prediction coefficient conversion circuit 5030 outputs $\alpha_j^{(m)}(n)$ to the weighting filter 5050 and the weighting synthesis filter 5040, and $\hat{\alpha}_j^{(m)}(n)$ to the weighting synthesis filter 5040. Conversion of the LSP into the linear prediction coefficient and conversion of the quantized LSP into the quantized linear prediction coefficient can adopt a known method, e.g., a method described in Section 5.2.4 of reference 2.
The weighting filter 5050 receives the input vector from the input terminal 30 and the linear prediction coefficient output from the linear prediction coefficient conversion circuit 5030, and generates a weighting filter W(z) corresponding to the human sense of hearing using the linear prediction coefficient. The weighting filter is driven by the input vector to obtain a weighted input vector. The weighting filter 5050 outputs the weighted input vector to a subtractor 5060. The transfer function W(z) of the weighting filter 5050 is given by $W(z)=Q(z/\gamma_1)/Q(z/\gamma_2)$. Note that
$$Q(z/\gamma_1) = 1 - \sum_{i=1}^{N_p} \alpha_i^{(m)} \gamma_1^i z^{-i}$$
and
$$Q(z/\gamma_2) = 1 - \sum_{i=1}^{N_p} \alpha_i^{(m)} \gamma_2^i z^{-i}$$
where $\gamma_1$ and $\gamma_2$ are constants, e.g., $\gamma_1=0.9$ and $\gamma_2=0.6$. Details of the weighting filter are described in reference 1.
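The weighting filter can be sketched as a pole-zero filter whose numerator and denominator coefficients are the linear prediction coefficients scaled by powers of the two constants. This is a minimal sketch assuming the usual causal convention (delays of i samples for the ith coefficient) and zero initial filter state; the function names are illustrative.

```python
def weighted_coeffs(lpc, gamma):
    """Scale the ith coefficient by gamma**i (i = 1..Np)."""
    return [a * gamma ** i for i, a in enumerate(lpc, start=1)]


def weighting_filter(x, lpc, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = Q(z/gamma1) / Q(z/gamma2) to the input vector x.

    y(n) = x(n) - sum_i num[i] * x(n-i) + sum_i den[i] * y(n-i)
    ASSUMPTION: the filter state is zero at the start of the vector.
    """
    num = weighted_coeffs(lpc, gamma1)  # numerator taps of Q(z/gamma1)
    den = weighted_coeffs(lpc, gamma2)  # denominator taps of Q(z/gamma2)
    order = len(lpc)
    x_hist = [0.0] * order  # past inputs, newest first
    y_hist = [0.0] * order  # past outputs, newest first
    out = []
    for xn in x:
        yn = (xn
              - sum(b * h for b, h in zip(num, x_hist))
              + sum(a * h for a, h in zip(den, y_hist)))
        out.append(yn)
        x_hist = ([xn] + x_hist)[:order]
        y_hist = ([yn] + y_hist)[:order]
    return out


weighted = weighting_filter([1.0, 0.0, 0.0], [0.5])
```

When the two constants are equal, the numerator and denominator cancel and the filter passes the input unchanged, which provides a simple check of an implementation.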
The weighting synthesis filter 5040 receives the excitation vector output from the adder 1050, and the linear prediction coefficient $\alpha_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$, and the quantized linear prediction coefficient $\hat{\alpha}_j^{(m)}(n)$, $j=1,\ldots,N_p$, $m=1,\ldots,N_{sfr}$ that are output from the linear prediction coefficient conversion circuit 5030. A weighting synthesis filter $H(z)W(z)=Q(z/\gamma_1)/[A(z)Q(z/\gamma_2)]$ having $\alpha_j^{(m)}(n)$ and $\hat{\alpha}_j^{(m)}(n)$ as coefficients is driven by the excitation vector to obtain a weighted reconstructed vector. The transfer function H(z)=1/A(z) of the synthesis filter is given by
$$1/A(z) = 1\Big/\left(1 - \sum_{i=1}^{N_p} \hat{\alpha}_i^{(m)} z^{-i}\right).$$
The subtractor 5060 receives the weighted input vector output from the weighting filter 5050 and the weighted reconstructed vector output from the weighting synthesis filter 5040, calculates their difference, and outputs it as a difference vector to a minimizing circuit 5070.
The minimizing circuit 5070 sequentially outputs all indices corresponding to sound source vectors stored in a sound source signal generation circuit 5110 to the sound source signal generation circuit 5110. The minimizing circuit 5070 sequentially outputs indices corresponding to all delays Lpd within a range defined by a pitch signal generation circuit 5210 to the pitch signal generation circuit 5210. The minimizing circuit 5070 sequentially outputs indices corresponding to all first gains stored in a first gain generation circuit 6220 to the first gain generation circuit 6220, and indices corresponding to all second gains stored in a second gain generation circuit 6120 to the second gain generation circuit 6120.
The minimizing circuit 5070 sequentially receives difference vectors output from the subtractor 5060, calculates their norms, selects a sound source vector, delay Lpd, and first and second gains that minimize the norm, and outputs corresponding indices to the code output circuit 6010. The pitch signal generation circuit 5210, sound source signal generation circuit 5110, first gain generation circuit 6220, and second gain generation circuit 6120 sequentially receive indices output from the minimizing circuit 5070.
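The search performed by the minimizing circuit 5070 can be sketched as an exhaustive loop over candidate indices that keeps the combination with the smallest difference norm. The sketch below searches all four parameters jointly and uses the squared norm; both choices are assumptions made for brevity, and practical coders commonly search the parameters one after another to limit the computation. All names are illustrative, and `weighted_synthesis` stands in for the weighting synthesis filter 5040.

```python
from itertools import product


def search_parameters(target, source_vectors, pitch_vectors,
                      first_gains, second_gains, weighted_synthesis):
    """Return ((source index, delay, first gain index, second gain index), error).

    target:             weighted input vector.
    source_vectors:     list of candidate sound source vectors.
    pitch_vectors:      dict mapping each candidate delay Lpd to its pitch vector.
    weighted_synthesis: callable mapping an excitation vector to a
                        weighted reconstructed vector (hypothetical stand-in).
    """
    best_indices = None
    best_error = float("inf")
    candidates = product(enumerate(source_vectors), pitch_vectors.items(),
                         enumerate(first_gains), enumerate(second_gains))
    for (c_idx, c), (delay, p), (g1_idx, g1), (g2_idx, g2) in candidates:
        excitation = [g1 * pv + g2 * cv for pv, cv in zip(p, c)]
        reconstructed = weighted_synthesis(excitation)
        # Squared norm of the difference vector (ASSUMPTION: squared L2 norm).
        error = sum((t - r) ** 2 for t, r in zip(target, reconstructed))
        if error < best_error:
            best_error = error
            best_indices = (c_idx, delay, g1_idx, g2_idx)
    return best_indices, best_error


# Toy example with an identity filter in place of the weighting synthesis filter.
indices, error = search_parameters(
    target=[1.0, 1.0],
    source_vectors=[[1.0, 0.0], [0.0, 1.0]],
    pitch_vectors={20: [1.0, 1.0], 21: [1.0, -1.0]},
    first_gains=[0.0, 1.0],
    second_gains=[0.5, 2.0],
    weighted_synthesis=lambda e: e,
)
```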
The pitch signal generation circuit 5210, sound source signal generation circuit 5110, first gain generation circuit 6220, and second gain generation circuit 6120 are the same as the pitch signal decoding circuit 1210, sound source signal decoding circuit 1110, first gain decoding circuit 1220, and second gain decoding circuit 1120 in FIG. 4 except for input/output connections, and a detailed description of these blocks will be omitted.
The code output circuit 6010 receives an index corresponding to the quantized LSP output from the LSP conversion/quantization circuit 5520, and indices corresponding to the sound source vector, delay Lpd, and first and second gains that are output from the minimizing circuit 5070. The code output circuit 6010 converts these indices into a bit stream code, and outputs it via an output terminal 40.
The first problem is that sound different from normal voiced speech is generated in short unvoiced speech intermittently contained in the voiced speech or part of the voiced speech. As a result, discontinuous sound is generated in the voiced speech. This is because the LSP variation amount d0(m) decreases in the short unvoiced speech to increase the smoothing coefficient. Since d0(m) greatly varies over time, d0(m) exhibits a large value to a certain degree in part of the voiced speech, but the smoothing coefficient does not become 0.
The second problem is that the smoothing coefficient abruptly changes in unvoiced speech. As a result, discontinuous sound is generated in the unvoiced speech. This is because the smoothing coefficient is determined using d0(m) which greatly varies over time.
The third problem is that proper smoothing processing corresponding to the type of background noise cannot be selected. As a result, the decoding quality degrades. This is because the decoding parameter is smoothed by a single algorithm in which only the parameter settings are varied.