1. Field of the Invention
The present invention relates generally to a coding and decoding technique for transmitting speech signals at a low bit rate, and more particularly to a decoding method and a decoding apparatus for improving sound quality in an environment where noise exists.
2. Description of the Prior Art
Methods of coding a speech signal by separating the speech signal into a linear prediction filter and its driving excitation signal (also referred to as excitation signal or excitation vector) are widely used as a method of efficiently coding a speech signal at an intermediate or low bit rate. One typical method thereof is CELP (Code Excited Linear Prediction). In CELP, an excitation signal (excitation vector) drives a linear prediction filter for which a linear prediction coefficient representing frequency characteristics of input speech is set, thereby obtaining a synthesized speech signal (reproduced speech, reproduced vector). The excitation signal is represented by the sum of a pitch signal (pitch vector) representing a pitch period of speech and a sound source signal (sound source vector) comprising random numbers or pulses. In this case, each of the pitch signal and the sound source signal is multiplied by a gain (i.e., pitch gain and sound source gain). For CELP, reference can be made to M. Schroeder et al., “Code excited linear prediction: High quality speech at very low bit rates”, Proc. of IEEE Int. Conf. on Acoust., Speech and Signal processing, pp. 937–940, 1985 (Literature 1).
Mobile communication systems such as cellular phone systems require favorable speech quality in noisy environments, typified by the bustle of a downtown street or the interior of a moving car. However, speech coding techniques based on CELP suffer significant deterioration of sound quality for speech on which noise is superimposed, that is, speech with background noise. A time period in a speech signal under a noisy environment is referred to as a noise period.
For improving the quality of coded speech from the speech with background noise, a method of smoothing the sound source gain at a decoder has been proposed. In this method, the smoothing of the sound source gain causes a smooth change with time in short time average power of the sound source signal multiplied by the sound source gain, resulting in a smoothed change with time in short time average power of the excitation signal as well. This leads to mitigation of significant variations in short time average power in decoded noise, which is one of factors for degradation, thereby improving the sound quality.
For a method of smoothing gain in the sound source signal, reference can be made, for example, to Section 6.1 of “Digital Cellular Telecommunication System; Adaptive Multi-Rate Speech Transcoding”, ETSI Technical Report, GSM 06.90, version 2.0.0 (Literature 2).
FIG. 1 is a block diagram showing an example of a configuration of a conventional speech signal decoding apparatus, and illustrates a technique of improving the coding quality of speech with background noise by smoothing the gain of a sound source signal. Assume herein that bit sequences are inputted at a frame period of Tfr (for example, 20 milliseconds), and reproduced vectors are calculated at a subframe period of (Tfr/Nsfr) (for example, 5 milliseconds), where Nsfr is an integer (for example, 4). A frame length is Lfr samples (for example, 320 samples), and a subframe length is Lsfr samples (for example, 80 samples). These numbers of samples apply when the sampling frequency of the input signals is 16 kHz. The speech signal decoding apparatus shown in FIG. 1 is described below.
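The relation between the example frame period, sampling frequency and sample counts given above can be checked with a short calculation. The following is a minimal sketch; the function name `frame_layout` and its parameters are illustrative assumptions and do not appear in the original description.

```python
def frame_layout(fs_hz: int, frame_ms: int, n_subframes: int) -> tuple[int, int]:
    """Return (frame length, subframe length) in samples.

    fs_hz       -- sampling frequency in Hz (16 kHz in the example)
    frame_ms    -- frame period Tfr in milliseconds (20 ms in the example)
    n_subframes -- number of subframes per frame, Nsfr (4 in the example)
    """
    frame_len = fs_hz * frame_ms // 1000
    if frame_len % n_subframes != 0:
        raise ValueError("frame length must be divisible by the number of subframes")
    return frame_len, frame_len // n_subframes


# Example values taken from the description: 16 kHz, 20 ms frames, 4 subframes.
L_FR, L_SFR = frame_layout(16_000, 20, 4)
```

With the example values, the frame length is 320 samples and the subframe length is 80 samples, matching the figures stated in the text.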
Bit sequences of coded data are supplied from input terminal 10. Code input circuit 1010 divides and converts the bit sequences supplied from input terminal 10 to indexes corresponding to a plurality of decoding parameters. Code input circuit 1010 provides an index corresponding to an LSP (Line Spectrum Pair) representing the frequency characteristic of the input signal to LSP decoding circuit 1020, an index corresponding to delay representing the pitch period of the input signal to pitch signal decoding circuit 1210, an index corresponding to a sound source vector including random numbers or pulses to sound source signal decoding circuit 1110, an index corresponding to a first gain to first gain decoding circuit 1220, and an index corresponding to a second gain to second gain decoding circuit 1120.
LSP decoding circuit 1020 contains a table in which plural sets of LSPs are stored. LSP decoding circuit 1020 receives, as its input, the index outputted from code input circuit 1010, reads the LSP corresponding to that index from the table contained therein, and sets the read LSP to LSP: {circumflex over (q)}j(Nsfr)(n), j=1, . . . , Np in the Nsfr-th subframe of the current frame (n-th frame), where Np represents a linear prediction order. The LSPs from the first to (Nsfr−1)th subframes are derived by linear interpolation of {circumflex over (q)}j(Nsfr)(n) and {circumflex over (q)}j(Nsfr)(n−1). LSP decoding circuit 1020 outputs the LSP: {circumflex over (q)}j(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr to linear prediction coefficient converting circuit 1030 and to smoothing coefficient calculating circuit 1310.
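The subframe interpolation of the LSPs described above can be sketched as follows. The description does not state the interpolation weights, so this sketch assumes that subframe m (m=1, . . . , Nsfr) uses the weight m/Nsfr for the current frame and 1−m/Nsfr for the previous frame. The function name `interpolate_lsp` is hypothetical.

```python
def interpolate_lsp(prev_last, curr_last, n_sfr):
    """Linearly interpolate LSP vectors for subframes 1..n_sfr.

    prev_last -- LSP of the last subframe of the previous frame (n-1)
    curr_last -- decoded LSP of the last subframe of the current frame (n)
    Assumption: subframe m is weighted m/n_sfr toward the current frame,
    so the last subframe reproduces curr_last exactly.
    """
    subframes = []
    for m in range(1, n_sfr + 1):
        w = m / n_sfr
        subframes.append([(1.0 - w) * p + w * c for p, c in zip(prev_last, curr_last)])
    return subframes


lsp_per_subframe = interpolate_lsp([0.0, 1.0], [1.0, 2.0], 4)
```

Under this weighting, the Nsfr-th subframe equals the decoded LSP and the earlier subframes move evenly from the previous frame toward it.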
Linear prediction coefficient converting circuit 1030 converts the LSP: {circumflex over (q)}j(m)(n) supplied from LSP decoding circuit 1020 to linear prediction coefficient {circumflex over (α)}j(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr, and outputs it to synthesizing filter 1040. It should be noted that, for the conversion from the LSP to the linear prediction coefficient, known methods can be used, for example the method described in Section 5.2.4 of Literature 2.
Sound source signal decoding circuit 1110 contains a table in which a plurality of sound source vectors are stored. Sound source signal decoding circuit 1110 receives the index outputted from code input circuit 1010, reads the sound source vector corresponding to that index from the table contained therein, and outputs it to second gain circuit 1130.
First gain decoding circuit 1220 includes a table in which a plurality of gains are stored. First gain decoding circuit 1220 receives, as its input, the index outputted from code input circuit 1010, reads the first gain corresponding to that index from the table contained therein, and outputs it to first gain circuit 1230.
Second gain decoding circuit 1120 contains another table in which a plurality of gains are stored. Second gain decoding circuit 1120 receives, as its input, the index from code input circuit 1010, reads the second gain corresponding to that index from the table contained therein, and outputs it to smoothing circuit 1320.
First gain circuit 1230 receives, as its inputs, a first pitch vector, later described, outputted from pitch signal decoding circuit 1210 and the first gain outputted from first gain decoding circuit 1220, multiplies the first pitch vector by the first gain to produce a second pitch vector, and outputs the produced second pitch vector to adder 1050.
Second gain circuit 1130 receives, as its inputs, the first sound source vector from sound source signal decoding circuit 1110 and the second gain, later described, from smoothing circuit 1320, multiplies the first sound source vector by the second gain to produce a second sound source vector, and outputs the produced second sound source vector to adder 1050.
Adder 1050 calculates the sum of the second pitch vector from first gain circuit 1230 and the second sound source vector from second gain circuit 1130 and outputs the result of the addition to synthesizing filter 1040 as an excitation vector.
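The combined operation of first gain circuit 1230, second gain circuit 1130 and adder 1050 amounts to a gain-weighted sum of two vectors. A minimal sketch follows; the function and parameter names are illustrative assumptions.

```python
def excitation_vector(pitch_vec, first_gain, source_vec, second_gain):
    """Form the excitation vector as first_gain*pitch + second_gain*source.

    pitch_vec  -- first pitch vector (one subframe)
    source_vec -- first sound source vector (same length)
    """
    if len(pitch_vec) != len(source_vec):
        raise ValueError("pitch and sound source vectors must have equal length")
    return [first_gain * p + second_gain * c for p, c in zip(pitch_vec, source_vec)]


example_excitation = excitation_vector([1.0, 2.0], 0.5, [4.0, -2.0], 0.25)
```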
Storage circuit 1240 receives the excitation vector from adder 1050 and holds it. Storage circuit 1240 outputs the excitation vectors which were previously received and held thereby to pitch signal decoding circuit 1210.
Pitch signal decoding circuit 1210 receives, as its inputs, the previous excitation vectors held in storage circuit 1240 and the index from code input circuit 1010. The index specifies a delay Lpd. Pitch signal decoding circuit 1210 takes a vector for Lsfr samples corresponding to a vector length from the point going back Lpd samples from the beginning of the current frame in the previous excitation vectors to produce a first pitch signal (i.e., first pitch vector). When Lpd<Lsfr, a vector for Lpd samples is taken, and the taken Lpd samples are repeatedly connected to produce a first pitch vector with a vector length of Lsfr samples. Pitch signal decoding circuit 1210 outputs the first pitch vector to first gain circuit 1230.
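The extraction of the first pitch vector, including the repetition used when Lpd<Lsfr, can be sketched as below. The sketch assumes that the previous excitation is held as a list whose last element is the most recent sample; the function name `first_pitch_vector` is hypothetical.

```python
def first_pitch_vector(past_excitation, lag, l_sfr):
    """Build a pitch vector of l_sfr samples from the past excitation.

    past_excitation -- previously decoded excitation, most recent sample last
    lag             -- delay Lpd in samples
    l_sfr           -- subframe length Lsfr in samples
    """
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the length of the past excitation")
    # Start lag samples back; take l_sfr samples, or only lag samples if lag < l_sfr.
    segment = past_excitation[-lag:][:l_sfr]
    # Repeat the segment periodically until l_sfr samples are filled.
    return [segment[i % len(segment)] for i in range(l_sfr)]


history = list(range(10))
short_lag = first_pitch_vector(history, 3, 8)
long_lag = first_pitch_vector(history, 6, 4)
```

When the delay is at least one subframe long the segment is used directly; otherwise the Lpd samples are concatenated repeatedly, as described in the text.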
Smoothing coefficient calculating circuit 1310 receives the LSP: {circumflex over (q)}j(m)(n) outputted from LSP decoding circuit 1020, and calculates an average LSP: {overscore (q)}0j(n) in n-th frame with the following equation:

$$\bar{q}_{0j}(n) = 0.84 \cdot \bar{q}_{0j}(n-1) + 0.16 \cdot \hat{q}_j^{(N_{sfr})}(n)$$
Next, smoothing coefficient calculating circuit 1310 calculates a variation d0(m) of the LSP for each subframe m with the following equation:
$$d_0(m) = \sum_{j=1}^{N_p} \frac{\left| \bar{q}_{0j}(n) - \hat{q}_j^{(m)}(n) \right|}{\bar{q}_{0j}(n)}$$

A smoothing coefficient k0(m) in subframe m is calculated with the following equation:

$$k_0(m) = \frac{\min\left(0.25,\ \max\left(0,\ d_0(m) - 0.4\right)\right)}{0.25}$$

where min(x,y) is a function which takes on a smaller one of x and y, while max(x,y) is a function which takes on a larger one of x and y. Finally, smoothing coefficient calculating circuit 1310 outputs the smoothing coefficient k0(m) to smoothing circuit 1320.
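The calculation performed by smoothing coefficient calculating circuit 1310 can be sketched as three small functions. The sketch assumes that the variation d0(m) is the sum of absolute differences between the average LSP and the subframe LSP, each normalized by the average LSP. All function names are illustrative assumptions.

```python
def update_average_lsp(avg_prev, lsp_last_subframe):
    """Running average of the LSP: 0.84 * previous average + 0.16 * current LSP."""
    return [0.84 * a + 0.16 * q for a, q in zip(avg_prev, lsp_last_subframe)]


def lsp_variation(avg_lsp, lsp_subframe):
    """Variation d0(m); assumes normalized absolute differences summed over j."""
    return sum(abs(a - q) / a for a, q in zip(avg_lsp, lsp_subframe))


def smoothing_coefficient(d0):
    """Map the variation to k0(m) in the range [0, 1]."""
    return min(0.25, max(0.0, d0 - 0.4)) / 0.25


k_stationary = smoothing_coefficient(0.3)  # small variation
k_changing = smoothing_coefficient(1.0)    # large variation
```

A variation of 0.4 or less gives k0(m)=0, and a variation of 0.65 or more gives k0(m)=1, with a linear transition in between.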
Smoothing circuit 1320 receives, as its inputs, the smoothing coefficient k0(m) from smoothing coefficient calculating circuit 1310 and the second gain from second gain decoding circuit 1120. Smoothing circuit 1320 calculates an average gain {overscore (g)}0(m) from a second gain ĝ0(m) in a subframe m with the following equation:
$$\bar{g}_0(m) = \frac{1}{5} \sum_{i=0}^{4} \hat{g}_0(m-i)$$
Next, the second gain is replaced by the value given by the following equation:

$$\hat{g}_0(m) = \hat{g}_0(m) \cdot k_0(m) + \bar{g}_0(m) \cdot \left(1 - k_0(m)\right)$$
Finally, smoothing circuit 1320 outputs the substituted second gain to second gain circuit 1130.
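The operation of smoothing circuit 1320 can be sketched as a small stateful object. Two points are assumptions not settled by the description: the average is taken over the decoded (unsmoothed) second gains, and during the first few subframes, when fewer than five gains are available, the average is taken over those available. The class name `GainSmoother` is hypothetical.

```python
from collections import deque


class GainSmoother:
    """Blend the decoded second gain with its average over the last 5 subframes."""

    def __init__(self, history=5):
        # Keeps the most recent decoded gains; older values drop out automatically.
        self._recent = deque(maxlen=history)

    def smooth(self, gain, k):
        """Return gain*k + average*(1-k), where k is the smoothing coefficient."""
        self._recent.append(gain)
        average = sum(self._recent) / len(self._recent)
        return gain * k + average * (1.0 - k)


smoother = GainSmoother()
for g in (1.0, 1.0, 1.0, 1.0):
    smoother.smooth(g, 1.0)
fully_smoothed = smoother.smooth(6.0, 0.0)  # k = 0: output is the 5-subframe average
```

With k0(m)=1 the decoded gain passes through unchanged, and with k0(m)=0 it is replaced entirely by the average, so the smoothing is strongest when the LSP variation is small.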
Synthesizing filter 1040 receives, as its inputs, the excitation vector from adder 1050 and the linear prediction coefficient {circumflex over (α)}j(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr from linear prediction coefficient converting circuit 1030. In synthesizing filter 1040, the excitation vector drives the synthesizing filter (1/A(z)) for which the linear prediction coefficient is set to calculate a reproduced vector, which is then outputted from output terminal 20.
The transfer function of synthesizing filter 1040 is represented as follows:
$$\frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{N_p} \alpha_i z^i}$$

where the linear prediction coefficient is αi, i=1, . . . , Np.
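The all-pole synthesis performed by synthesizing filter 1040 can be sketched as a direct-form recursion. The sketch assumes that the i-th power of z in the transfer function denotes a delay of i samples, so that each output sample is the excitation sample plus a weighted sum of the Np previous output samples. The function name `synthesize` and the `state` argument are illustrative assumptions.

```python
def synthesize(excitation, lpc, state=None):
    """Drive the synthesis filter 1/A(z) with an excitation vector.

    excitation -- excitation samples for one subframe
    lpc        -- linear prediction coefficients alpha_1..alpha_Np
    state      -- previous outputs, most recent first (zeros if omitted)
    Returns (reproduced samples, updated state).
    """
    order = len(lpc)
    history = list(state) if state is not None else [0.0] * order
    output = []
    for x in excitation:
        # y(t) = x(t) + sum_i alpha_i * y(t - i)
        y = x + sum(a * h for a, h in zip(lpc, history))
        output.append(y)
        history = ([y] + history)[:order]
    return output, history


impulse_response, final_state = synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

Returning the state allows the filter memory to be carried from one subframe to the next while the coefficients are updated per subframe.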
Next, a conventional speech signal coding apparatus is described. FIG. 2 is a block diagram showing an example of a configuration of a speech signal coding apparatus used in a conventional speech signal coding and decoding system. The speech signal coding apparatus is used in a pair with the speech signal decoding apparatus shown in FIG. 1 such that coded data outputted from the speech signal coding apparatus is transmitted and inputted to the speech signal decoding apparatus shown in FIG. 1. Since the operations of first gain circuit 1230, second gain circuit 1130, adder 1050 and storage circuit 1240 in FIG. 2 are similar to those of the respective corresponding functional blocks described for the speech signal decoding apparatus shown in FIG. 1, the description thereof is not repeated here.
In the apparatus shown in FIG. 2, speech signals are sampled, and a plurality of the resultant samples are formed into one vector as one frame to produce an input signal (input vector) which is then inputted from input terminal 30.
Linear prediction coefficient calculating circuit 5510 performs linear prediction analysis on the input vector supplied from input terminal 30 to derive a linear prediction coefficient. For the linear prediction analysis, reference can be made to known methods, for example, in Section 8 “Linear Predictive Coding of Speech” of “Digital Processing of Speech Signals”, L. R. Rabiner et al., Prentice-Hall, 1978 (Literature 3). Linear prediction coefficient calculating circuit 5510 outputs the derived linear prediction coefficient to LSP conversion/quantization circuit 5520.
LSP conversion/quantization circuit 5520 receives the linear prediction coefficient from linear prediction coefficient calculating circuit 5510, converts the linear prediction coefficient to an LSP, and quantizes the LSP to derive the quantized LSP. For the conversion from the linear prediction coefficient to the LSP, known methods can be referenced, for example, the method described in Section 5.2.4 of Literature 2. For the quantization of the LSP, the method described in Section 5.2.5 of Literature 2 can be referenced. The quantized LSP is set to a quantized LSP: {circumflex over (q)}j(Nsfr)(n), j=1, . . . , Np in the Nsfr-th subframe of the current frame (n-th frame), similarly to the LSP in the LSP decoding circuit of the speech signal decoding apparatus shown in FIG. 1. The quantized LSPs from the first to (Nsfr−1)th subframes are derived by linear interpolation of {circumflex over (q)}j(Nsfr)(n) and {circumflex over (q)}j(Nsfr)(n−1). Likewise, the unquantized LSP is set to an LSP: qj(Nsfr)(n), j=1, . . . , Np in the Nsfr-th subframe of the current frame (n-th frame), and the LSPs from the first to (Nsfr−1)th subframes are derived by linear interpolation of qj(Nsfr)(n) and qj(Nsfr)(n−1).
LSP conversion/quantization circuit 5520 outputs the LSP: qj(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr and the quantized LSP: {circumflex over (q)}j(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr to linear prediction coefficient converting circuit 5030, and outputs the index corresponding to the quantized LSP: {circumflex over (q)}j(Nsfr)(n) to code output circuit 6010.
Linear prediction coefficient converting circuit 5030 receives, as its inputs, the LSP: qj(m)(n) and the quantized LSP: {circumflex over (q)}j(m)(n) from LSP conversion/quantization circuit 5520, converts the LSP (qj(m)(n)) to a linear prediction coefficient [αj(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr], converts the quantized LSP ({circumflex over (q)}j(m)(n)) to a quantized linear prediction coefficient: {circumflex over (α)}j(m)(n), j=1, . . . , Np, m=1, . . . , Nsfr, outputs the linear prediction coefficient αj(m)(n) to weighting filter 5050 and to weighting synthesizing filter 5040, and outputs the quantized linear prediction coefficient {circumflex over (α)}j(m)(n) to weighting synthesizing filter 5040. For the conversion from the LSP to the linear prediction coefficient and the conversion from the quantized LSP to the quantized linear prediction coefficient, known methods can be referenced, for example, the method described in Section 5.2.4 of Literature 2.
Weighting filter 5050 receives, as its inputs, the input vector from input terminal 30 and the linear prediction coefficient αj(m)(n) from linear prediction coefficient converting circuit 5030, and uses the linear prediction coefficient to produce a transfer function W(z) of the weighting filter corresponding to human auditory characteristics. The weighting filter is driven by the input vector to obtain a weighted input vector. Weighting filter 5050 outputs the weighted input vector to differentiator 5060. The transfer function W(z) of the weighting filter is represented as follows:

$$W(z) = \frac{Q(z/\gamma_1)}{Q(z/\gamma_2)}$$

Here, the following hold:

$$Q(z/\gamma_1) = 1 - \sum_{i=1}^{N_p} \alpha_i^{(m)} \gamma_1^i z^i$$

$$Q(z/\gamma_2) = 1 - \sum_{i=1}^{N_p} \alpha_i^{(m)} \gamma_2^i z^i$$

γ1 and γ2 are constants, for example, γ1=0.9 and γ2=0.6. For details on the weighting filter, Literature 1 can be referenced.
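The weighting filter can be sketched as a pole-zero filter whose numerator and denominator coefficients are the linear prediction coefficients scaled by powers of γ1 and γ2. As with the synthesis filter, the sketch assumes that the i-th power of z denotes a delay of i samples. The function names `bandwidth_expand` and `weighting_filter` are hypothetical, and the filter memory is assumed to start at zero.

```python
def bandwidth_expand(lpc, gamma):
    """Scale the i-th coefficient by gamma**i (i = 1..Np)."""
    return [a * gamma ** (i + 1) for i, a in enumerate(lpc)]


def weighting_filter(signal, lpc, gamma1=0.9, gamma2=0.6):
    """Apply W(z) = Q(z/gamma1) / Q(z/gamma2) to a signal, starting from zero memory."""
    num = bandwidth_expand(lpc, gamma1)  # coefficients of Q(z/gamma1)
    den = bandwidth_expand(lpc, gamma2)  # coefficients of Q(z/gamma2)
    order = len(lpc)
    x_hist = [0.0] * order  # past inputs, most recent first
    y_hist = [0.0] * order  # past outputs, most recent first
    output = []
    for x in signal:
        y = (x
             - sum(b * h for b, h in zip(num, x_hist))
             + sum(a * h for a, h in zip(den, y_hist)))
        output.append(y)
        x_hist = ([x] + x_hist)[:order]
        y_hist = ([y] + y_hist)[:order]
    return output


scaled = bandwidth_expand([1.0, 1.0], 0.5)
```

When γ1 equals γ2 the numerator and denominator cancel and the filter passes the signal unchanged, which provides a simple sanity check of an implementation.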
Weighting synthesizing filter 5040 receives, as its inputs, an excitation vector outputted from adder 1050, the linear prediction coefficient αj(m)(n), and the quantized linear prediction coefficient {circumflex over (α)}j(m)(n) outputted from linear prediction coefficient converting circuit 5030. The weighting synthesizing filter H(z)W(z)=Q(z/γ1)/[A(z)Q(z/γ2)], for which these coefficients are set, is driven by the excitation vector to obtain a weighted reproduced vector. The transfer function H(z)=1/A(z) of the synthesizing filter is represented as follows:

$$\frac{1}{A(z)} = \frac{1}{1 - \sum_{i=1}^{N_p} \hat{\alpha}_i^{(m)} z^i}$$
Differentiator 5060 receives, as its inputs, the weighted input vector from weighting filter 5050 and the weighted reproduced vector from weighting synthesizing filter 5040, and calculates and outputs the difference between them as a difference vector to minimization circuit 5070.
Minimization circuit 5070 sequentially outputs indexes corresponding to all sound source vectors stored in sound source signal producing circuit 5110 to sound source signal producing circuit 5110, indexes corresponding to all delays Lpd within a specified range in pitch signal producing circuit 5210 to pitch signal producing circuit 5210, indexes corresponding to all first gains stored in first gain producing circuit 6220 to first gain producing circuit 6220, and indexes corresponding to all second gains stored in second gain producing circuit 6120 to second gain producing circuit 6120. Minimization circuit 5070 also calculates the norm of the difference vector outputted from differentiator 5060, selects the sound source vector, delay, first gain and second gain which lead to a minimized norm, and outputs the indexes corresponding to the selected values to code output circuit 6010.
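The selection performed by minimization circuit 5070 can be sketched as an exhaustive search over candidate indexes. This sketch reads the description literally as a joint search over all index combinations; practical coders usually search the parameters one after another to limit computation. The callable `reproduce`, which stands in for the chain from the producing circuits through the weighting synthesizing filter, and the function name `search_indexes` are hypothetical.

```python
from itertools import product


def search_indexes(target, codebooks, reproduce):
    """Return (best index tuple, squared norm of the difference vector).

    target    -- weighted input vector
    codebooks -- one list of candidate entries per coded parameter
    reproduce -- callable mapping one entry per codebook to a weighted
                 reproduced vector (stand-in for the synthesis chain)
    """
    best_idx, best_err = None, float("inf")
    for idx in product(*(range(len(cb)) for cb in codebooks)):
        candidate = reproduce(*(cb[i] for cb, i in zip(codebooks, idx)))
        err = sum((t - c) ** 2 for t, c in zip(target, candidate))
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx, best_err


# Toy example: two sound source vectors and two gains, no filtering.
vectors = [[1.0, 0.0], [0.0, 1.0]]
gains = [0.5, 2.0]
best, err = search_indexes([0.0, 2.0], [vectors, gains],
                           lambda vec, gain: [gain * v for v in vec])
```

The returned index tuple corresponds to the indexes that minimization circuit 5070 passes to code output circuit 6010.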
Each of pitch signal producing circuit 5210, sound source signal producing circuit 5110, first gain producing circuit 6220 and second gain producing circuit 6120 sequentially receives the indexes outputted from minimization circuit 5070. Since each of these pitch signal producing circuit 5210, sound source signal producing circuit 5110, first gain producing circuit 6220 and second gain producing circuit 6120 is the same as the counterpart of pitch signal decoding circuit 1210, sound source signal decoding circuit 1110, first gain decoding circuit 1220 and second gain decoding circuit 1120 shown in FIG. 1 except the connections for input and output, the detailed description of each of these blocks is not repeated.
Code output circuit 6010 receives the index corresponding to the quantized LSP outputted from LSP conversion/quantization circuit 5520, receives the indexes each corresponding to the sound source vector, delay, first gain and second gain outputted from minimization circuit 5070, converts each of the indexes to a code of bit sequences, and outputs it through output terminal 40.
The aforementioned conventional decoding apparatus and coding and decoding system have a problem of insufficient improvement in degradation of decoded sound quality in a noise period since the smoothing of the sound source gain (second gain) in the noise period fails to cause a sufficiently smooth change with time in short time average power calculated from the excitation vector. This is because the smoothing only of the sound source gain does not necessarily sufficiently smooth the short time average power of the excitation vector which is derived by adding the sound source vector (the second sound source vector after the gain multiplication) to a pitch vector (the second pitch vector after the gain multiplication).
FIG. 3 shows short time average power of an excitation signal (excitation vector) when sound source gain smoothing is performed in a noise period on the basis of the aforementioned prior art. FIG. 4 shows short time average power of an excitation signal when such smoothing is not performed. In each of these graphs, the horizontal axis represents a frame number, while the vertical axis represents power. The short time average power is calculated every 80 msec. It can be seen from FIG. 3 and FIG. 4 that, when the sound source gain is smoothed according to the prior art, the short time average power of the excitation signal after the smoothing is not necessarily sufficiently smooth over time.
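The short time average power plotted in FIG. 3 and FIG. 4 can be computed along the following lines. The description states only that the power is calculated every 80 msec; this sketch assumes non-overlapping windows and the mean of the squared samples within each window. The function name `short_time_power` is hypothetical.

```python
def short_time_power(signal, window):
    """Mean-square power over consecutive non-overlapping windows.

    window -- window length in samples (80 msec at 16 kHz would be 1280 samples)
    A trailing partial window is discarded.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(s * s for s in signal[start:start + window]) / window
        for start in range(0, len(signal) - window + 1, window)
    ]


powers = short_time_power([1.0, 1.0, 2.0, 2.0, 3.0], 2)
```

Large differences between successive values of such a sequence correspond to the power variations in decoded noise that the gain smoothing is intended to reduce.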