The invention relates to the coding of speech at variable bit rates, whereby the bit rates can vary from frame to frame, and more specifically to the methods and filters used for improving the quality of decoded speech.
The coding of speech at a variable bit rate can be used to maximize the capacity of a data transfer connection at a given level of speech quality, or to minimize the average bit rate of a speech connection. This is possible because speech is not homogeneous: if speech is divided into short sections, different sections can be represented with different numbers of bits without a perceivable difference in quality. Codecs using a fixed bit rate must operate at a compromise rate, which is low enough to save data transfer capacity but high enough to represent the different parts of speech with sufficient quality. This compromise rate is needlessly high for the sounds that could be represented with a smaller number of bits. The variable-rate method of speech coding can be used to advantage in many applications. Packet-switched networks, such as the Internet, can use variable-rate communications directly by sending packets of different sizes. Code Division Multiple Access (CDMA) systems can also utilize variable-rate coding directly. In CDMA systems, a lower average transmission rate reduces the mutual interference between different transmissions and makes it possible to increase the number of users. In the so-called third generation mobile communication systems, variable-rate data transfer is likely to be used in some form. In addition to data transfer, variable-rate coding is also useful in connection with voice recording and voice message systems, such as telephone answering machines, where the saving due to variable-rate coding is seen as saved recording capacity.
The bit rate of a variable-rate codec can be controlled in many ways. One way is based on monitoring the capacity of the data transfer network, whereby the momentary bit rate is determined according to the available capacity. In a system like this, upper and lower limits can also be set for the bit rate on the basis of the capacity in use. The capacity limits are perceived as reduced speech quality, particularly during times of congestion, when the system forces the bit rate down.
Variable-rate coding can also be used to implement an error-tolerant coding method for mobile stations. In a method like this, the bit rate of speech coding is adapted on the basis of the quality of the transmission channel. When the quality of the transmission channel is good, the bit rate is kept relatively high and in addition to the coded speech only a little error correction information is transferred. In good transmission conditions, this method is sufficient to remove transmission errors. When the quality of the transmission channel becomes worse, the bit rate is lowered, whereby stronger channel coding can be used in an ordinary fixed-rate transmission channel. Then the reduction of speech quality is minimized by means of this stronger channel coding, which can correct larger errors. However, speech quality is reduced somewhat when the quality of the transmission connection is weakened, because the bit rate is lowered.
A typical CELP coder (Code Excited Linear Prediction) comprises many filters modelling speech formation, for which a suitable excitation signal is selected from the excitation vectors contained by the codebook. A CELP coder includes typically both short-term and long-term filters, in which a synthesized version of the original speech signal is formed by filtering excitations selected from the codebook. An excitation vector producing the optimum excitation signal is sought from the excitation vectors of the codebook. During the search, each excitation vector is applied to the synthesizer, which includes both short-term and long-term filters. The synthesized speech signal is compared to the original speech signal, taking account of the response of the human hearing capacity, whereby a characteristic comparable to the observed speech quality is obtained. An optimum excitation vector is obtained for each part of the speech signal being processed by selecting from the codebook the excitation vector which produces the smallest weighted error signal for the part of the speech signal in question. CELP coders like this are described in more detail in the patent specification U.S. Pat. No. 5,327,519, for instance.
FIG. 1 shows an example of a block diagram of a prior art fixed-rate CELP coder. The coder comprises two analysis blocks, namely the short-term analysis block 10 and the long-term analysis block 11. These analyse the speech signal s(n) to be coded, the short-term analysis block mostly the formants of the spectrum of the speech signal and the long-term analysis block mostly the periodicity (pitch) of the speech signal. The blocks form multiplier sets a(i) and b(i), which determine the filtering properties of the short-term and long-term filter blocks. The multiplier set a(i) formed by the short-term analysis block corresponds to the formants of the spectrum of the speech signal to be coded, and the multiplier set b(i) formed by the long-term analysis block corresponds to the periodicity (pitch) of the speech signal to be coded. The multiplier sets a(i) and b(i) are sent to the receiver through the data transfer channel 5. The multiplier sets are calculated separately for each frame of the speech signal to be coded, the temporal length of the frames being typically 20 ms.
The long and short-term filter blocks 13, 12 filter excitations selected from the codebook according to the multiplier sets a(i) and b(i). The long-term filter thus models the periodicity (pitch) of the voice, or the vibration of the vocal cords, and the short-term filter models the formants of the spectrum, or the human vocal tract. The filtering result ss(n) is subtracted from the speech signal s(n) to be coded in the summing device 18. The residual signal e(n) is taken to the weighting filter 14. The properties of the weighting filter are chosen according to the characteristics of human hearing. The weighting filter attenuates the frequencies which are perceptually less important, and emphasizes those frequencies which have a substantial effect on the perceived speech quality. On the basis of the output signal of the weighting filter, the code vector search control block 15 searches for a corresponding excitation vector index u. The excitation codebook 16 forms the desired excitation on the basis of the code vector corresponding to the index, and the excitation is fed to the multiplication device 17. The multiplication device forms the product of the excitation and the weighting factor g of the excitation given by the code vector search control block, and this product is fed to the filter blocks 12, 13. The code vector search control block searches iteratively for an optimum excitation code vector. When the residual signal e(n) is at its minimum or sufficiently small, the desired code vector is considered to be found, whereby the index u of the excitation code vector and the weighting factor g are sent to the receiver.
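The iterative search described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the coder of FIG. 1: only a short-term all-pole synthesis filter is applied, the long-term filter and the perceptual weighting filter are omitted, and the function names, the sign convention of the filter coefficients and the least-squares choice of the gain g are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def synthesize(excitation: Sequence[float], a: Sequence[float]) -> List[float]:
    """All-pole synthesis: y[n] = x[n] - sum_{i>=1} a[i] * y[n-i].

    `a` holds the coefficients a[1..p]; the sign convention is an
    assumption of this sketch.
    """
    y: List[float] = []
    for n, x in enumerate(excitation):
        acc = x
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc -= ai * y[n - i]
        y.append(acc)
    return y


def search_codebook(
    target: Sequence[float],
    codebook: Sequence[Sequence[float]],
    a: Sequence[float],
) -> Tuple[int, float, float]:
    """Return (index u, gain g, squared error) of the best code vector.

    Each candidate is synthesized, the gain that minimizes the squared
    error is computed in closed form, and the candidate with the smallest
    error wins. No perceptual weighting is applied here.
    """
    best: Tuple[int, float, float] = (-1, 0.0, float("inf"))
    for u, code in enumerate(codebook):
        synth = synthesize(code, a)
        energy = sum(v * v for v in synth)
        if energy == 0.0:
            continue  # an all-zero candidate cannot match anything
        g = sum(t * v for t, v in zip(target, synth)) / energy
        err = sum((t - g * v) ** 2 for t, v in zip(target, synth))
        if err < best[2]:
            best = (u, g, err)
    return best
```

In a practical coder the error would be computed on the weighted residual, and the search would usually be restructured to avoid synthesizing every candidate in full.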
FIG. 2 shows an example of a block diagram of a prior art CELP decoder. The decoder receives the coding parameter sets a(i) and b(i), the weighting factor g and the excitation code vector index u from the data transfer channel 5. An excitation code vector corresponding to the index u is selected from the excitation codebook, and a corresponding excitation c(n) is multiplied in the multiplication device 21 with the weighting factor g. The resulting signal is fed to the long-term synthesizing filter 22 and further to the short-term synthesizing filter 23. The coding parameter sets a(i) and b(i) control the filters 22, 23 in the same way as in the coder of FIG. 1. The output signal of the short-term filter is filtered further in a postfilter 24 to form a reconstructed speech signal s′(n).
In a modification of CELP coding, namely ACELP (algebraic code excited linear prediction), the excitation signal consists of a constant number of pulses differing from zero. An optimum excitation signal is obtained by selecting the optimum places and amplitudes of pulses with similar error criteria as in CELP coding. Coding like this is described e.g. in the conference publications Järvinen K., Vainio J., Kapanen P., Honkanen T., Haavisto P., Salami R., Laflamme C. and Adoul J-P., GSM Enhanced Full Rate Speech Codec, International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, Apr. 21-24, 1997, and Honkanen T., Vainio J., Järvinen K., Haavisto P., Salami R., Laflamme C. and Adoul J-P., Enhanced Full Rate Speech Codec for IS-136 Digital Cellular System, International Conference on Acoustics, Speech and Signal Processing, Munich, Germany, Apr. 21-24, 1997.
It is typical of low bit-rate codecs like this that, because of inaccurate excitation modelling, the voice quality as such would be poor. Because of this, the output signal of the codec is filtered in order to improve the perceivable speech quality. Both short and long-term filtering can be used in postfiltering like this. The filtering properties are regulated by means of weighting factors. The purpose of short-term postfiltering is to emphasize the formants of the spectrum and thus attenuate the frequencies surrounding them, which improves the perceived quality of speech. The purpose of long-term postfiltering is to emphasize the fine features of the spectrum. An example is a fixed 10th degree short-term postfilter, which is of the form
$$H(z) = \frac{\sum_{i=0}^{10} \alpha^{i}\, b_{i}\, z^{-i}}{\sum_{i=0}^{10} \beta^{i}\, c_{i}\, z^{-i}} \tag{1}$$
wherein b_i and c_i are the determining factors of the short-term spectrum of the frame to be analyzed, and α and β are weighting factors that regulate filtering. The weighting factors move the zeroes and poles of the short-term model of the filter closer to the origin. The values of the weighting factors are chosen individually for each codec type, typically by means of listening tests. A postfilter like this can be weakened by moving the filter poles closer to the origin by reducing the value of the factor β and/or by moving the zeroes of the filter closer to the unit circle by increasing the value of the factor α. A short-term postfilter can also be realized by means of a transfer function having only poles or only zeroes.
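As an illustration of equation (1), the following sketch weights the coefficients with powers of α and β and applies the resulting pole-zero filter to a sequence of samples. The function names are hypothetical, the samples before the start of the input are assumed to be zero, and the leading denominator coefficient is assumed to be non-zero.

```python
from typing import List, Sequence


def weight_coefficients(coeffs: Sequence[float], factor: float) -> List[float]:
    """Return factor**i * coeffs[i], the weighted terms of equation (1)."""
    return [(factor ** i) * c for i, c in enumerate(coeffs)]


def short_term_postfilter(
    x: Sequence[float],
    b: Sequence[float],
    c: Sequence[float],
    alpha: float,
    beta: float,
) -> List[float]:
    """Filter x with H(z) = sum(alpha^i b_i z^-i) / sum(beta^i c_i z^-i).

    Assumes c[0] != 0 and zero filter state before the first sample.
    """
    num = weight_coefficients(b, alpha)
    den = weight_coefficients(c, beta)
    y: List[float] = []
    for n in range(len(x)):
        # Feed-forward part (zeroes of the filter).
        acc = sum(num[i] * x[n - i] for i in range(len(num)) if n - i >= 0)
        # Feedback part (poles of the filter).
        acc -= sum(den[i] * y[n - i] for i in range(1, len(den)) if n - i >= 0)
        y.append(acc / den[0])
    return y
```

When the same coefficients are used in the numerator and the denominator and α equals β, the zeroes cancel the poles and the signal passes through unchanged. Widening the gap between the two weighting factors strengthens the filter.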
It is a known fact that the lower the bit rate used in speech coding, the stronger postfiltering is needed to mask the distortion caused by coding. However, in the prior art variable-rate codecs, the same postfilter has been used with all bit rates. An example of a variable-rate codec like this is the QCELP codec, which is used in the IS-96 CDMA system.
However, the patent specification U.S. Pat. No. 4,617,676 discloses, in connection with ADPCM coding (Adaptive Differential Pulse Code Modulation), a solution in which different weighting factors are used in the postfilter for speech signals coded at different bit rates. According to the specification, the weighting factors are changed when the bit rate used for coding is changed.
Using different postfilters for different bit rates entails the problem that when the bit rate and the postfilter are changed, the tone of the speech is also changed. The listener perceives this as discontinuity and disturbance. Because of this, in the prior art variable-rate codecs the weighting factors of the postfilter are typically kept constant. A postfilter that is adjusted according to the bit rate of each frame causes disturbances both in coding that takes place sample by sample (such as ADPCM) and in coding that takes place frame by frame (CELP).
FIG. 3 shows a prior art adaptive postfilter as applied to an LD-CELP decoder according to the standard ITU-T G.728. The parameters and intensity of the pitch of decoded speech are analysed in the analysis block 40. These results are used to control the operation of the long-term postfilter block 42. The transfer function of the long-term postfilter block 42 is
$$H_l(z) = g_l\,\left(1 + b\,z^{-p}\right) \tag{2}$$
wherein p is the pitch-lag, b is the filter weighting factor and g_l is the scaling factor. Suitable values for b and g_l are, for example:
$$b = \begin{cases} 0, & \beta < 0.6 \\ 0.15\,\beta, & 0.6 \le \beta \le 1 \\ 0.15, & \beta > 1 \end{cases} \tag{3}$$
$$g_l = \frac{1}{1 + b} \tag{4}$$
wherein β is the amplification factor of the single tap pitch predictor, whereby the pitch-lag is p samples. The pitch postfilter is constructed as a comb filter, in which the resonance peaks are at multiples of the pitch frequency of the speech being postfiltered. The transfer function of the short-term postfilter 43 is
$$H(z) = \frac{\sum_{i=0}^{10} \gamma_1^{i}\, a_i\, z^{-i}}{\sum_{i=0}^{10} \gamma_2^{i}\, a_i\, z^{-i}} \tag{5}$$
wherein the weighting factor parameters γ1=0.65 and γ2=0.75 regulate the strength of the postfiltering and the factors a_i are the parameters that determine the short-term spectrum. Postfiltering can further be regulated by means of the tilt factor H′(z) as follows:
$$H'(z) = H(z)\,\frac{1}{1 + \mu\, z^{-1}} \tag{6}$$
wherein μ=γ3k1, and k1 is the first reflection coefficient of the model of the short-term analysis block used in speech coding. The factors of the short-term model are obtained from the decoder. Because the gain of the signal can change in postfiltering, automatic gain control is used to keep the gain constant. The gain of the decoded speech s(n) is determined in the scaling factor computation block 41, after which the gain of the postfiltered speech s′(n) is adjusted in the scaling block 44 to correspond to the gain of the decoded speech. The scaling factor of each frame is typically calculated according to the formula:
$$g = \sqrt{\frac{\sum_{n=0}^{L} s^{2}(n)}{\sum_{n=0}^{L} s_f^{2}(n)}} \tag{7}$$
wherein s(n) is the decoded speech signal, s_f(n) is the signal after the short and long-term postfiltering blocks and L is the length of the frame to be analyzed. The scaling block 44 performs the multiplication
$$s'(n) = g\,s_f(n) \tag{8}$$
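The automatic gain control can be sketched as follows. The sketch assumes that the scaling factor is the square root of the ratio of the frame energies, so that the scaled frame has the same energy as the decoded frame. The function names and the fallback value of 1.0 for a silent postfiltered frame are assumptions made for illustration.

```python
import math
from typing import List, Sequence


def scaling_factor(decoded: Sequence[float], postfiltered: Sequence[float]) -> float:
    """Gain g that matches the postfiltered frame energy to the decoded one.

    Assumption: g is the square root of the energy ratio. For a silent
    postfiltered frame the sketch falls back to g = 1.0.
    """
    num = sum(v * v for v in decoded)
    den = sum(v * v for v in postfiltered)
    return math.sqrt(num / den) if den > 0.0 else 1.0


def apply_gain(postfiltered: Sequence[float], g: float) -> List[float]:
    """Scale every sample of the postfiltered frame, as in equation (8)."""
    return [g * v for v in postfiltered]
```

For example, a postfiltered frame with half the amplitude of the decoded frame yields g = 2, which restores the original frame energy.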
In the GSM EFR standard, the weighting factors are γ1=0.7, γ2=0.75 and γ3=0.15.
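The long-term postfilter of equations (2) to (4) can be sketched in the same way. The function names are hypothetical, and the samples preceding the start of the input are assumed to be zero, whereas a real decoder would carry the signal history over from frame to frame.

```python
from typing import List, Sequence


def pitch_weight(beta: float) -> float:
    """Filter weighting factor b as a function of the pitch predictor gain."""
    if beta < 0.6:
        return 0.0
    if beta <= 1.0:
        return 0.15 * beta
    return 0.15


def long_term_postfilter(x: Sequence[float], pitch_lag: int, beta: float) -> List[float]:
    """Apply the comb filter g * (1 + b * z^-p) to the samples in x.

    Assumption: samples before the start of x are treated as zero.
    """
    b = pitch_weight(beta)
    g = 1.0 / (1.0 + b)  # scaling factor of equation (4)
    out: List[float] = []
    for n in range(len(x)):
        delayed = x[n - pitch_lag] if n >= pitch_lag else 0.0
        out.append(g * (x[n] + b * delayed))
    return out
```

With a weak pitch predictor gain (β below 0.6) the weighting factor is zero and the filter passes the signal through unchanged. For a signal that repeats exactly at the pitch lag, the scaling factor keeps the output level equal to the input level.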
FIG. 4 shows a variable-rate coder controlled by the source signal and the data transfer network. The coding block 20 receives the speech signal to be coded s(n). The speech signal to be coded is also taken to the bit rate control block 21, which controls the bit rate according to the speech signal s(n). The control block 21 also receives a control signal O, which typically determines the highest and lowest allowed bit rate and the desired average bit rate. In addition to this information, the control block 21 can receive information about the quality of the coding and the quality of the data transfer channel and use this information for controlling the bit rate. For example, if the quality of the data transfer channel is bad, it is advantageous to lower the bit rate, whereby stronger channel coding can be used. The data transfer channel is used to convey information about the parameters used by the coder, such as the bit rate, to the recipient.
FIG. 5 illustrates how the bit rate of a variable-rate coder controlled by a source signal, as in the example of FIG. 4, varies according to the source signal. The upper curve represents the speech signal and the lower curve the bit rate used by the coder. In principle, the bit rate can vary frame by frame. In the example of FIG. 5, the average bit rate is about 7.0 kbit/s.
The postfilter solutions used in variable-rate codecs entail yet another problem: they do not take into account whether the sound in each frame is voiced, unvoiced or merely background noise. This problem arises particularly with low bit rates, which require a strong postfilter. Strong postfiltering distorts particularly the tone of unvoiced frames and of frames containing only background noise. In frames like this, the signal spectrum is rather even and lacks clear formants, yet strong postfiltering tends to create such formants. Thus the speech signal is easily distorted during frames like this, which the listener perceives as weakened speech quality.
It is an object of the invention to improve the quality of speech in a telecommunication system which uses variable-rate speech coding. It is also an object of the invention to improve the quality of a speech signal decoded from a coded signal. In addition, the invention aims at improving the tolerance of a telecommunication system with respect to data transfer errors.
The objects are achieved by realizing a postfiltering system in which the postfiltering is adapted at least according to the long-term average bit rate, and by realizing a corresponding adaptive postfilter which adapts itself at least according to the long-term average bit rate.
The method according to the invention is characterized in what is stated in the characterizing part of the independent method claim. The invention also relates to a decoding system, which is characterized in what is stated in the characterizing part of the independent claim concerning a decoding system. The invention also relates to a mobile station, which is characterized in what is stated in the characterizing part of the independent claim concerning a mobile station. Furthermore, the invention relates to an element of a telecommunication system, which element is characterized in what is stated in the characterizing part of the independent claim concerning an element of a telecommunication system. The subclaims describe various advantageous embodiments of the invention.
In the solution according to the invention, the weighting factors of the postfilter are not adjusted according to the momentary bit rate, that is, the bit rate used in the coding of each frame. Instead, the weighting factors are adjusted according to an average bit rate calculated for a certain period of time, for instance by calculating the average over several frames. In addition to this, the weighting factors of the postfilter are also adjusted according to whether each frame contains a voiced speech signal, an unvoiced speech signal or background noise. Postfiltering is weakened at frames containing an unvoiced speech signal or background noise, so that the tone of the signal is not distorted in such frames by postfiltering that is adapted to a voiced signal. In addition, the weighting factors of the postfilter can also be adapted on the basis of the error rate of the received signal or another signal or parameter describing the quality of the data transfer channel. For example, postfiltering can advantageously be adjusted so that when the bit error rate increases, postfiltering is strengthened, whereby the effect of data transfer errors in the decoded speech signal is reduced and the tolerance of the system with regard to data transfer errors increases.
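The control principle described above can be sketched as follows. This is an illustration only: the class and function names, the window length, the bit rate limits, the strength scale from 0 to 1, the attenuation for unvoiced and noise frames, the bit error rate threshold and the mapping from strength to weighting factors are all hypothetical values chosen for the example and are not taken from the specification.

```python
from collections import deque
from typing import Deque, Tuple

# All constants below are hypothetical illustration values.
LOW_RATE_KBPS = 4.0     # at or below this average rate: strongest postfilter
HIGH_RATE_KBPS = 12.0   # at or above this average rate: weakest postfilter
MAX_STRENGTH = 1.0
MIN_STRENGTH = 0.3
UNVOICED_SCALE = 0.5    # weakening for unvoiced and background noise frames
BER_THRESHOLD = 0.01    # bit error rate above which postfiltering is boosted
BER_BOOST = 1.25


def base_strength(avg_rate_kbps: float) -> float:
    """Interpolate the postfilter strength from the average bit rate."""
    if avg_rate_kbps <= LOW_RATE_KBPS:
        return MAX_STRENGTH
    if avg_rate_kbps >= HIGH_RATE_KBPS:
        return MIN_STRENGTH
    t = (avg_rate_kbps - LOW_RATE_KBPS) / (HIGH_RATE_KBPS - LOW_RATE_KBPS)
    return MAX_STRENGTH + t * (MIN_STRENGTH - MAX_STRENGTH)


class PostfilterController:
    """Tracks the average bit rate over a sliding window of frames."""

    def __init__(self, window_frames: int = 50) -> None:
        self._rates: Deque[float] = deque(maxlen=window_frames)

    def update(
        self,
        frame_rate_kbps: float,
        frame_class: str,
        bit_error_rate: float = 0.0,
    ) -> float:
        """Return the postfilter strength for the current frame."""
        self._rates.append(frame_rate_kbps)
        avg = sum(self._rates) / len(self._rates)
        strength = base_strength(avg)  # driven by the average, not the frame rate
        if frame_class != "voiced":
            strength *= UNVOICED_SCALE
        if bit_error_rate > BER_THRESHOLD:
            strength = min(MAX_STRENGTH, strength * BER_BOOST)
        return strength


def weighting_factors(strength: float) -> Tuple[float, float]:
    """Map a strength to (gamma1, gamma2); equal factors mean no filtering."""
    gamma2 = 0.75
    gamma1 = gamma2 - 0.15 * strength
    return gamma1, gamma2
```

Because the strength follows the windowed average, a single low-rate frame in the middle of high-rate frames changes the weighting factors only slightly, which avoids abrupt changes in the tone of the speech from frame to frame.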