This invention relates to a voice encoding and voice decoding apparatus for encoding/decoding voice at a low bit rate of below 4 kbps. More particularly, the invention relates to a voice encoding and voice decoding apparatus for encoding/decoding voice at low bit rates using an A-b-S (Analysis-by-Synthesis)-type vector quantization. It is expected that A-b-S voice encoding typified by CELP (Code Excited Linear Predictive Coding) will be an effective scheme for implementing highly efficient compression of information while maintaining speech quality in digital mobile communications and intercorporate communications systems.
In the field of digital mobile communications and intercorporate communications systems at the present time, it is desired that voice in the telephone band (0.3 to 3.4 kHz) be encoded at a transmission rate on the order of 4 kbps. The scheme referred to as CELP (Code Excited Linear Prediction) is seen as having promise in filling this need. For details on CELP, see M. R. Schroeder and B. S. Atal, “Code-Excited Linear Prediction (CELP): High-Quality Speech at Very Low Bit Rates,” Proc. ICASSP'85, 25.1.1, pp. 937-940, 1985. CELP is characterized by the efficient transmission of linear prediction coefficients (LPC coefficients), which represent the speech characteristics of the human vocal tract, and parameters representing a sound-source signal comprising the pitch component and noise component of speech.
FIG. 15 is a diagram illustrating the principles of CELP. In accordance with CELP, the human vocal tract is approximated by an LPC synthesis filter H(z) expressed by the following equation:

$$H(z) = \frac{1}{1 + \sum_{i=1}^{p} \alpha_i z^{-i}} \tag{1}$$

and it is assumed that the input (sound-source signal) to H(z) can be separated into (1) a pitch-period component representing the periodicity of speech and (2) a noise component representing randomness. CELP, rather than transmitting the input voice signal to the decoder side directly, extracts the filter coefficients of the LPC synthesis filter and the pitch-period component and noise component of the excitation signal, quantizes these to obtain quantization indices and transmits the quantization indices, thereby implementing a high degree of information compression.
When the voice signal is sampled at a predetermined rate in FIG. 15, input signals (voice signals) X of a predetermined number (=N) of samples per frame are input to an LPC analyzer 1 frame by frame. If the sampling rate is 8 kHz and the period of a single frame is 10 ms, then one frame is composed of 80 samples.
The LPC analyzer 1, regarding the human vocal tract as an all-pole filter represented by Equation (1), obtains filter coefficients αi (i=1, . . . , p), where p represents the order of the filter. Generally, in the case of voice in the telephone band, a value of 10 to 12 is used as p. The LPC coefficients αi (i=1, . . . , p) are quantized by scalar quantization or vector quantization in an LPC-coefficient quantizer 2, after which the quantization indices are transmitted to the decoder side. FIG. 16 is a diagram useful in describing the quantization method. Here a large number of sets of quantized LPC coefficients have been stored in a quantization table 2a in correspondence with index numbers 1 to n. A distance calculation unit 2b calculates distance in accordance with the following equation:
$$d = W \cdot \sum_{i=1}^{p} \{\alpha_q(i) - \alpha_i\}^2$$
When q is varied from 1 to n, a minimum-distance index detector 2c finds the q for which the distance d is minimum and sends the index q to the decoder side. In this case, an LPC synthesis filter constituting an auditory weighting synthesis filter 3 is expressed by the following equation:

$$H_q(z) = \frac{1}{1 + \sum_{i=1}^{p} \alpha_q(i)\, z^{-i}} \tag{2}$$

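The index search performed by the distance calculation unit 2b and the minimum-distance index detector 2c can be sketched as follows. This is only an illustrative sketch: the function name `nearest_codebook_index`, the scalar weight argument and the toy table values are assumptions made for the example and do not come from the description above.

```python
def nearest_codebook_index(table, alpha, weight=1.0):
    """Return (q, d): the 1-based index q that minimizes
    d = W * sum_i (alpha_q(i) - alpha_i)^2, and that minimum distance."""
    best_q, best_d = None, float("inf")
    for q, entry in enumerate(table, start=1):
        d = weight * sum((aq - a) ** 2 for aq, a in zip(entry, alpha))
        if d < best_d:
            best_q, best_d = q, d
    return best_q, best_d


# Toy quantization table with n = 3 entries of order p = 2 (made-up values).
table = [[0.5, -0.2], [0.9, -0.4], [0.1, 0.3]]
q, d = nearest_codebook_index(table, [0.8, -0.5])
print(q, d)
```

Only the index q needs to be transmitted, since the decoder is assumed to hold the same table.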
Next, quantization of the sound-source signal is carried out. In accordance with CELP, a sound-source signal is divided into two components, namely a pitch-period component and a noise component. An adaptive codebook 4 storing a sequence of past sound-source signals is used to quantize the pitch-period component, and an algebraic codebook or noise codebook is used to quantize the noise component. Described below is typical CELP-type voice encoding using the adaptive codebook 4 and an algebraic codebook 5 as sound-source codebooks.
The adaptive codebook 4 is adapted to successively output N samples of sound-source signals (referred to as “periodicity signals”), which are delayed by one pitch (one sample), in association with indices 1 to L. FIG. 17 is a diagram showing the structure of the adaptive codebook 4 in the case of L=147 and a frame of 80 samples (N=80). The adaptive codebook is constituted by a buffer BF for storing the pitch-period component of the latest 227 samples. A periodicity signal comprising 1 to 80 samples is specified by index 1, a periodicity signal comprising 2 to 81 samples is specified by index 2, . . . , and a periodicity signal comprising 147 to 227 samples is specified by index 147.
An adaptive-codebook search is performed in accordance with the following procedure: First, a pitch lag L representing lag from the present frame is set to an initial value L0 (e.g., 20). Next, a past periodicity signal (adaptive code vector) PL, which corresponds to the lag L, is extracted from the adaptive codebook 4. That is, an adaptive code vector PL indicated by index L is extracted and PL is input to the auditory weighting synthesis filter 3 to obtain an output APL, where A represents the impulse response of the auditory weighting synthesis filter 3 constructed by cascading an auditory weighting filter W(z) and an LPC synthesis filter Hq(z).
Any filter can be used as the auditory weighting filter. For example, it is possible to use a filter having the characteristic indicated by the following equation:

$$W(z) = \frac{1 + \sum_{i=1}^{m} g_1^{\,i}\, \alpha_i z^{-i}}{1 + \sum_{i=1}^{m} g_2^{\,i}\, \alpha_i z^{-i}} \tag{3}$$

where g1, g2 are parameters for adjusting the characteristic of the weighting filter.
An arithmetic unit 6 finds an error power EL between the input voice and APL in accordance with the following equation:
$$E_L = |X - \beta A P_L|^2 \tag{4}$$
If we let APL represent a weighted synthesized output from the adaptive codebook, Rpp the autocorrelation of APL and Rxp the cross-correlation between APL and the input signal X, then an adaptive code vector PL at a pitch lag Lopt for which the error power of Equation (4) is minimum will be expressed by the following equation:

$$P_L = \arg\max\left(\frac{R_{xp}^2}{R_{pp}}\right) = \arg\max\left[\frac{(X^T A P_L)^2}{(A P_L)^T (A P_L)}\right] \tag{5}$$

where T signifies a transposition. Accordingly, an error-power evaluation unit 7 finds the pitch lag Lopt that satisfies Equation (5). Optimum pitch gain βopt is given by the following equation:

$$\beta_{opt} = \frac{R_{xp}}{R_{pp}} \tag{6}$$
Though the search range of lag L is arbitrary, a lag range of 20 to 147 can be used in a case where the sampling frequency of the input signal is 8 kHz.
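The adaptive-codebook search of Equations (4) to (6) can be sketched in a few lines. The sketch below rests on several assumptions that are not part of the description above: the past excitation is held in a list whose last element is the most recent sample, the lag is counted back from the end of that list, vectors for lags shorter than the frame are extended periodically, and the function names and toy signal are purely illustrative.

```python
def apply_filter(a, v):
    """Return A v, where A is the lower-triangular convolution matrix
    built from the impulse response a (y(n) = sum_i a(n - i) v(i))."""
    return [sum(a[n - i] * v[i] for i in range(max(0, n - len(a) + 1), n + 1))
            for n in range(len(v))]


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def adaptive_codebook_search(x, history, a, lag_min=20, lag_max=147):
    """Return (lag, gain) maximizing Rxp^2 / Rpp, with gain = Rxp / Rpp."""
    best_score, best_lag, best_gain = -1.0, None, 0.0
    for lag in range(lag_min, lag_max + 1):
        start = len(history) - lag
        # Assumption: repeat the last `lag` samples when lag < frame length.
        p = [history[start + (i % lag)] for i in range(len(x))]
        ap = apply_filter(a, p)
        r_xp, r_pp = dot(x, ap), dot(ap, ap)
        if r_pp > 0.0 and r_xp * r_xp / r_pp > best_score:
            best_score, best_lag, best_gain = r_xp * r_xp / r_pp, lag, r_xp / r_pp
    return best_lag, best_gain


# Toy data: a past excitation with period 30 and a short impulse response.
history = [float((7 * i) % 30 - 15) for i in range(150)]
a = [1.0, 0.5, 0.25]
p_true = [history[150 - 30 + (i % 30)] for i in range(80)]
x = [0.5 * v for v in apply_filter(a, p_true)]  # target built with gain 0.5
lag, gain = adaptive_codebook_search(x, history, a)
print(lag, gain)
```

Because the toy target is an exactly scaled copy of the filtered lag-30 vector, the search is expected to return that lag together with the scaling factor as the gain.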
Next, the noise component contained in the sound-source signal is quantized using the algebraic codebook 5. The algebraic codebook 5 is constituted by a plurality of pulses of amplitude 1 or −1. By way of example, FIG. 18 illustrates pulse positions for a case where frame length is 40 samples. The algebraic codebook 5 divides the N (=40) sampling points constituting one frame into a plurality of pulse-system groups 1 to 4 and, for all combinations obtained by extracting one sampling point from each of the pulse-system groups, successively outputs, as noise components, pulsed signals having a +1 or a −1 pulse at each extracted sampling point. In this example, basically four pulses are deployed per frame. FIG. 19 is a diagram useful in describing sampling points assigned to each of the pulse-system groups 1 to 4.
(1) Eight sampling points 0, 5, 10, 15, 20, 25, 30, 35 are assigned to the pulse-system group 1;
(2) eight sampling points 1, 6, 11, 16, 21, 26, 31, 36 are assigned to the pulse-system group 2;
(3) eight sampling points 2, 7, 12, 17, 22, 27, 32, 37 are assigned to the pulse-system group 3; and
(4) 16 sampling points 3, 4, 8, 9, 13, 14, 18, 19, 23, 24, 28, 29, 33, 34, 38, 39 are assigned to the pulse-system group 4.
Three bits are required to express one of the sampling points in pulse-system groups 1 to 3 and one bit is required to express the sign of a pulse, for a total of four bits. Further, four bits are required to express one of the sampling points in pulse-system group 4 and one bit is required to express the sign of a pulse, for a total of five bits. Accordingly, 17 bits are necessary to specify a pulsed signal output from the algebraic codebook 5 having the pulse placement of FIG. 18, and 2^17 (= 2^4 × 2^4 × 2^4 × 2^5) types of pulsed signals exist.
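The bit count above can be checked with a short calculation. The sketch below simply rebuilds the four pulse-system groups listed above and counts position and sign bits; the variable and function names are illustrative only.

```python
# Sampling points of pulse-system groups 1 to 4 for a 40-sample frame.
tracks = {
    1: list(range(0, 40, 5)),
    2: list(range(1, 40, 5)),
    3: list(range(2, 40, 5)),
    4: sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),
}


def bits_for_track(positions):
    """Bits needed to pick one position, plus one bit for the pulse sign."""
    position_bits = (len(positions) - 1).bit_length()
    return position_bits + 1


total_bits = sum(bits_for_track(p) for p in tracks.values())

# Number of distinct pulsed signals: (positions x 2 signs) per group.
num_signals = 1
for p in tracks.values():
    num_signals *= 2 * len(p)

print(total_bits, num_signals)
```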
The algebraic codebook search will now be described with regard to this example. The pulse positions of each of the pulse-system groups are limited as illustrated in FIG. 18. In the algebraic codebook search, a combination of pulses for which the error power relative to the input voice is minimized in the reconstruction region is decided from among the combinations of pulse positions of each of the pulse systems. More specifically, with βopt as the optimum pitch gain found by the adaptive codebook search, the output PL of the adaptive codebook is multiplied by the gain βopt and the product is input to an adder 8. At the same time, the pulsed signals are input successively to the adder 8 from the algebraic codebook 5 and a pulsed signal is specified that will minimize the difference between the input signal X and a reconstructed signal obtained by inputting the adder output to the weighting synthesis filter 3.
More specifically, first a target vector X′ for an algebraic codebook search is generated in accordance with the following equation from the optimum adaptive codebook output PL and optimum pitch gain βopt obtained from the input signal X by the adaptive codebook search:

$$X' = X - \beta_{opt} A P_L \tag{7}$$
In this example, pulse position and amplitude (sign) are expressed by 17 bits and therefore 2^17 combinations exist, as mentioned above. Accordingly, letting Ck represent a kth algebraic-code output vector, a code vector Ck that will minimize an evaluation-function error output power D in the following equation is found by a search of the algebraic codebook:

$$D = |X' - \gamma A C_k|^2 \tag{8}$$
where γ represents the gain of the algebraic codebook. Minimizing Equation (8) is equivalent to finding the Ck, i.e., the k, that will maximize the following equation:

$$D' = \frac{(X'^T A C_k)^2}{(A C_k)^T (A C_k)} \tag{9}$$

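The step from Equation (8) to Equation (9) can be made explicit with a short derivation, given here as a sketch that uses only the symbols already defined (γopt denotes the gain that minimizes D for a fixed k). Expanding Equation (8) gives

$$D = X'^T X' - 2\gamma\, X'^T A C_k + \gamma^2 (A C_k)^T (A C_k)$$

Setting the derivative of D with respect to γ to zero yields

$$\gamma_{opt} = \frac{X'^T A C_k}{(A C_k)^T (A C_k)}$$

and substituting this gain back into D gives

$$D_{min} = X'^T X' - \frac{(X'^T A C_k)^2}{(A C_k)^T (A C_k)}$$

Because X′ᵀX′ does not depend on k, the error power D is smallest for the k at which the second term, namely D′ of Equation (9), is largest.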
The error-power evaluation unit 7 searches for k as set forth below.
If we let Φ = AᵀA and d = X′ᵀA hold, then the above will be expressed as follows:

$$D' = \frac{(d\, C_k)^2}{C_k^T \Phi\, C_k} = \frac{Q_k^2}{E_k} \tag{10}$$

If we let the elements of the impulse response be a(0), a(1), . . . , a(N−1) and let the elements of the target signal X′ be x′(0), x′(1), . . . , x′(N−1), then d will be expressed by the following equation, where N is the frame length:

$$d(n) = \sum_{i=n}^{N-1} x'(i)\, a(i-n), \quad n = 0, \ldots, N-1 \tag{11}$$

Further, an element φ(i,j) of Φ is represented by the following equation:

$$\varphi(i,j) = \sum_{n=j}^{N-1} a(n-i)\, a(n-j), \quad i = 0, \ldots, N-1, \quad j = i, \ldots, N-1 \tag{12}$$

It should be noted that d(n) and φ(i,j) are calculated before the search of the algebraic codebook.
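The precomputation of the correlation vector and the correlation matrix in Equations (11) and (12) can be sketched as below. The function names are illustrative, the impulse response is assumed to be given as a list of N values a(0) to a(N−1), and the lower triangle of the matrix is filled by symmetry for convenience, which is an addition not stated above.

```python
def correlation_vector(x_target, a):
    """d(n) = sum_{i=n}^{N-1} x'(i) a(i - n), Equation (11)."""
    n_len = len(x_target)
    return [sum(x_target[i] * a[i - n] for i in range(n, n_len))
            for n in range(n_len)]


def impulse_correlation_matrix(a):
    """phi(i, j) = sum_{n=j}^{N-1} a(n - i) a(n - j) for j >= i, Equation (12)."""
    n_len = len(a)
    phi = [[0.0] * n_len for _ in range(n_len)]
    for i in range(n_len):
        for j in range(i, n_len):
            value = sum(a[n - i] * a[n - j] for n in range(j, n_len))
            phi[i][j] = value
            phi[j][i] = value  # symmetric fill (convenience assumption)
    return phi


# Toy example with N = 4 (made-up values).
a = [1.0, 0.5, 0.0, 0.0]
x_target = [1.0, 2.0, 3.0, 4.0]
d = correlation_vector(x_target, a)
phi = impulse_correlation_matrix(a)
print(d)
print(phi[0])
```

Computing both quantities once per frame means that each candidate code vector can later be evaluated from a handful of table lookups instead of a full filtering operation.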
If we let Np represent the number of pulses contained in the output vector Ck of the algebraic codebook 5, then Qk in the numerator of Equation (10) is represented by the following equation:

$$Q_k = \sum_{i=0}^{N_p-1} s_k(i)\, d[m_k(i)] \tag{13}$$

where sk(i) is the pulse amplitude (+1 or −1) in the ith pulse system of Ck and mk(i) represents the position of the pulse. Further, the denominator Ek of Equation (10) is found by the following equation:

$$E_k = \sum_{i=0}^{N_p-1} \varphi[m_k(i), m_k(i)] + 2 \sum_{i=0}^{N_p-2} \sum_{j=i+1}^{N_p-1} s_k(i)\, s_k(j)\, \varphi[m_k(i), m_k(j)] \tag{14}$$

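A direct search based on the numerator Qk and the denominator Ek defined above can be sketched as follows. This is a simplified illustration under stated assumptions: every combination of one position per pulse-system group is visited exhaustively, and each pulse is given the sign of d at its position (an assumption made here to avoid enumerating signs). The function names and the toy data are hypothetical.

```python
from itertools import product


def sign(v):
    return 1 if v >= 0 else -1


def search_algebraic_codebook(d, phi, tracks):
    """Return (positions, signs) maximizing Qk^2 / Ek over all combinations
    of one pulse position per track."""
    best_score, best_positions, best_signs = -1.0, None, None
    for positions in product(*tracks):
        # Assumption: each pulse takes the sign of d at its position.
        signs = tuple(sign(d[m]) for m in positions)
        q = sum(s * d[m] for s, m in zip(signs, positions))
        e = sum(phi[m][m] for m in positions)
        for i in range(len(positions) - 1):
            for j in range(i + 1, len(positions)):
                e += 2 * signs[i] * signs[j] * phi[positions[i]][positions[j]]
        if e > 0 and q * q / e > best_score:
            best_score, best_positions, best_signs = q * q / e, positions, signs
    return best_positions, best_signs


# Toy example: 8-sample frame, four tracks of two positions each, and an
# identity correlation matrix (i.e. a filter that passes the signal unchanged).
toy_tracks = [[0, 4], [1, 5], [2, 6], [3, 7]]
toy_d = [0.1, -2.0, 0.3, 0.4, -1.0, 0.2, -3.0, 0.5]
toy_phi = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
positions, signs = search_algebraic_codebook(toy_d, toy_phi, toy_tracks)
print(positions, signs)
```

With an identity correlation matrix the denominator is the same for every candidate, so the search reduces to picking the position with the largest |d| in each track.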
It is also possible to conduct a search using Qk in Equation (13) and Ek in Equation (14). However, in order to reduce the amount of processing involved in the search, Qk and Ek are transformed through the following procedure: First, d(n) is split into two portions, namely its absolute value |d(n)| and sign sign[d(n)]. Next, the sign information of d(n) is included in Φ by the following equation:

$$\varphi'(i,j) = \mathrm{sign}[d(i)]\, \mathrm{sign}[d(j)]\, \varphi(i,j), \quad i = 0, \ldots, N-1, \quad j = i+1, \ldots, N-1 \tag{15}$$
In order to eliminate the constant 2 in the second term of Equation (14), the main diagonal component of Φ is scaled by the following equation:

$$\varphi'(i,i) = \varphi(i,i)/2, \quad i = 0, \ldots, N-1 \tag{16}$$
Accordingly, the numerator Qk is simplified as indicated by the following equation:

$$Q_k' = \sum_{i=0}^{N_p-1} \left| d[m_k(i)] \right| \tag{17}$$

Further, the denominator Ek is simplified as indicated by the following equation:

$$E_k' = E_k / 2 = \sum_{i=0}^{N_p-1} \varphi'[m_k(i), m_k(i)] + \sum_{i=0}^{N_p-2} \sum_{j=i+1}^{N_p-1} s_k(i)\, s_k(j)\, \varphi'[m_k(i), m_k(j)] \tag{18}$$

Accordingly, the output of the algebraic codebook can be obtained by calculating the numerator Qk′ and denominator Ek′ in accordance with Equations (17), (18) while changing the position of each pulse, and deciding the pulse position for which D″ = Qk′²/Ek′ is maximized.
Next, quantization of the gains βopt, γopt is carried out. The gain quantization method is arbitrary, and a method such as scalar quantization or vector quantization can be used. For example, it is so arranged that β, γ are quantized and the quantization indices of the gain are transmitted to the decoder through a method similar to that employed by the LPC-coefficient quantizer 2.
Thus, an output information selector 9 sends the decoder (1) the quantization index of the LPC coefficient, (2) pitch lag Lopt, (3) an algebraic codebook index (pulsed-signal specifying data), and (4) a quantization index of gain.
Further, after all search processing and quantization processing in the present frame is completed, and before the input signal of the next frame is processed, the state of the adaptive codebook 4 is updated. In state updating, a frame length of the sound-source signal of the oldest frame (the frame farthest in the past) in the adaptive codebook is discarded and a frame length of the latest sound-source signal found in the present frame is stored. It should be noted that the initial state of the adaptive codebook 4 is the zero state, i.e., a state in which the amplitudes of all samples are zero.
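The state update described above may be sketched as follows, assuming that the adaptive codebook is held as a flat list of past sound-source samples with the oldest sample first. The function names and the buffer representation are assumptions made for illustration.

```python
def initial_adaptive_codebook(size):
    """Zero state: the amplitudes of all samples are zero."""
    return [0.0] * size


def update_adaptive_codebook(codebook, excitation):
    """Discard the oldest frame length of samples and store the latest frame."""
    n = len(excitation)
    if n > len(codebook):
        raise ValueError("frame is longer than the adaptive codebook")
    # Drop the n oldest samples at the front, append the newest n at the end.
    return codebook[n:] + list(excitation)
```

The length of the buffer is unchanged by the update, so the same range of pitch lags remains available in every frame.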
Thus, as described above, the CELP system produces a model of the speech generation process, quantizes the characteristic parameters of this model and transmits the parameters, thereby making it possible to compress speech efficiently.
It is known that CELP (and improvements therein) makes it possible to realize high-quality reconstructed speech at a bit rate on the order of 8 to 16 kbps. Among these schemes, ITU-T Recommendation G.729A (CS-ACELP) makes it possible to achieve a sound quality equal to that of 32-kbps ADPCM on the condition of a low bit rate of 8 kbps. From the standpoint of effective utilization of the communication channel, however, there is now a need to implement high-quality reconstructed speech at a very low bit rate of less than 4 kbps.
The simplest method of reducing bit rate is to raise the efficiency of vector quantization by increasing frame length, which is the unit of encoding. The CS-ACELP frame length is 5 ms (40 samples) and, as mentioned above, the noise component of the sound-source signal is vector-quantized at 17 bits per frame. Consider a case where frame length is made 10 ms (=80 samples), which is twice that of CS-ACELP, and the number of quantization bits assigned to the algebraic codebook per frame is 17.
FIG. 20 illustrates an example of pulse placement in a case where four pulses reside in a 10-ms frame. The pulses (sampling points and polarities) of first to third pulse systems in FIG. 20 are each represented by five bits and the pulses of a fourth pulse system are represented by six bits, so that 21 bits are necessary to express the indices of the algebraic codebook. That is, in a case where the algebraic codebook is used, if frame length is simply doubled to 10 ms, the combinations of pulses increase by an amount commensurate with the increase in positions at which pulses reside unless the number of pulses per frame is reduced. As a consequence, the number of quantization bits also increases.
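The bit counts given above can be checked with the following short sketch. It assumes that each pulse is encoded as a position index plus one polarity bit, and that the numbers of candidate positions per pulse system are 8, 8, 8 and 16 for the 5-ms frame and 16, 16, 16 and 32 for the 10-ms frame. These position counts are inferred from the stated totals of 17 and 21 bits and are not quoted from the figures.

```python
from math import ceil, log2


def track_bits(num_positions):
    """Bits for one pulse: position index plus one polarity bit."""
    return ceil(log2(num_positions)) + 1


def codebook_bits(track_sizes):
    """Total bits of the algebraic codebook index for one frame."""
    return sum(track_bits(n) for n in track_sizes)


def bits_per_second(bits_per_frame, frame_ms):
    """Bit rate contributed by the algebraic codebook index alone."""
    return bits_per_frame * 1000 / frame_ms


bits_5ms = codebook_bits([8, 8, 8, 16])      # four pulses in a 5-ms frame
bits_10ms = codebook_bits([16, 16, 16, 32])  # four pulses in a 10-ms frame
```

Under these assumptions, doubling the frame length adds one bit to the position index of every pulse, which accounts for the increase from 17 to 21 bits.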
In the case of this example, the only method available to make the number of bits of the algebraic codebook indices equal to 17 is to reduce the number of pulses, as illustrated in FIG. 21 by way of example. However, on the basis of experiments performed by the Inventor, it has been found that the quality of reconstructed speech deteriorates markedly when the number of pulses per frame is made three or less. This phenomenon can be readily understood qualitatively. Specifically, if there are four pulses per frame (FIG. 18) in a case where the frame length is 5 ms, then eight pulses will be present in 10 ms. By contrast, if there are three pulses per frame (FIG. 21) in a case where the frame length is 10 ms, then naturally only three pulses will be present in 10 ms. As a consequence, the noise property of the sound-source signal to be represented in the algebraic codebook cannot be expressed and the quality of reconstructed speech declines.
Thus, even if frame length is enlarged to reduce the bit rate, the bit rate cannot be reduced unless the number of pulses per frame is reduced. If the number of pulses is reduced, however, the quality of reconstructed speech deteriorates by a wide margin. Accordingly, with the method of raising the efficiency of vector quantization simply by increasing frame length, achieving high-quality reconstructed speech at a bit rate of 4 kbps is difficult.
Accordingly, an object of the present invention is to make it possible to reduce the bit rate and reconstruct high-quality speech.
In CELP, an encoder sends a decoder (1) a quantization index of an LPC coefficient, (2) pitch lag Lopt of an adaptive codebook, (3) an algebraic codebook index (pulsed-signal specifying data), and (4) a quantization index of gain. In this case, eight bits are necessary to transmit the pitch lag. If pitch lag need not be sent, therefore, the number of bits used to express the algebraic codebook index can be increased commensurately. In other words, the number of pulses contained in the pulsed signal output from the algebraic codebook can be increased and it therefore becomes possible to transmit high-quality voice code and to achieve high-quality reproduction. It is generally known that a steady segment of speech is such that the pitch period varies slowly. The quality of reconstructed speech will suffer almost no deterioration in the steady segment even if pitch lag of the present frame is regarded as being the same as pitch lag in a past (e.g., the immediately preceding) frame.
According to the present invention, therefore, there are provided an encoding mode 1 that uses pitch lag obtained from an input signal of a present frame and an encoding mode 2 that uses pitch lag obtained from an input signal of a past frame. A first algebraic codebook having a small number of pulses is used in encoding mode 1, and a second algebraic codebook having a large number of pulses is used in encoding mode 2. When encoding is performed, an encoder carries out encoding frame by frame in each of encoding modes 1 and 2 and sends a decoder the code obtained in whichever mode enables more accurate reconstruction of the input signal. If this arrangement is adopted, the bit rate can be reduced and it becomes possible to reconstruct high-quality speech.
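The frame-by-frame selection between the two modes may be sketched as follows. The sketch assumes that each mode is available as a callable returning a pair consisting of the code and the reconstructed signal, and that accuracy of reconstruction is measured by squared error. Both the callables and the error measure are assumptions made for illustration.

```python
def squared_error(target, reconstruction):
    """Sum of squared differences between the input and its reconstruction."""
    return sum((t - r) ** 2 for t, r in zip(target, reconstruction))


def select_mode(target, encode_mode1, encode_mode2):
    """Encode the frame in both modes and keep the more accurate one.

    encode_mode1, encode_mode2 : hypothetical callables that take the frame
    and return (code, reconstructed_signal).
    """
    code1, recon1 = encode_mode1(target)
    code2, recon2 = encode_mode2(target)
    err1 = squared_error(target, recon1)
    err2 = squared_error(target, recon2)
    # Ties are resolved in favor of mode 1 (an arbitrary choice in this sketch).
    return (1, code1) if err1 <= err2 else (2, code2)
```

The returned mode number would be sent to the decoder together with the code so that the decoder can select the corresponding algebraic codebook.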
Further, there are provided an encoding mode 1 that uses pitch lag obtained from an input signal of a present frame and an encoding mode 2 that uses pitch lag obtained from an input signal of a past frame. A first algebraic codebook having a small number of pulses is used in encoding mode 1, and a second algebraic codebook in which the number of pulses is greater than that of the first algebraic codebook is used in encoding mode 2. When encoding is performed, the optimum mode is decided based upon a property of the input signal, e.g., the periodicity of the input signal, and encoding is carried out in the mode decided. If this arrangement is adopted, the bit rate can be reduced and it becomes possible to reconstruct high-quality speech.
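A mode decision based on periodicity may be sketched as follows. The sketch assumes that periodicity is measured by the normalized correlation of the input signal at the pitch lag of the past frame, and that encoding mode 2 is chosen when this value reaches a threshold. The measure, the threshold of 0.8 and the function names are hypothetical.

```python
def normalized_correlation(signal, lag):
    """Normalized correlation between the signal and its copy delayed by lag."""
    idx = range(lag, len(signal))
    num = sum(signal[n] * signal[n - lag] for n in idx)
    e_now = sum(signal[n] ** 2 for n in idx)
    e_past = sum(signal[n - lag] ** 2 for n in idx)
    if e_now == 0 or e_past == 0:
        return 0.0
    return num / (e_now * e_past) ** 0.5


def decide_mode(signal, past_lag, threshold=0.8):
    """Return 2 when the signal is strongly periodic at the past lag, else 1."""
    return 2 if normalized_correlation(signal, past_lag) >= threshold else 1
```

Because the decision is made before any codebook search, only one mode needs to be searched per frame, in contrast to encoding in both modes and comparing the results.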