1. Field of the Invention
The present invention relates to a voice coding/decoding technology based on A-b-S (Analysis-by-Synthesis) vector quantization.
2. Description of the Related Art
The voice coding system represented by the CELP (Code Excited Linear Prediction) coding system, which is based on A-b-S vector quantization, is applied when the transmission rate of a PCM voice signal is compressed from, for example, 64 kbits/sec (kilobits per second) to approximately 4 through 16 kbits/sec. Such a voice coding system is in demand as a means of compressing information while maintaining voice quality in in-house communications systems, digital mobile radio systems, etc.
FIG. 1 shows the conventional A-b-S vector quantization system. 51 is a code book, 52 is a gain unit, 53 is a linear prediction synthesis filter, 54 is a subtracter, and 55 is an error power evaluation unit.
In an A-b-S vector quantization coder, the gain unit 52 first multiplies the code vector C read from the code book 51 by a gain g. Then, the linear prediction synthesis filter 53 receives the scaled code vector, and outputs a reproduced signal gAC. Then, the subtracter 54 subtracts the reproduced signal gAC from an input signal X, thereby outputting an error signal E which indicates the difference between them. Furthermore, the error power evaluation unit 55 computes an error power from the error signal E. The above described process is performed on all code vectors C in the code book 51 with their optimal gains g. The index of the code vector C and the gain g which generate the smallest error power are thereby obtained, and they are transmitted to a decoder.
In an A-b-S vector quantization decoder, the code vector C corresponding to the index transmitted from the coder is read from the code book 51. Then, the gain unit 52 scales the code vector C by the gain g transmitted from the coder. Then, the linear prediction synthesis filter 53 inputs the scaled code vector, and outputs the decoded regenerated signal gAC. The decoder does not require the subtracter 54 and the error power evaluation unit 55.
As described above, in the A-b-S vector quantization coder, an analyzing process is performed while a synthesizing (decoding) process is performed on each code vector C.
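The search loop of FIG. 1 can be sketched as follows. This is a minimal illustration, not the implementation of any particular coder: the function names `synthesize` and `abs_search` are hypothetical, the synthesis filter is assumed to be applied as a zero-state convolution with a truncated impulse response, and the gain is assumed to be the least-squares optimum for each code vector.

```python
def synthesize(code_vector, impulse_response):
    """Zero-state synthesis filtering (assumed form): truncated convolution."""
    return [
        sum(impulse_response[k] * code_vector[t - k]
            for k in range(min(t + 1, len(impulse_response))))
        for t in range(len(code_vector))
    ]


def abs_search(x, code_book, impulse_response):
    """Return (index, gain, error_power) of the entry minimizing |x - g*A*c|^2."""
    best = None
    for index, c in enumerate(code_book):
        ac = synthesize(c, impulse_response)          # output of filter 53 for gain 1
        energy = sum(v * v for v in ac)
        if energy == 0.0:
            continue                                  # a silent entry cannot reduce the error
        corr = sum(xv * av for xv, av in zip(x, ac))
        gain = corr / energy                          # least-squares optimal gain g
        err = sum((xv - gain * av) ** 2 for xv, av in zip(x, ac))  # units 54 and 55
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


# Toy data: the target is built from entry 1 scaled by 2, so the search
# should select that entry again.
code_book = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [1.0, -1.0, 0.0, 0.0]]
h = [1.0, 0.5, 0.25]
x = synthesize([0.0, 2.0, 0.0, 0.0], h)
index, gain, err = abs_search(x, code_book, h)
```

Only `index` and `gain` would be transmitted; the decoder repeats the scaling and `synthesize` steps without the error evaluation.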
FIG. 2 shows a typical conventional CELP system based on the above described A-b-S vector quantization system.
In this CELP system, two types of code books are used, that is, an adaptive code book corresponding to a periodic (pitch) sound source and a fixed code book corresponding to a noisy (random) sound source. According to this system, an A-b-S vector quantizing process mainly for periodic voice (voiced sound, etc.) and a succeeding A-b-S vector quantizing process mainly for noisy voice (unvoiced sound, background sound, etc.) are sequentially performed based on the respective code books.
In FIG. 2, 61 is a fixed code book, 62 is an adaptive code book, 63 and 64 are gain units, 65 and 66 are linear prediction synthesis filters, 67 and 68 are error power evaluation units, and 69 and 70 are subtracters. The fixed code book 61 corresponding to a random sound source and the adaptive code book 62 corresponding to a pitch sound source are each stored in memory. The gain units 63 and 64, the linear prediction synthesis filters 65 and 66, the error power evaluation units 67 and 68, and the subtracters 69 and 70 can be realized by operation elements such as a DSP (digital signal processor), etc.
In the CELP coder with the above described configuration, the portion comprising the adaptive code book 62, the gain unit 64, the linear prediction synthesis filter 66, the subtracter 70, and the error power evaluation unit 68 outputs a transmission parameter effective for periodic voice. P indicates an adaptive code vector output from the adaptive code book 62, b indicates a gain in the gain unit 64, and A indicates the transfer characteristic of the linear prediction synthesis filter 66.
The coding process performed by this portion is based on the same principle as the coding process performed by the code book 51, the gain unit 52, the linear prediction synthesis filter 53, the subtracter 54, and the error power evaluation unit 55 shown in FIG. 1. However, the samples in the adaptive code book 62 change adaptively through the feedback of a previous excitation signal. The decoder performs a process similar to the decoding process performed by the code book 51, the gain unit 52, and the linear prediction synthesis filter 53 described above by referring to FIG. 1. In this case also, the samples in the adaptive code book 62 change adaptively through the feedback of a previous excitation signal.
On the other hand, the portion comprising the fixed code book 61, the gain unit 63, the linear prediction synthesis filter 65, the subtracter 69, and the error power evaluation unit 67 outputs a transmission parameter effective for the noisy signal X′. The signal X′ is output by the subtracter 70, which subtracts the optimum reproduced signal bAP output by the linear prediction synthesis filter 66 from the input signal X. The coding process by this portion is based on the same principle as the coding process by the code book 51, the gain unit 52, the linear prediction synthesis filter 53, the subtracter 54, and the error power evaluation unit 55. In this case, the fixed code book 61 preliminarily stores fixed samples. The decoder performs a process similar to the decoding process performed by the code book 51, the gain unit 52, and the linear prediction synthesis filter 53 described above by referring to FIG. 1.
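The sequential search of FIG. 2 can be illustrated with the following sketch. It is a simplified model under stated assumptions: the helper names `synthesize`, `best_entry`, and `celp_two_stage` are hypothetical, both synthesis filters are assumed to share one impulse response, every code book is assumed to contain at least one non-silent entry, and the adaptive update of the code book 62 by excitation feedback is not modeled.

```python
def synthesize(excitation, h):
    """Zero-state filtering of an excitation vector by the impulse response h."""
    return [
        sum(h[k] * excitation[t - k] for k in range(min(t + 1, len(h))))
        for t in range(len(excitation))
    ]


def best_entry(target, code_book, h):
    """Pick the entry maximizing corr^2 / energy; return (index, gain, synthesized)."""
    best_index, best_gain, best_out, best_score = None, 0.0, None, -1.0
    for index, vector in enumerate(code_book):
        out = synthesize(vector, h)
        energy = sum(v * v for v in out)
        if energy == 0.0:
            continue                                  # skip silent entries
        corr = sum(a * b for a, b in zip(target, out))
        score = corr * corr / energy                  # larger score means smaller error
        if score > best_score:
            best_index, best_gain, best_out, best_score = index, corr / energy, out, score
    return best_index, best_gain, best_out


def celp_two_stage(x, adaptive_book, fixed_book, h):
    """Adaptive (pitch) search first, then fixed (noise) search on the remainder."""
    p_index, b, ap = best_entry(x, adaptive_book, h)          # gain b, signal AP
    x_residual = [xv - b * v for xv, v in zip(x, ap)]         # X' = X - bAP (subtracter 70)
    c_index, g, _ = best_entry(x_residual, fixed_book, h)     # gain g on X'
    return p_index, b, c_index, g
```

The second stage sees only what the first stage failed to explain, which is why the fixed code book ends up modeling the noise-like component.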
The fixed code book 61 preliminarily stores random code vectors C corresponding to fixed sample values. Therefore, for example, assuming that the vector dimension length is 40 (corresponding to the number of samples in a period of 5 msec (milliseconds) when the sampling frequency is 8 kHz), and that the number of vectors (the code book size) is 1024, the fixed code book 61 requires a memory capacity of 40 k (kilo) words.
That is, a large memory capacity is required by the fixed code book 61 to independently store all sample values. This is a big problem to be solved when the CELP voice codec is realized.
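The figures in this example can be checked with a few lines of arithmetic. The function name `fixed_code_book_cost` is hypothetical, and one memory word per stored sample is assumed.

```python
def fixed_code_book_cost(sampling_hz, frame_msec, code_book_size):
    """Return (samples per vector, stored words, index bits) for a stored code book."""
    dimension = sampling_hz * frame_msec // 1000      # samples in one frame
    words = dimension * code_book_size                # one word per sample (assumed)
    index_bits = (code_book_size - 1).bit_length()    # bits needed to address an entry
    return dimension, words, index_bits


dimension, words, index_bits = fixed_code_book_cost(8000, 5, 1024)
# dimension is 40, words is 40960 (40 k words), index_bits is 10
```

The index itself is cheap to transmit (10 bits here); the cost lies in storing every sample of every vector.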
To solve this problem, an ACELP (Algebraic Code Excited Linear Prediction) system has been proposed, in which the code book searching process is performed algebraically by arranging a small number of non-zero sample values at fixed positions (refer to J. P. Adoul et al., ‘Fast CELP coding based on algebraic codes’, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, pp. 1957-1960 (April, 1987)).
FIG. 3 shows the configuration of the conventional ACELP system using an algebraic code book. An algebraic code book 71 corresponds to the fixed code book 61 shown in FIG. 2, a gain unit 72 corresponds to the gain unit 63 shown in FIG. 2, a linear prediction synthesis filter 73 corresponds to the linear prediction synthesis filter 65 shown in FIG. 2, a subtracter 74 corresponds to the subtracter 69 shown in FIG. 2, and an error power evaluation unit 75 corresponds to the error power evaluation unit 67 shown in FIG. 2. In the A-b-S process shown in FIG. 3, as in the processes described by referring to FIG. 1 or FIG. 2, an A-b-S process is performed using the code vector $C_i$ generated from the algebraic code book 71 corresponding to an index i, and a gain g.
In this ACELP system, the required amount of operations and memory can be considerably reduced by limiting the amplitude values and positions of the non-zero samples. For example, as shown in FIG. 4, the N-dimensional M-size algebraic code book 71 provides code vectors $C_0, C_1, \ldots, C_{M-1}$. Since the number of non-zero samples in a frame is fixed and the non-zero samples are arranged at equal intervals, each of the code vectors $C_0, C_1, \ldots, C_{M-1}$ can be generated algebraically rather than stored. In the example shown in FIG. 4, the sample position of each of the four non-zero samples i0, i1, i2, and i3 is standardized, and the amplitude value is ±1.0. The amplitude at every sample position other than these four sample positions is assumed to be zero.
As shown on the right of the algebraic code book 71 in FIG. 4, the sample value pattern of each code vector is determined by the sample positions i0, i1, i2, and i3, each having an amplitude of ±1, with all other sample positions having an amplitude of zero; for example, the pattern corresponding to the code vector $C_0$ is (0, . . . , 0, +1, 0, . . . , 0, −1, 0, . . . , 0, +1, 0, . . . , 0, −1, 0, . . . ). That is, for a code vector having, as elements, a total of N samples consisting of four non-zero samples and N−4 zero samples, each of the four non-zero samples $i_n$ (n = 0, 1, 2, 3) can be expressed by a total of K+1 bits: 1 bit for the amplitude information (the absolute value of the amplitude is fixed to 1, so the bit indicates only the polarity), and K bits for the position information $m_n$ specifying one of $2^K$ candidates.
The positions of the non-zero samples are standardized in Recommendations G.729 and G.723.1 of the ITU-T (International Telecommunication Union, Telecommunication Standardization Sector).
For example, in the table 77 shown in FIG. 4 corresponding to the standard G.729, each piece of the position information m0 through m2 about the non-zero samples i0 through i2 in the 40 samples corresponding to 1 frame has candidates at 8 positions, so that one position can be specified by 3 bits. The position information m3 about the non-zero sample i3 has candidates at 16 positions, and can be expressed by 4 bits to specify one of the positions. Each piece of the amplitude information s0 through s3 about the non-zero samples i0 through i3 can be expressed by 1 bit because the absolute value of each amplitude is fixed to 1.0, and only the polarity needs to be represented. Therefore, in G.729, the non-zero samples i0 through i3 can be formed by 17-bit data comprising the amplitude information s0 through s3 each being formed by 1 bit and the position information m0 through m3 each being formed by 3 or 4 bits as shown by 76 in FIG. 4.
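The 17-bit total can be reproduced from the candidate counts alone. In the sketch below, the concrete candidate positions are an assumption: they follow the interleaved layout commonly cited for G.729 (pulse i0 on positions 0, 5, ..., 35, and so on, with i3 covering two such tracks) and should be checked against the Recommendation or the table 77 itself. The names `G729_TRACKS` and `index_bits` are hypothetical.

```python
# Assumed candidate positions per pulse in a 40-sample frame.
G729_TRACKS = [
    list(range(0, 40, 5)),                                    # i0: 8 candidates
    list(range(1, 40, 5)),                                    # i1: 8 candidates
    list(range(2, 40, 5)),                                    # i2: 8 candidates
    sorted(list(range(3, 40, 5)) + list(range(4, 40, 5))),    # i3: 16 candidates
]


def index_bits(tracks):
    """Return (position bits per pulse, total sign bits, total index bits)."""
    position_bits = [(len(track) - 1).bit_length() for track in tracks]
    sign_bits = len(tracks)                                   # 1 polarity bit per pulse
    return position_bits, sign_bits, sum(position_bits) + sign_bits


position_bits, sign_bits, total_bits = index_bits(G729_TRACKS)
# position_bits is [3, 3, 3, 4], sign_bits is 4, total_bits is 17
```

With this layout, the four tracks together cover each of the 40 sample positions exactly once.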
In the table 78 shown in FIG. 4 corresponding to the standard G.723.1, the position candidates of the non-zero samples i0 through i3 are determined such that the candidate positions fall on every second sample. Thus, each piece of the position information m0 through m3 about the non-zero samples i0 through i3 can be expressed by 3 bits. As in the standard G.729, each piece of the amplitude information s0 through s3 about the non-zero samples i0 through i3 can be expressed by 1 bit. As described above, in G.723.1, the non-zero samples i0 through i3 can be formed by 16-bit data comprising the amplitude information s0 through s3 each being formed by 1 bit and the position information m0 through m3 each being formed by 3 bits as shown by 76 in FIG. 4.
For example, when the i-th coded word has the values $s_n^i$ and $m_n^i$ (where n = 0, 1, 2, 3), the coded word sample $c_i(n)$ can be defined by the following equation.

$$c_i(n) = s_0^i\,\delta(n - m_0^i) + s_1^i\,\delta(n - m_1^i) + s_2^i\,\delta(n - m_2^i) + s_3^i\,\delta(n - m_3^i) \tag{1}$$
where $s_n^i$ indicates the amplitude information about a non-zero sample, and $m_n^i$ indicates the position information about a non-zero sample. In addition, $\delta(\cdot)$ indicates a delta function defined as follows.

$$\delta(n) = 1 \ \text{for}\ n = 0, \qquad \delta(n) = 0 \ \text{for}\ n \neq 0$$
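Equation 1 translates almost directly into code. The following sketch generates one code vector from its four polarity values and four positions instead of reading it from a stored table; the function names `delta` and `code_vector`, and the example positions, are illustrative choices only.

```python
def delta(n):
    """Delta function of equation 1: 1 at n = 0, otherwise 0."""
    return 1 if n == 0 else 0


def code_vector(signs, positions, length):
    """Evaluate c(n) = sum over pulses of s_k * delta(n - m_k) for n = 0..length-1."""
    return [
        sum(s * delta(n - m) for s, m in zip(signs, positions))
        for n in range(length)
    ]


# Example with arbitrary positions: four unit pulses with alternating polarity.
c = code_vector(signs=(+1, -1, +1, -1), positions=(5, 11, 22, 38), length=40)
```

Only the eight small integers passed to `code_vector` need to be known to the decoder, which is the source of the memory saving.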
In addition, the error power $E^2$ can be expressed by the following equation using the input signal X shown in FIG. 3, the gain g, the code vector $C_i$, and the matrix H of the impulse response of the linear prediction synthesis filter 73.

$$E^2 = \lVert X - g H C_i \rVert^2 \tag{2}$$
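Minimizing the error power $E^2$ with the optimal gain g is equivalent to selecting the code vector that maximizes an evaluation function $F_i$, that is, to computing argmax ($F_i$), where $F_i$ can be expressed by the following equation.

$$F_i = \frac{(X^T H C_i)^2}{(H C_i)^T (H C_i)} \tag{3}$$

Here, assuming that

$$X^T H = D^T, \quad D = \{d(i)\} \tag{4}$$

$$H^T H = \Phi = \{\phi(i, j)\} \tag{5}$$

the evaluation function $F_i$ expressed by the equation 3 can be rewritten as the following equation.

$$F_i = \frac{(D^T C_i)^2}{(C_i)^T \Phi C_i} \tag{6}$$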
where the characters in the upper case indicate vectors or matrices.
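The step from the error power of equation 2 to the evaluation function of equation 3 is not spelled out in the text. One standard way to obtain it, assuming the gain g is chosen as the least-squares optimum for each code vector, is the following.

```latex
\begin{aligned}
E^2 &= X^T X - 2g\,X^T H C_i + g^2\,(H C_i)^T (H C_i), \\
\frac{\partial E^2}{\partial g} = 0
  \;&\Rightarrow\; g = \frac{X^T H C_i}{(H C_i)^T (H C_i)}, \\
E^2_{\min} &= X^T X - \frac{(X^T H C_i)^2}{(H C_i)^T (H C_i)} = X^T X - F_i .
\end{aligned}
```

Because $X^T X$ does not depend on the index i, the code vector with the smallest error power is the one with the largest $F_i$.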
Since the above described equations 4 and 5 contain no elements of the code vector $C_i$, D and Φ can be computed once in advance, regardless of how large the number M of coded word patterns (the code book size) is. Therefore, a higher-speed operation can be performed by the equation 6 than by the equation 3.
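The benefit of precomputation can be seen in a small sketch that evaluates the same code vector by equation 3 and by equation 6. Two assumptions are made here: H is taken to be the lower-triangular Toeplitz matrix built from the impulse response, and D is taken as $H^T X$ so that $D^T C_i$ equals $X^T H C_i$. All function names are hypothetical.

```python
def toeplitz_lower(h, n):
    """n x n lower-triangular Toeplitz matrix of the impulse response (assumed form of H)."""
    return [[h[r - c] if 0 <= r - c < len(h) else 0.0 for c in range(n)] for r in range(n)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(m, v):
    return [dot(row, v) for row in m]


def precompute(x, h_matrix):
    """Compute D = H^T X and Phi = H^T H once per frame (equations 4 and 5)."""
    columns = [list(col) for col in zip(*h_matrix)]           # rows of H^T
    d = mat_vec(columns, x)
    phi = [[dot(ci, cj) for cj in columns] for ci in columns]
    return d, phi


def f_direct(x, h_matrix, c):
    """Equation 3: filters the code vector every time it is evaluated."""
    hc = mat_vec(h_matrix, c)
    return dot(x, hc) ** 2 / dot(hc, hc)


def f_precomputed(d, phi, c):
    """Equation 6: uses only the precomputed D and Phi."""
    return dot(d, c) ** 2 / dot(c, mat_vec(phi, c))
```

In a search over M code vectors, `precompute` runs once while `f_precomputed` runs M times, so no per-candidate filtering is needed.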
The process relating to the code vector $C_i$ is performed on four samples having the amplitude of ±1.0 as described above. Accordingly, the numerator and the denominator of the equation 6 can be respectively obtained by the following equations 7 and 8.
$$(D^T C_i)^2 = \left\{ \sum_{i=0}^{3} s_i\, d(m_i) \right\}^2 \tag{7}$$

$$(C_i)^T \Phi\, C_i = \sum_{i=0}^{3} \phi(m_i, m_i) + 2 \sum_{i=0}^{2} \sum_{j=i+1}^{3} s_i s_j\, \phi(m_i, m_j) \tag{8}$$

where $\sum_{i=0}^{3}$ indicates the accumulation from i = 0 through i = 3.
The amount of operations required by the equations 7 and 8 is small and does not depend on the parameter (number of dimensions) N. Therefore, even if the operations are repeated for each of the M coded word patterns, the total amount of operations is not large. Thus, with the configuration using the algebraic code book 71 shown in FIG. 3, the amount of operations can be reduced much further than with the configuration using the fixed code book 61 shown in FIG. 2. In addition, each code vector output from the algebraic code book 71 can be generated algebraically from the amplitude information (polarity information) and the position information. As a result, it is not necessary to store each code vector in memory, which considerably reduces the memory requirements.
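Equations 7 and 8 can be sketched as below. The function `f_pulses` touches only four entries of D and ten entries of Φ, whatever the frame length N is. The names are hypothetical, Φ is assumed to be symmetric (as $H^T H$ always is), and the dense reference `f_dense` is included only to show that both forms agree on toy data.

```python
def f_pulses(d, phi, signs, positions):
    """Evaluate F from four signed unit pulses using equations 7 and 8."""
    numerator = sum(s * d[m] for s, m in zip(signs, positions)) ** 2      # equation 7
    denominator = sum(phi[m][m] for m in positions)                       # diagonal terms
    for i in range(len(positions) - 1):
        for j in range(i + 1, len(positions)):
            # cross terms of equation 8, counted twice because phi is symmetric
            denominator += 2.0 * signs[i] * signs[j] * phi[positions[i]][positions[j]]
    return numerator / denominator


def f_dense(d, phi, c):
    """Reference: equation 6 evaluated over all N samples."""
    num = sum(a * b for a, b in zip(d, c)) ** 2
    den = sum(c[i] * phi[i][j] * c[j] for i in range(len(c)) for j in range(len(c)))
    return num / den


# Toy data: N = 8, a symmetric positive definite phi, four pulses.
n = 8
d = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 0.9]
phi = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
signs, positions = (1, -1, 1, -1), (0, 2, 5, 7)
c = [0.0] * n
for s, m in zip(signs, positions):
    c[m] = float(s)
fast, reference = f_pulses(d, phi, signs, positions), f_dense(d, phi, c)
```

The dense form costs on the order of N squared multiplications per candidate, while the pulse form costs a fixed handful.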
In the above described ACELP system, the memory requirements and the amount of operations can be successfully reduced. However, the number of non-zero samples in a frame is fixed to four, and the sample positions are restricted to candidates set at equal intervals. Consequently, the bit rate needed to represent the code vector index is determined by two parameters, that is, the frame length parameter and the non-zero sample number parameter, and a comparatively large number of bits is required to express a code vector index.
For example, when one frame contains 40 samples according to the standard G.729 of the ITU-T, a total of 17 bits are used as a code vector index as shown in the table 77 of FIG. 4. This number of bits corresponds to 42% of the total transmission capacity (8 kbits/sec, 80 bits/10 msec) prescribed by G.729.
If one frame contains 80 samples, the number of bits required to express the position information about each non-zero sample is larger by one than in the above described case. Therefore, a total of 21 bits are used as a code vector index. The number of bits corresponds to 62.5% of the total transmission capacity prescribed by G.729, and is much larger than in the case where one frame contains 40 samples.
Normally, to realize a very low bit rate voice CODEC at about 4 kbits/sec, the frame length should be extended. However, when the above described conventional ACELP system is applied under this requirement, the transmission bit rate of the code vector index increases considerably. That is, the conventional ACELP system has the problem that it hinders the demand to lower the bit rate by decreasing the number of parameter transmission bits per unit time through higher transmission efficiency.
In addition to this problem, the conventional ACELP system also has the problem that the ability to identify a pitch period shorter than a frame length is lowered when the frame length is extended.