This application is the national phase under 35 U.S.C. § of prior PCT International Application No. PCT/JP97/03366, which has an international filing date of Sep. 24, 1997 and which designated the United States of America.
This invention relates to a method and apparatus for speech encoding, which compression-encodes a speech signal into a digital signal, and for speech decoding, which expansion-decodes the digital signal back into a speech signal. In addition, this invention relates to a method and apparatus for speech coding/decoding in which the speech encoding and the speech decoding are combined.
In many conventional speech coding/decoding apparatuses, an input speech is divided into spectrum-envelope information and an excitation signal. Then, the excitation signal is encoded per frame, and the encoded excitation signal is decoded to generate an output speech.
The spectrum-envelope information represents the general shape of the amplitude (power) spectrum of the speech signal. The excitation signal is the energy source for generating speech. In speech coding and speech synthesis, the excitation signal is approximated by a periodic pattern or a periodic series of pulses. Many improvements have been made, especially to the method of excitation signal coding/decoding, in order to enhance the quality of coding/decoding. A speech coding/decoding apparatus applying “celp” (code-excited linear predictive coding) is known as the most typical speech coding/decoding apparatus.
FIG. 13 shows a whole configuration of the conventional speech coding/decoding apparatus applying celp. In FIG. 13, a coding unit 1, decoding unit 2, multiplexing unit 3, separating unit 4, input speech 5, code 6 and an output speech 7 are shown. The coding unit 1 is composed of a linear prediction analyzing unit 8, linear predictive coefficient coding unit 9, adaptive excitation coding unit 10, stochastic excitation coding unit 11 and a gain coding unit 12. The decoding unit 2 is composed of a linear predictive coefficient decoding unit 13, synthesis filter 14, adaptive excitation decoding unit 15, stochastic excitation decoding unit 16 and a gain decoding unit 17.
A speech of around 5 to 50 ms long is defined as a frame in the conventional speech coding/decoding apparatus. The speech in the frame is divided into spectrum-envelope information and an excitation signal in order to be encoded.
The operation of the conventional speech coding/decoding apparatus will now be described. First, in the coding unit 1, the linear prediction analyzing unit 8 analyzes the input speech 5, and extracts a linear predictive coefficient which is the spectrum-envelope information of the speech. The linear predictive coefficient coding unit 9 encodes the linear predictive coefficient, and outputs the encoded code to the multiplexing unit 3 as a coded linear predictive coefficient 18 for excitation signal encoding.
Referring to FIGS. 20, 21 and 22, the excitation signal encoding is now explained. As shown in FIG. 20, a plurality of old excitation signals (that is, S old excitation signals) is stored as adaptive excitations 113 corresponding to adaptive excitation codes 111 in an adaptive excitation codebook 110 of the adaptive excitation coding unit 10. A time series vector 114 is generated by periodically repeating the adaptive excitation 113, that is, the old excitation signal, corresponding to each adaptive excitation code 111. Then, a temporary synthetic signal 116 is generated by multiplying each time series vector 114 by an appropriate gain “g” and filtering the multiplied time series vector 114 by using a synthesis filter 115 in which the coded linear predictive coefficient 18 is used. An error signal 118 is obtained based on the difference between the temporary synthetic signal 116 and the input speech 5 to calculate the distance between the temporary synthetic signal 116 and the input speech 5. This process is repeated S times by using each adaptive excitation 113. Then, the adaptive excitation code 111 which makes the distance shortest is selected. The time series vector 114 corresponding to the selected adaptive excitation code 111 is output as the adaptive excitation 113, and the error signal 118 corresponding to the selected adaptive excitation code 111 is also output.
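The adaptive excitation search described above can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the conventional apparatus: each adaptive excitation code is assumed to map to a repetition period (`lags`, a hypothetical table), the "appropriate gain" is assumed to be the least-squares optimal gain, and the synthesis filter is assumed to be an all-pole filter 1/A(z) with A(z) = 1 + a1 z^-1 + ... starting from a zero state. All function and parameter names are hypothetical.

```python
def repeat_periodically(old_excitation, lag, frame_len):
    """Time series vector: the last `lag` old samples repeated to frame length."""
    segment = old_excitation[-lag:]
    return [segment[n % lag] for n in range(frame_len)]


def synthesize(excitation, lpc):
    """All-pole synthesis filter (assumed form): y[n] = e[n] - sum_i lpc[i] * y[n-1-i]."""
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for i, a in enumerate(lpc):
            if n - 1 - i >= 0:
                acc -= a * out[n - 1 - i]
        out.append(acc)
    return out


def search_adaptive_codebook(old_excitation, lags, lpc, target):
    """Try every code, keep the one whose synthetic signal is closest to the target."""
    best = None
    for code, lag in enumerate(lags):
        vector = repeat_periodically(old_excitation, lag, len(target))
        synthetic = synthesize(vector, lpc)
        energy = sum(s * s for s in synthetic)
        corr = sum(t * s for t, s in zip(target, synthetic))
        gain = corr / energy if energy > 0 else 0.0  # assumed: least-squares gain
        error = [t - gain * s for t, s in zip(target, synthetic)]
        distance = sum(e * e for e in error)  # squared-error distance
        if best is None or distance < best["distance"]:
            best = {"code": code, "vector": vector, "gain": gain,
                    "error": error, "distance": distance}
    return best
```

The returned `error` plays the role of the error signal that the subsequent stochastic excitation search tries to match.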
As shown in FIG. 21, a plurality of stochastic excitations 133 (that is, T stochastic excitations) corresponding to stochastic excitation codes 131 is stored in a stochastic excitation codebook 130 of the stochastic excitation coding unit 11. A temporary synthetic signal 136 is generated by multiplying each stochastic excitation 133 by the appropriate gain “g” and filtering the multiplied stochastic excitation 133 by using a synthesis filter 135 in which the coded linear predictive coefficient 18 is used. The distance between the temporary synthetic signal 136 and the error signal 118 is calculated. This process is repeated T times by using each stochastic excitation 133. Then, the stochastic excitation code 131 which makes the distance shortest is selected, and the stochastic excitation 133 corresponding to the selected stochastic excitation code 131 is also output.
As shown in FIG. 22, a plurality of gain groups (that is, U gain groups) corresponding to gain codes 151 is stored in a gain codebook 150 of the gain coding unit 12. A gain vector 154 (g1, g2) corresponding to each gain code 151 is generated. A temporary synthetic signal 156 is generated by multiplying the adaptive excitation 113 (time series vector 114) by the element g1 of each gain vector 154 using a multiplier 166, multiplying the stochastic excitation 133 by the element g2 of each gain vector 154 using a multiplier 167, adding the multiplied values using an adder 968, and filtering the added value by using a synthesis filter in which the coded linear predictive coefficient 18 is used. The distance between the temporary synthetic signal 156 and the input speech 5 is calculated. This process is repeated U times by using each gain vector. Then, the gain code 151 which makes the distance shortest is selected. An excitation signal 163 is generated by multiplying the adaptive excitation 113 by the element g1 of the gain vector 154 corresponding to the selected gain code 151, multiplying the stochastic excitation 133 by the element g2 of the gain vector 154 corresponding to the selected gain code 151, and adding the multiplied values. The adaptive excitation coding unit 10 updates the adaptive excitation codebook 110 by using the excitation signal 163.
The multiplexing unit 3 multiplexes the coded linear predictive coefficient 18, adaptive excitation code 111, stochastic excitation code 131 and the gain code 151 and outputs the multiplexed value as the code 6. The separating unit 4 separates the code 6 into the coded linear predictive coefficient 18, adaptive excitation code 111, stochastic excitation code 131 and the gain code 151.
In the decoding unit 2, the linear predictive coefficient decoding unit 13 decodes a linear predictive coefficient out of the coded linear predictive coefficient 18 and sets the decoded coefficient as a coefficient of the synthesis filter 14. The adaptive excitation decoding unit 15 stores old excitation signals in an adaptive excitation codebook, and outputs a time series vector 128 made by periodically repeating plural old excitation signals corresponding to an adaptive excitation code. The stochastic excitation decoding unit 16 stores plural stochastic excitations in a stochastic excitation codebook, and outputs a time series vector 148 corresponding to a stochastic excitation code. The gain decoding unit 17 stores plural gain groups in a gain codebook and outputs a gain vector 168 corresponding to a gain code. In the decoding unit 2, an excitation signal 198 is generated by multiplying the time series vector 128 by the element g1 of the gain vector, multiplying the time series vector 148 by the element g2 of the gain vector, and adding the multiplied values. This excitation signal 198 is filtered by using the synthesis filter 14 to be the output speech 7. Then, the adaptive excitation codebook in the adaptive excitation decoding unit 15 is updated by using the generated excitation signal 198.
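The decoding steps above (gain scaling, addition, synthesis filtering and codebook update) can be illustrated with a minimal sketch. It assumes an all-pole synthesis filter of the form y[n] = e[n] - sum_i a[i] y[n-1-i] and an adaptive codebook kept as a fixed-length list of the most recent excitation samples; the function name and parameters are hypothetical and are not taken from the apparatus described here.

```python
def decode_frame(adaptive_vec, stochastic_vec, gain_vector, lpc,
                 codebook_history, past_output, max_history):
    """Rebuild one frame of output speech from already-decoded parameters."""
    g1, g2 = gain_vector
    # Excitation signal: g1 * adaptive time series + g2 * stochastic time series.
    excitation = [g1 * a + g2 * c for a, c in zip(adaptive_vec, stochastic_vec)]

    # Synthesis filter (assumed all-pole form), continuing from past output samples.
    memory = list(past_output)
    speech = []
    for e in excitation:
        acc = e
        for i, a in enumerate(lpc):
            idx = len(memory) - 1 - i
            if idx >= 0:
                acc -= a * memory[idx]
        memory.append(acc)
        speech.append(acc)

    # Update the adaptive codebook with the newly generated excitation.
    new_history = (list(codebook_history) + excitation)[-max_history:]
    return excitation, speech, new_history
```

Because the encoder performs the same codebook update with its own excitation signal, both sides keep identical adaptive codebooks as long as the code is received without error.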
A speech coding/decoding apparatus applying celp, wherein a pulse excitation is utilized for encoding a stochastic excitation mainly in order to reduce the calculation amount and the memory amount, is disclosed in an article by Akitoshi Kataoka, Shinji Hayashi, Takehiro Moriya, Syoko Kurihara and Kazunori Mano entitled “Basic Algorithm of Conjugate-Structure Algebraic CELP (CS-ACELP) Speech Coder” in NTT R&D, Vol. 45 (April 1996), pp. 325-330. (This article is hereinafter called “article 1”.)
FIG. 14 shows the configuration of the stochastic excitation coding unit 11 used in the conventional speech coding/decoding apparatus disclosed in article 1. The whole configuration of the speech coding/decoding apparatus is the same as FIG. 13. In FIG. 14, the coded linear predictive coefficient 18, a stochastic excitation code 19 which corresponds to the stochastic excitation code 131, an encoding-target signal 20 which corresponds to the error signal 118, an impulse response calculating unit 21, a pulse position search unit 22 and a pulse position codebook 23 are shown. The encoding-target signal 20 corresponds to the error signal 118, as shown in FIG. 21, made by multiplying (the time series vector 114 of) the adaptive excitation 113 by an appropriate gain, filtering the multiplied vector by using the synthesis filter 115, and subtracting the filtered signal from the input speech 5.
FIG. 15 is the pulse position codebook 23, used in article 1, showing examples of the range and the number of bits of a pulse position code 230.
In article 1, the excitation signal encoding frame is 40 samples long, and the stochastic excitation is composed of four pulses. As shown in FIG. 15, the pulse positions of the number 1 pulse through the number 3 pulse are each restricted to eight positions. Because there are eight pulse positions, 0 through 7, each of these pulse positions can be encoded by 3 bits. The pulse positions of the number 4 pulse are restricted to sixteen positions. Because there are sixteen pulse positions, 0 through 15, this pulse position can be encoded by 4 bits. The pulse position codes indicating the four pulse positions form a codeword of 13 bits (3+3+3+4). Restricting the pulse positions lessens the number of bits for encoding and the number of combinations, so the calculation amount is decreased while deterioration of the coding characteristic is suppressed.
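The bit count and the number of combinations follow directly from the number of allowed positions per pulse, as the short sketch below shows. The position tables in `TRACKS` are an assumption: FIG. 15 is not reproduced in this text, so an interleaved layout over the 40-sample frame is used here, chosen only because it agrees with the example given later in this description (pulse position code [5, 3, 0, 14] for pulse positions [25, 16, 2, 34]).

```python
from math import prod

# Assumed position tables (hypothetical stand-in for FIG. 15).
TRACKS = [
    list(range(0, 40, 5)),                          # pulse 1: 8 positions
    list(range(1, 40, 5)),                          # pulse 2: 8 positions
    list(range(2, 40, 5)),                          # pulse 3: 8 positions
    list(range(3, 40, 5)) + list(range(4, 40, 5)),  # pulse 4: 16 positions
]

# Bits needed to index each table, total codeword length, and search size.
bits_per_pulse = [(len(track) - 1).bit_length() for track in TRACKS]
total_bits = sum(bits_per_pulse)
combinations = prod(len(track) for track in TRACKS)


def decode_positions(position_code):
    """Map one index per pulse to a sample position in the frame."""
    return [track[index] for track, index in zip(TRACKS, position_code)]
```

With these tables, an exhaustive search has to evaluate every element of the Cartesian product of the four tables, which is why the restriction on positions matters for the calculation amount.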
Referring to FIGS. 23, 24 and 25, the operation of the stochastic excitation coding unit 11 in the above conventional speech coding/decoding apparatus will now be described.
The impulse response calculating unit 21 generates an impulse signal 210 as shown in FIG. 25, in an impulse signal generating unit 218. An impulse response 214 for the impulse signal 210 is calculated by using a synthesis filter 211 whose filter coefficient is the coded linear predictive coefficient 18.
A perceptual weighting unit 212 performs a perceptual weighting process for the impulse response 214, and outputs a perceptually weighted impulse response 215. The pulse position search unit 22 reads a pulse position (e.g., [25, 16, 2, 34] in FIG. 15) stored in the pulse position codebook 23 one by one. The pulse position corresponds to a pulse position code 230 shown in FIG. 15 (e.g., [5, 3, 0, 14] in FIG. 23). A temporary pulse excitation 172 is generated by setting pulses having a fixed amplitude and an appropriate sign based on sign information 231 (e.g., [0, 0, 1, 1]: 1 indicates positive, 0 indicates negative) at the read pulse positions ([25, 16, 2, 34]) of a specific number (four). A temporary synthetic signal 174 is generated by convolving the temporary pulse excitation 172 with the impulse response 215. Then the distance between the temporary synthetic signal 174 and the encoding-target signal 20 is calculated. This calculation is performed 8192 times (8×8×8×16) for all the combinations of the pulse positions. The pulse position code 230 (e.g., [5, 3, 0, 14]) which makes the distance shortest is combined with the sign information 231 (e.g., [0, 0, 1, 1]) for each pulse. Then, the combined value is output as the stochastic excitation code 19, which corresponds to the stochastic excitation code 131 in FIG. 13. The temporary pulse excitation 172 (which corresponds to the stochastic excitation 133 in FIG. 13) corresponding to the selected pulse position code 230 is output to the gain coding unit 12 in the coding unit 1.
In article 1, the temporary pulse excitation 172 and the temporary synthetic signal 174 are not actually generated, but a correlation function between an impulse response and the encoding-target signal 20, and a mutual correlation function between impulse responses are calculated in advance for the purpose of reducing the calculation amount at the pulse position search unit 22. Calculation for obtaining the distance is performed by simply adding these calculated results of the correlation functions.
The distance calculation method will now be explained. Obtaining the shortest distance is equivalent to obtaining the largest D in the following expression (1). The shortest distance is found by calculating D for all the combinations of pulse positions.

$$D = \frac{C^{2}}{E} \qquad (1)$$

$$C = \sum_{k} g(k)\, d(m(k)) \qquad (2)$$

$$E = \sum_{k} \sum_{i} g(k)\, g(i)\, \phi(m(k), m(i)) \qquad (3)$$

m(k): pulse position of kth pulse
g(k): pulse amplitude of kth pulse
d(x): correlation between impulse response and input speech when an impulse is set at pulse position x
φ(x,y): correlation between an impulse response when an impulse is set at pulse position x and an impulse response when an impulse is set at pulse position y
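The equivalence between the shortest distance and the largest D can be made explicit with a short derivation. The symbols x(n) (the signal to be matched), h(n) (the perceptually weighted impulse response), y(n) (the synthetic signal before the overall gain) and G (the overall gain) are introduced here only for illustration and do not appear in the description itself; the distance is assumed to be the squared error with G chosen optimally.

```latex
\begin{aligned}
\varepsilon(G) &= \sum_{n} \bigl( x(n) - G\, y(n) \bigr)^{2},
  \qquad y(n) = \sum_{k} g(k)\, h\bigl(n - m(k)\bigr) \\
G_{\mathrm{opt}} &= \frac{C}{E},
  \qquad C = \sum_{n} x(n)\, y(n), \quad E = \sum_{n} y(n)^{2} \\
\varepsilon_{\min} &= \sum_{n} x(n)^{2} - \frac{C^{2}}{E}
  = \sum_{n} x(n)^{2} - D
\end{aligned}
```

Since the first term does not depend on the pulses, minimizing the distance is the same as maximizing D. Substituting y(n) into C and E gives expressions (2) and (3), with d(p) = Σ_n x(n) h(n − p) and φ(p, q) = Σ_n h(n − p) h(n − q).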
In the pulse position search unit 22 of article 1, the expressions (2) and (3) are simplified by defining that g(k) has the same sign (positive or negative) as d(m(k)) and the absolute value of g(k) is 1. Then, the simplified expressions (2) and (3) become as follows:

$$C = \sum_{k} d'(m(k)) \qquad (4)$$

$$E = \sum_{k} \sum_{i} \phi'(m(k), m(i)) \qquad (5)$$

$$d'(m(k)) = \lvert d(m(k)) \rvert \qquad (6)$$

$$\phi'(m(k), m(i)) = \operatorname{sign}[g(k)]\, \operatorname{sign}[g(i)]\, \phi(m(k), m(i)) \qquad (7)$$

If d′ and φ′ are calculated before the calculation of D begins for all the pulse position combinations, D is obtained with only a small amount of calculation, namely the simple additions in the expressions (4) and (5).
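A minimal sketch of this precompute-then-add search is given below. It is an illustration under assumptions rather than the procedure of article 1 itself: the correlations are computed over a single frame with a truncated impulse response `h` (at least as long as the target `x`), the position tables `tracks` are supplied by the caller, and all function names are hypothetical.

```python
from itertools import product


def correlations(h, x):
    """d(p): correlation of target x with h shifted to p; phi(p, q): between shifted h's."""
    n = len(x)
    d = [sum(x[i] * h[i - p] for i in range(p, n)) for p in range(n)]
    phi = [[sum(h[i - p] * h[i - q] for i in range(max(p, q), n))
            for q in range(n)] for p in range(n)]
    return d, phi


def search_pulse_positions(h, x, tracks):
    """Exhaustive search maximizing D = C^2 / E using expressions (4) to (7)."""
    d, phi = correlations(h, x)
    n = len(x)
    sign = [1.0 if v >= 0 else -1.0 for v in d]      # g(k) takes the sign of d(m(k))
    d_abs = [abs(v) for v in d]                      # expression (6)
    phi_signed = [[sign[p] * sign[q] * phi[p][q]     # expression (7)
                   for q in range(n)] for p in range(n)]

    best_d, best_positions = -1.0, None
    for positions in product(*tracks):
        c = sum(d_abs[p] for p in positions)                              # expression (4)
        e = sum(phi_signed[p][q] for p in positions for q in positions)   # expression (5)
        if e > 0 and c * c / e > best_d:
            best_d, best_positions = c * c / e, positions
    return best_positions, [sign[p] for p in best_positions], best_d
```

Inside the loop only table lookups and additions remain, apart from one multiplication and one division per combination; no excitation or synthetic signal is ever generated.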
FIG. 16 is an illustration explaining the temporary pulse excitation 172 generated in the pulse position search unit 22. The sign of a pulse is defined depending on whether the correlation d(x) shown in (a) of FIG. 16 is positive or negative. The amplitude of the pulse is fixed to be 1. In the case that d(m(k)) is positive, a pulse whose amplitude is (+1) is set at the pulse position m(k). In the case that d(m(k)) is negative, a pulse whose amplitude is (−1) is set at the pulse position m(k). (b) of FIG. 16 shows the temporary pulse excitation 172 corresponding to the d(x) in (a) of FIG. 16.
The pulse excitation wherein high speed search can be performed by restricting the pulse positions is called “Excitation Signal applying Algebraic Code”. This pulse excitation is hereinafter called “algebraic excitation”. A speech coding/decoding apparatus applying the algebraic code for improving the speech coding characteristic is disclosed in an article by Kazunori Ozawa, Shinichi Taumi, and Toshiyuki Nomura entitled “MP-CELP Speech Coding based on Multi-Pulse Vector Quantization and Fast Search” presented in theses by the Institute of Electronics, Information and Communication Engineers, Vol. J79-A, No. 10 (October 1996), pp. 1655-1663. (This article is hereinafter called “article 2”.)
FIG. 17 shows the whole configuration of this conventional speech coding/decoding apparatus. In FIG. 17, a mode identifying unit 24, first pulse excitation coding unit 25, first gain coding unit 26, second pulse excitation coding unit 27, second gain coding unit 28, first pulse excitation decoding unit 29, first gain decoding unit 30, second pulse excitation decoding unit 31 and a second gain decoding unit 32 are shown. Reference numbers in FIG. 17 labeled correspondingly to FIG. 13 are omitted.
Comparing with FIG. 13, operations of newly added configurations in the speech coding/decoding apparatus will be described below.
The mode identifying unit 24 identifies a mode for excitation signal encoding based on an average pitch predictive gain, that is the rate of periodicity, and outputs the identification result as mode information. When the pitch periodicity is high, excitation signal coding is performed by using the first excitation signal coding mode meaning the adaptive excitation coding unit 10, the first pulse excitation coding unit 25 and the first gain coding unit 26. When the pitch periodicity is low, excitation signal coding is performed by using the second excitation signal coding mode meaning the second pulse excitation coding unit 27 and the second gain coding unit 28.
The first pulse excitation coding unit 25 generates a temporary pulse excitation corresponding to each pulse excitation code. Then, the temporary pulse excitation and an adaptive excitation output from the adaptive excitation coding unit 10 are multiplied by an appropriate gain. The multiplied signals are filtered by using a synthesis filter, in which a linear predictive coefficient output from the linear predictive coefficient coding unit 9 is used, in order to generate a temporary synthetic signal. A distance between the temporary synthetic signal and the input speech 5 is calculated, and pulse excitation code candidates are searched in the order of distance from the shortest to the farthest. A temporary pulse excitation corresponding to each pulse excitation code candidate is output.
The first gain coding unit 26 generates a gain vector corresponding to each gain code. Then, the adaptive excitation and the temporary pulse excitation are multiplied by each element of each gain vector, and the multiplied signals are added. The added signal is filtered by using a synthesis filter, in which a linear predictive coefficient output from the linear predictive coefficient coding unit 9 is used, in order to generate a temporary synthetic signal. A distance between the temporary synthetic signal and the input speech 5 is calculated. The temporary pulse excitation code and the gain code, which make the distance shortest, are selected. The selected gain code and a pulse excitation code corresponding to the selected temporary pulse excitation are output.
The second pulse excitation coding unit 27 generates a temporary pulse excitation corresponding to each pulse excitation code. Then, the temporary pulse excitation is multiplied by an appropriate gain. The multiplied temporary pulse excitation is filtered by using the synthesis filter, in which a linear predictive coefficient output from the linear predictive coefficient coding unit 9 is used, in order to generate a temporary synthetic signal. A distance between the temporary synthetic signal and the input speech 5 is calculated. The pulse excitation code that makes the distance shortest is selected. In addition, pulse excitation code candidates are searched in the order of distance from the shortest to the farthest. A temporary pulse excitation corresponding to each pulse excitation code candidate is output.
The second gain coding unit 28 generates a temporary gain value corresponding to each gain code. Then, the temporary pulse excitation is multiplied by each gain value. The multiplied signal is filtered by using the synthesis filter, in which a linear predictive coefficient output from the linear predictive coefficient coding unit 9 is used, in order to generate a temporary synthetic signal. A distance between the temporary synthetic signal and the input speech 5 is calculated. A temporary pulse excitation and a gain code which make the distance shortest are selected. The selected gain code and a pulse excitation code corresponding to the selected temporary pulse excitation are output.
The multiplexing unit 3, in the case of the first excitation signal coding mode being used, multiplexes a linear predictive coefficient code, mode information, an adaptive excitation code, a pulse excitation code and a gain code, and outputs the multiplexed value as the code 6. In the case of the second excitation signal coding mode being used, the multiplexing unit 3 multiplexes the linear predictive coefficient code, the mode information, the pulse excitation code and the gain code, and outputs the multiplexed value as the code 6.
The separating unit 4, when the mode information is in the first excitation signal coding mode, separates the code 6 into the linear predictive coefficient code, the mode information, the adaptive excitation code, the pulse excitation code and the gain code. When the mode information is in the second excitation signal coding mode, the separating unit 4 separates the code 6 into the linear predictive coefficient code, the mode information, the pulse excitation code and the gain code.
In the case that the mode information is in the first excitation signal coding mode, the first pulse excitation decoding unit 29 outputs a pulse excitation corresponding to the pulse excitation code, and the first gain decoding unit 30 outputs a gain vector corresponding to the gain code. An excitation signal is generated in the decoding unit 2 by multiplying an output from the adaptive excitation decoding unit 15 by an element of the gain vector, multiplying the pulse excitation by the other element of the gain vector, and adding the multiplied values. This excitation signal is filtered by using the synthesis filter 14 to be the output speech 7.
In the case that the mode information is in the second excitation signal coding mode, the second pulse excitation decoding unit 31 outputs a pulse excitation corresponding to the pulse excitation code, and the second gain decoding unit 32 outputs a gain value corresponding to the gain code. An excitation signal is generated in the decoding unit 2 by multiplying the pulse excitation by the gain value. This excitation signal is filtered by using the synthesis filter 14 to be the output speech 7.
FIG. 18 shows the configuration of the first pulse excitation coding unit 25 or the second pulse excitation coding unit 27 in the above speech coding/decoding apparatus. In FIG. 18, a coded linear predictive coefficient 33, a pulse excitation code candidate 34, an encoding-target signal 35, an impulse response calculating unit 36, a pulse position candidate search unit 37, a pulse amplitude candidate search unit 38 and a pulse amplitude codebook 39 are shown.
The encoding-target signal 35, in the first pulse excitation coding unit 25, indicates a signal obtained by multiplying an adaptive excitation by an appropriate gain and subtracting the multiplied signal from the input speech 5. The encoding-target signal 35, in the second pulse excitation coding unit 27, indicates the input speech 5 itself. The pulse position codebook 23 is the same as shown in FIGS. 14 and 15.
The impulse response calculating unit 36 calculates an impulse response of a synthesis filter whose filter coefficient is the coded linear predictive coefficient 33, and performs a perceptual weighting process for the impulse response. When the adaptive excitation code obtained in the adaptive excitation coding unit 10, that is a pitch period length, is shorter than a (sub)frame length being a basic unit for excitation signal coding, the above impulse response is filtered through a pitch filter.
The pulse position candidate search unit 37 reads a pulse position stored in the pulse position codebook 23 one by one, and generates a temporary pulse excitation by setting a pulse which has a fixed amplitude and an appropriate sign, at the read pulse positions of specific number. A temporary synthetic signal is generated by convolutionally calculating the temporary pulse excitation and the impulse response. Then, a distance between the temporary synthetic signal and the encoding-target signal 35 is calculated. Some combinations of pulse position candidates are searched in the order of distance from the shortest to the farthest, and output. However, similar to article 1, the temporary excitation signal and the temporary synthetic signal are not actually generated, but a correlation function between an impulse response and the encoding-target signal 35, and a mutual correlation function between impulse responses are calculated in advance. The calculation for obtaining the distance is performed by simply adding these calculated results of the correlation functions. The pulse amplitude candidate search unit 38 reads a pulse amplitude vector in the pulse amplitude codebook 39 one by one, calculates D in the expression (1) by using each of the pulse position candidates and this pulse amplitude vector. Then, some combinations of pulse position candidate and pulse amplitude candidate are selected in order of the value of D, from large to small, and output as the pulse excitation candidates 34.
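The second stage, in which each surviving pulse position candidate is combined with every pulse amplitude vector and ranked by D, might be sketched as follows. This is a hedged illustration: it assumes that each codebook entry holds amplitude magnitudes and that the pulse sign is still taken from d(m(k)), which this description does not state for article 2. The function name, the argument layout and the use of `heapq.nlargest` to keep several candidates are likewise assumptions.

```python
import heapq


def rank_amplitude_candidates(d, phi, position_candidates, amplitude_codebook, keep):
    """Score every (positions, amplitude vector) pair with expressions (1) to (3)."""
    scored = []
    for positions in position_candidates:
        signs = [1.0 if d[p] >= 0 else -1.0 for p in positions]
        for amp_code, magnitudes in enumerate(amplitude_codebook):
            g = [s * a for s, a in zip(signs, magnitudes)]      # signed amplitudes g(k)
            c = sum(gk * d[p] for gk, p in zip(g, positions))   # expression (2)
            e = sum(gk * gi * phi[p][q]                         # expression (3)
                    for gk, p in zip(g, positions)
                    for gi, q in zip(g, positions))
            if e > 0:
                scored.append((c * c / e, positions, amp_code))  # expression (1)
    # Keep the `keep` best combinations, ordered from large D to small D.
    return heapq.nlargest(keep, scored, key=lambda item: item[0])
```

Keeping several candidates instead of a single winner leaves the final choice to the gain coding stage, which evaluates each candidate together with the gain codes.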
FIG. 19 is an illustration explaining a temporary pulse excitation generated in the pulse position candidate search unit 37, and a temporary pulse excitation to which a pulse amplitude is added in the pulse amplitude candidate search unit 38. (a) and (b) of FIG. 19 are the same as (a) and (b) of FIG. 16. (c) of FIG. 19 shows a result of an amplitude being added to the temporary excitation signal, by using a pulse amplitude vector, in the pulse amplitude candidate search unit 38.
A conventional speech coding/decoding apparatus, in which the encoding information amount of the algebraic excitation is effectively reduced, is disclosed in an article by Hiroyuki Ehara, Kouji Yoshida, and Toshio Yagi, entitled “A Study on Phase Adaptive Pulse-Search in CELP Coding” in Japan Acoustic Association Theses, Vol. 1 (September 1996), pp. 273-274. (This article is hereinafter called “article 3”.) In article 3, an algebraic excitation is made to form pitch periods by using an adaptive excitation code indicating the pitch period length. When the timewise lag (phase) of the algebraic excitation is adapted based on peak position information of the pitch waveform of an adaptive excitation, the pulse positions of the algebraic excitation are not selected uniformly. Using this fact, the amount of information for the pulse positions is reduced by removing rarely selected pulse positions.
A conventional speech coding/decoding apparatus, in which the amount of necessary information for an excitation signal is reduced by making the excitation signal composed of plural pulses form pitch periods, is disclosed in an article by Kazunori Ozawa and Suguru Kouseki, entitled “4.8 kb/s Multi-pulse Excited Speech Coder” in Japan Acoustic Association Theses, Vol. 1 (September 1985), pp. 203-204. (This article is hereinafter called “article 4”.)
In article 4, a frame is divided into subframes per pitch period, an excitation signal of each subframe is represented by pulses of a specific number, and one subframe in the frame is selected. An excitation signal for the whole frame is generated by pitch-periodically repeating the pulse excitation of the selected subframe. Then, the subframe which generates the best synthetic signal for the whole frame is chosen as the selected period, and the pulse information of the selected period is encoded. The number of pulses in one frame is fixed to be four so as to fix the information amount of excitation signal coding in each frame.
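The pitch-periodic repetition of a small set of pulses can be illustrated with the following sketch. It is a simplification under assumptions: the selected period is taken to start at the beginning of the frame, pulses are given as (offset within the period, amplitude) pairs, and the function name is hypothetical. How article 4 handles a selected period located in the middle of the frame is not described here and is not modeled.

```python
def periodic_pulse_excitation(pulses, pitch_period, frame_len):
    """Repeat the pulses of one pitch period across the whole frame."""
    excitation = [0.0] * frame_len
    for offset, amplitude in pulses:
        # Place the same pulse once per pitch period until the frame ends.
        for position in range(offset, frame_len, pitch_period):
            excitation[position] += amplitude
    return excitation
```

Only the pulses of one period and the pitch period itself need to be transmitted, which is the source of the reduction in information amount.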
A conventional speech coding/decoding apparatus, where the quality of representing excitation is improved by giving characteristics of phase and excitation signal wave to the pulse excitation, is disclosed in an article by Shigeru Hosoi, Yoshio Sato, and Tadayoshi Makino, entitled xe2x80x9cA Study on Source of Pulse Excitation Codingxe2x80x9d represented in the theses A-254 by the Institute of Electronics, Information and Communication Engineers, (March 1992), (This article is hereinafter called xe2x80x9carticle 5 xe2x80x9d), and in an article by Tadashi Yamaura, and Shinya Takahashi, entitled xe2x80x9cImproving the Quality of CELP Coder at Low Bit Ratesxe2x80x9d represented in the theses by Japan Acoustic Association Vol.1 (October, November 1994), pp.263, 264. (This article is hereinafter called xe2x80x9carticle 6 xe2x80x9d)
In article 5, a fixed excitation signal wave characteristic is added to a pulse excitation. This is described as a “pulse waveform” in article 5. An excitation signal of (sub)frame length is generated by repeating the excitation signal wave with the (pitch) period given by the long-term prediction delay. The excitation signal gain and the head position of the excitation signal wave that minimize the distortion between the input speech and a synthetic signal based on the generated excitation signal are searched, and the search result is encoded.
In article 6, a quantized phase amplitude characteristic is added to an adaptive excitation and a pulse excitation. Filter coefficients for adding the phase amplitude characteristic, stored in a phase amplitude characteristic codebook, are read one by one. An excitation signal of frame length is obtained by adding the pulse excitation and the adaptive excitation, repeated with the lag (pitch) period of the adaptive excitation, and filtering for adding the phase amplitude characteristic and synthesis are performed on this excitation signal. Then, the phase amplitude characteristic code, the adaptive excitation code and the pulse excitation code that minimize the distance between the obtained synthetic signal and the input speech are output.
A conventional speech coding/decoding apparatus in which the coding quality for voiced sounds is improved by using a stochastic codebook that partially contains excitation signals made of a series of pulses is disclosed in an article by Gao Yang, H. Leich, and R. Boite, entitled “A Very High-Quality Celp Coder at the Rate of 2400 bps” in EUROSPEECH '91, pp.829-832. (This article is hereinafter called “article 7”.)
In article 7, one excitation signal codebook is composed of a series of pulses repeated with the pitch period (the lag length of the adaptive excitation), a series of pulses repeated with half the pitch period, and noise in which most samples are zero (sparse noise).
The conventional speech coding/decoding apparatuses disclosed in the above articles 1 through 7 have the following problems.
In the speech coding/decoding apparatus of article 1, a temporary excitation signal is generated by setting pulses that have a fixed amplitude and an appropriate sign, and the pulse positions are searched on that basis. Therefore, when an independent gain (amplitude) is given to each pulse in order to improve the coding quality, the fixed-amplitude approximation greatly affects the search result. Consequently, there is a problem that the most appropriate pulse positions cannot be found.
In order to suppress the effect of this approximation, article 2 keeps plural pulse position candidates and selects the most appropriate pulse positions based on combinations of each pulse position candidate with a pulse amplitude candidate. However, there is a problem that the calculation amount increases.
In the speech coding/decoding apparatus disclosed in article 2, the choice between the first excitation signal coding mode, which performs encoding by adding the adaptive excitation and the algebraic excitation, and the second excitation signal coding mode, which performs encoding using only the algebraic excitation, depends upon the degree of pitch periodicity. However, there are cases in which using the adaptive excitation is desirable even though the pitch periodicity is low, and cases in which encoding with only the algebraic excitation is desirable even though the pitch periodicity is high. Namely, there is a problem that the mode that gives the best coding characteristic cannot be identified.
An example of the case in which using the adaptive excitation is desirable even though the pitch periodicity is low is the case in which the pitch period is short and the number of pulses of the algebraic excitation is small, so that it is difficult to represent the excitation signal satisfactorily. This tendency becomes stronger as the amount of excitation signal encoding information or the number of pulses becomes smaller. An example of the case in which encoding with only the algebraic excitation is desirable even though the pitch periodicity is high is the case in which the pitch period is long, so that the excitation signal can be represented satisfactorily even when the number of pulses of the algebraic excitation is small. As these examples show, it is necessary to change the threshold for determining the mode adaptively, depending upon the pitch period and the number of pulses. However, in the speech coding/decoding apparatus of article 2, the threshold is not processed adaptively, so there is a problem that the mode that gives the best coding characteristic cannot be determined.
In the speech coding/decoding apparatus disclosed in article 3, the algebraic excitation is made to form pitch periods. However, because the pitch period is based on the adaptive excitation code, both the adaptive excitation and the algebraic excitation must always be used. Consequently, there is a problem that the speech coding characteristic deteriorates in parts where the adaptive excitation has a bad coding characteristic. For example, when the pitch periodicity of the excitation signal of the present frame is high but the excitation signal of the previous frame does not resemble that of the present frame, it is desirable that the algebraic excitation be made to form pitch periods even though the efficiency of the adaptive excitation is bad.
Even when such a part is encoded by using the second excitation signal coding mode of article 2, which encodes the excitation signal using only the algebraic excitation, the problem of a bad coding characteristic remains because the algebraic excitation is not made to form pitch periods. Separately encoding the pitch period would be one way of making the algebraic excitation in article 2 form pitch periods. However, there is a problem that the quality deteriorates, because the amount of information needed for encoding the pitch period is large and the number of pulses is small.
In the speech coding/decoding apparatus disclosed in article 3, the amount of information for the pulse positions is reduced by removing rarely selected pulse positions. However, when the pitch period is short, the coding information contains useless information because pulse positions that are never used exist.
In the speech coding/decoding apparatus disclosed in article 4, the pulse information of one subframe of pitch-period length, which represents the frame, is encoded, and the pulse excitation is made to form pitch periods. However, as in article 3, the coding information contains useless information, because a method of encoding pulse positions over a wide encoding range is always used even when the pitch period is short and the encoding range for pulse positions is small.
In the speech coding/decoding apparatus disclosed in article 5, an excitation signal of (sub)frame length is generated by repeating a fixed excitation signal wave with the pitch period. The excitation signal gain and the head position of the excitation signal wave that minimize the distortion between the input speech and a synthetic signal based on the generated excitation signal are searched. However, the calculation amount necessary for calculating the distance at each head position of the excitation signal wave is large. Under some conditions, it may be on the order of one hundred times the calculation amount in article 1. Therefore, in order to process within a practical time, it is necessary to keep the number of combinations of excitation signal positions small (equal to or less than one hundred), as disclosed in article 5. Namely, when the number of excitation signal position combinations is large (equal to or more than ten thousand), as when the excitation signal position in each pitch period can be determined separately, there is a problem that processing within a practical time is impossible.
In the speech coding/decoding apparatus disclosed in article 6, a quantized phase amplitude characteristic is added to the adaptive excitation and the pulse excitation. As in article 5, however, the amount of distance calculation at each excitation signal position is large. Therefore, the search calculation amount increases in proportion to the number of combinations of pulse positions, and when that number becomes large, there is a problem that processing within a practical time is impossible.
In the speech coding/decoding apparatus disclosed in article 7, the coding quality for voiced sounds is improved by using the stochastic codebook that partially contains excitation signals made of a series of pulses. However, only a series of pulses repeated with the pitch period, a series of pulses repeated with half the pitch period, and sparse noise can be represented. As only specific excitation signals can be represented, there is a problem that the coding characteristic deteriorates depending upon the input speech. In addition, the series of periodic pulse excitations requires as many codes as there are pulse head positions, that is, as many as the number of excitation signal samples. Namely, there is a problem that, in a small-sized codebook, a part of the codebook cannot be allotted to the series of pulse excitations.
In order to solve the above problems, this invention provides a speech coding apparatus, a speech decoding apparatus and a speech coding/decoding apparatus in which the coding characteristic is greatly improved when an input speech is divided into spectrum-envelope information and an excitation signal and encoded per frame.
A speech coding apparatus according to the present invention, which separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame, comprises
an excitation signal coding unit (11, 12) for encoding the excitation signal based on a plurality of excitation signal positions and a plurality of excitation signal gains. The excitation signal coding unit (11, 12) includes
a temporary gain calculating unit (40) for calculating a temporary gain for each of excitation signal position candidates,
an excitation signal position search unit (41) for determining each of the plurality of excitation signal positions based on the temporary gain, and
a gain coding unit (12) for encoding the plurality of excitation signal gains based on each of the plurality of excitation signal positions.
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal. The coding unit (1) of the speech coding/decoding apparatus comprises
an excitation signal coding unit (11, 12) for encoding the excitation signal based on a plurality of excitation signal positions and a plurality of excitation signal gains. The excitation signal coding unit (11, 12) includes
a temporary gain calculating unit (40) for calculating a temporary gain for each of excitation signal position candidates,
an excitation signal position search unit (41) for determining each of the plurality of excitation signal positions based on the temporary gain, and
a gain coding unit (12) for encoding the plurality of excitation signal gains based on each determined excitation signal position.
The decoding unit (2) of the speech coding/decoding apparatus comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal by decoding the plurality of excitation signal positions and the plurality of excitation signal gains.
A speech coding apparatus according to the present invention separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame. The speech coding apparatus comprises
an impulse response calculating unit (21) for calculating an impulse response of a synthesis filter, based on the spectrum-envelope information,
a phase adding filter (42) for giving a specific excitation signal phase characteristic to the impulse response, and
an excitation signal coding unit (22, 12) for encoding the excitation signal into a plurality of pulse excitation positions and a plurality of excitation signal gains, by using the impulse response to which the specific excitation signal phase characteristic has been added.
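Because the synthesis filter is linear, giving a phase characteristic to the impulse response once has the same effect on the synthesized signal as giving it to every pulse of the excitation. The following minimal sketch illustrates the role of the phase adding filter (42) under stated assumptions: the phase characteristic is modeled as a causal FIR filter, the result is truncated to a fixed length, and the function name and sample values are hypothetical.

```python
def add_phase_characteristic(impulse_response, phase_filter, length):
    """Convolve the synthesis-filter impulse response with a phase filter.

    ASSUMPTION: the phase characteristic is a causal FIR filter given by its
    coefficients; the output is truncated to `length` samples.
    """
    out = [0.0] * length
    for i, h_i in enumerate(impulse_response[:length]):
        for j, p_j in enumerate(phase_filter):
            if i + j < length:
                out[i + j] += h_i * p_j
    return out


# A pure one-sample delay as the (hypothetical) phase filter.
h = [1.0, 0.5, 0.25]
h_with_phase = add_phase_characteristic(h, [0.0, 1.0], length=4)
```

The pulse position search can then work on simple pulses while using `h_with_phase` in place of `h`, so the phase characteristic adds no cost per candidate position.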
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal. The coding unit (1) of the speech coding/decoding apparatus comprises
an impulse response calculating unit (21) for calculating an impulse response of a synthesis filter, based on the spectrum-envelope information,
a phase adding filter (42) for giving a specific excitation signal phase characteristic to the impulse response, and
an excitation signal coding unit (22, 12) for encoding the excitation signal into a plurality of pulse excitation positions and a plurality of excitation signal gains, based on the impulse response to which the specific excitation signal phase characteristic has been added. The decoding unit (2) of the speech coding/decoding apparatus comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal by decoding the plurality of pulse excitation positions and the plurality of excitation signal gains.
A speech coding apparatus according to the present invention separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame. The speech coding apparatus comprises
an excitation signal coding unit (11, 12) for encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains. The excitation signal coding unit (11, 12) includes
a plurality of excitation signal position candidate tables (51, 52), one of which is selected to be used when the pitch period is equal to or less than a specific value.
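The switching between excitation signal position candidate tables can be pictured as follows. This is a minimal sketch under stated assumptions: the frame length, the threshold, the track layout, and all names are hypothetical and are not the contents of the tables (51, 52).

```python
FRAME_LENGTH = 32        # hypothetical frame length in samples
SHORT_PITCH_LIMIT = 16   # hypothetical "specific value" for the pitch period

# Default table: four tracks, eight candidate positions each, over the frame.
DEFAULT_TABLE = [list(range(track, FRAME_LENGTH, 4)) for track in range(4)]

# Short-pitch table: the same number of candidates per track, but all of them
# lie inside the first SHORT_PITCH_LIMIT samples, so no code points outside it.
SHORT_PITCH_TABLE = [list(range(track % 2, SHORT_PITCH_LIMIT, 2))
                     for track in range(4)]


def select_position_table(pitch_period):
    """Return the candidate table to use for the given pitch period."""
    if pitch_period <= SHORT_PITCH_LIMIT:
        return SHORT_PITCH_TABLE
    return DEFAULT_TABLE


table = select_position_table(pitch_period=12)
```

In this sketch both tables use eight candidates per track, so the number of bits per pulse position is unchanged while every code remains usable when the pitch period is short.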
A speech decoding apparatus according to the present invention which generates an output speech by decoding an excitation signal encoded at each frame, comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal by decoding a plurality of pulse excitation positions and a plurality of excitation signal gains. The excitation signal decoding unit (16, 17) includes
a plurality of excitation signal position candidate tables (55, 56), one of which is selected to be used when the pitch period is equal to or less than a specific value.
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal. The coding unit (1) of the speech coding/decoding apparatus comprises
an excitation signal coding unit (11, 12) for encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains. The excitation signal coding unit (11, 12) includes
a plurality of excitation signal position candidate tables (51, 52), one of which is selected to be used when the pitch period is equal to or less than a specific value.
The decoding unit (2) of the speech coding/decoding apparatus comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal by decoding a plurality of pulse excitation positions and a plurality of excitation signal gains. The excitation signal decoding unit (16, 17) includes
a plurality of excitation signal position candidate tables (55, 56), one of which is selected to be used when the pitch period is equal to or less than a specific value.
A speech coding apparatus separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame.
The speech coding apparatus comprises
an excitation signal coding unit (11, 12) for encoding an excitation signal of pitch-period length based on a plurality of pulse excitation positions and a plurality of excitation signal gains. A code indicating a pulse excitation position (300) beyond the pitch period is reset to indicate a pulse excitation position (310) within the range of the pitch period.
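The resetting of position codes can be pictured with the following minimal sketch. The rule by which an out-of-range code is remapped is not stated in this passage; the wrap-around (modulo) rule used here is purely an assumption made for illustration, as are the function name and the example table.

```python
def reset_position_codes(position_table, pitch_period):
    """Return a table in which every code points inside one pitch period.

    position_table[code] is the pulse position indicated by `code`.
    ASSUMPTION: a position at or beyond the pitch period is wrapped around
    modulo the pitch period; the actual remapping rule may differ.
    """
    if pitch_period <= 0:
        raise ValueError("pitch_period must be positive")
    return [position % pitch_period for position in position_table]


# Eight codes originally spread over 40 samples, pitch period of 12 samples.
reset_table = reset_position_codes([0, 5, 10, 15, 20, 25, 30, 35],
                                   pitch_period=12)
```

After the reset, codes that would otherwise never be selected indicate usable positions within the pitch period.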
A speech decoding apparatus according to the present invention, which generates an output speech by decoding an excitation signal encoded at each frame, comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal of pitch-period length by decoding a plurality of pulse excitation positions and a plurality of excitation signal gains, wherein a code indicating a pulse excitation position (300) beyond the pitch period is reset to indicate a pulse excitation position (310) within the range of the pitch period.
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal.
The coding unit (1) of the speech coding/decoding apparatus comprises
an excitation signal coding unit (11, 12) for encoding the excitation signal of pitch-period length based on a plurality of pulse excitation positions and a plurality of excitation signal gains, wherein a code indicating a pulse excitation position (300) beyond the pitch period is reset to indicate a pulse excitation position (310) within the range of the pitch period.
The decoding unit (2) of the speech coding/decoding apparatus comprises
an excitation signal decoding unit (16, 17) for generating an excitation signal of pitch-period length by decoding a plurality of pulse excitation positions and a plurality of excitation signal gains, wherein a code indicating a pulse excitation position (300) beyond the pitch period is reset to indicate a pulse excitation position (310) within the range of the pitch period.
A speech coding apparatus according to the present invention separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame. The speech coding apparatus comprises
a first excitation signal coding unit (10, 11, 12) for encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains,
a second excitation signal coding unit (57, 58) different from the first excitation signal coding unit, and
a selecting unit (59) for comparing an encoding-distortion output from the first excitation signal coding unit with an encoding-distortion output from the second excitation signal coding unit, and selecting one of the first excitation signal coding unit and the second excitation signal coding unit which has a smaller encoding-distortion.
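The comparison performed by the selecting unit (59) can be pictured as follows. This is a minimal sketch under stated assumptions: the encoding-distortion is taken to be the squared error between a target signal and each coding unit's synthesized signal, ties are resolved in favor of the first coding unit, and all names and sample values are hypothetical.

```python
def encoding_distortion(target, synthesized):
    """ASSUMPTION: distortion is the squared error against the target."""
    return sum((t - s) ** 2 for t, s in zip(target, synthesized))


def select_coding_unit(target, first_synthesized, second_synthesized):
    """Return the label and distortion of the unit with smaller distortion."""
    d_first = encoding_distortion(target, first_synthesized)
    d_second = encoding_distortion(target, second_synthesized)
    # ASSUMPTION: on a tie, the first coding unit is kept.
    if d_first <= d_second:
        return "first", d_first
    return "second", d_second


target = [1.0, 0.0, -1.0, 0.5]
mode, distortion = select_coding_unit(target,
                                      [1.0, 0.0, -1.0, 0.0],   # first unit
                                      [0.0, 0.0, -1.0, 0.5])   # second unit
```

Because the choice is made from the distortions actually obtained, it does not depend on a fixed pitch-periodicity threshold.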
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal. The coding unit (1) of the speech coding/decoding apparatus comprises
a first excitation signal coding unit (10, 11, 12) for encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains,
a second excitation signal coding unit (57, 58) different from the first excitation signal coding unit, and
a selecting unit (59) for comparing an encoding-distortion output from the first excitation signal coding unit with an encoding-distortion output from the second excitation signal coding unit, and selecting one of the first excitation signal coding unit and the second excitation signal coding unit which has a smaller encoding-distortion. The decoding unit (2) of the speech coding/decoding apparatus comprises
a first decoding unit (15, 16, 17) corresponding to the first excitation signal coding unit,
a second decoding unit (60, 61) corresponding to the second excitation signal coding unit, and
a controlling unit (330) for determining which of the first decoding unit and the second decoding unit to use, based on the selection result produced by the selecting unit.
A speech coding apparatus according to the present invention separates an input speech into spectrum-envelope information and an excitation signal, and encodes the excitation signal at each frame. The speech coding apparatus comprises
a plurality of excitation signal codebooks (63, 64) composed of a plurality of codewords (340) indicating excitation signal position information and a plurality of codewords (350) indicating excitation signal waveforms, wherein every excitation signal position information represented by each of the plurality of codewords in each of the plurality of excitation signal codebooks is different, and
an excitation signal coding unit (11) for encoding the excitation signal by using the plurality of excitation signal codebooks.
In the speech coding apparatus according to the present invention, the number of the plurality of codewords (340) indicating excitation signal position information in the plurality of excitation signal codebooks (63, 64) is controlled depending upon a pitch period.
A speech decoding apparatus according to the present invention which generates an output speech by decoding an excitation signal encoded at each frame comprises
a plurality of excitation signal codebooks (63, 64) composed of a plurality of codewords (340) indicating excitation signal position information and a plurality of codewords (350) indicating excitation signal waveforms, wherein every excitation signal position information represented by each of the plurality of codewords in each of the plurality of excitation signal codebooks is different, and
an excitation signal decoding unit (16) for decoding the excitation signal by using the plurality of excitation signal codebooks.
A speech coding/decoding apparatus according to the present invention has a coding unit (1) for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, and a decoding unit (2) for generating an output speech by decoding an encoded excitation signal. The coding unit (1) of the speech coding/decoding apparatus comprises
a plurality of excitation signal codebooks (63, 64) composed of a plurality of codewords (340) indicating excitation signal position information and a plurality of codewords (350) indicating excitation signal waveforms, wherein every excitation signal position information represented by each of the plurality of codewords in each of the plurality of excitation signal codebooks is different, and
an excitation signal coding unit (11) for encoding the excitation signal by using the plurality of excitation signal codebooks. The decoding unit (2) of the speech coding/decoding apparatus comprises
a plurality of excitation signal codebooks having coincident contents with the plurality of excitation signal codebooks (63, 64), and
an excitation signal decoding unit (16) for decoding the excitation signal by using the plurality of excitation signal codebooks.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, comprises a step of
encoding the excitation signal based on a plurality of excitation signal positions and a plurality of excitation signal gains. The encoding step includes steps of
calculating a temporary gain for each of excitation signal position candidates,
searching each of a plurality of excitation signal positions based on the temporary gain, and
encoding the plurality of excitation signal gains based on each of the plurality of searched excitation signal positions.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, comprises steps of
calculating an impulse response of a synthesis filter based on the spectrum-envelope information,
adding a specific excitation signal phase characteristic to the impulse response, and
encoding the excitation signal into a plurality of pulse excitation positions and a plurality of excitation signal gains, by using the impulse response to which the specific excitation signal phase characteristic has been added.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, comprises a step of
encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains. The encoding step includes a step of
switching one of excitation signal position candidate tables to be in use, when the pitch period is equal to or less than a specific value.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, comprises a step of
encoding an excitation signal of pitch-period length, based on a plurality of pulse excitation positions and a plurality of excitation signal gains. The encoding step includes a step of
resetting a code indicating a pulse excitation position beyond the pitch period to indicate a pulse excitation position within the range of the pitch period.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal, and encoding the excitation signal at each frame, comprises steps of
encoding the excitation signal based on a plurality of pulse excitation positions and a plurality of excitation signal gains,
encoding the excitation signal in a manner different from said encoding step, and
selecting one of the encoding steps which has a smaller encoding-distortion by comparing encoding-distortions output in the encoding steps.
According to the present invention, a speech coding method, for separating an input speech into spectrum-envelope information and an excitation signal and encoding the excitation signal at each frame, comprises a step of
encoding the excitation signal by using a plurality of excitation signal codebooks composed of a plurality of codewords indicating excitation signal position information and a plurality of codewords indicating excitation signal waveforms, wherein every excitation signal position information represented by each of the plurality of codewords in each of the plurality of excitation signal codebooks is different.
In the speech coding apparatus according to the present invention, the temporary gain calculating unit (40) selects each of the excitation signal position candidates in turn and calculates the temporary gain for the selected excitation signal position candidate on the supposition that a single pulse is set in the frame at that candidate.
In the speech coding apparatus according to the present invention, the gain coding unit (12) calculates an excitation signal gain, different from the temporary gain, for each of the plurality of excitation signal positions determined by the excitation signal position search unit (41), and encodes the calculated excitation signal gain.
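The temporary gain calculation and the position search based on it can be pictured with the following minimal sketch. It is illustrative only: the single-pulse gain formula (correlation divided by energy), the exhaustive search with one position per candidate list, the squared-error criterion, and all names are assumptions rather than the disclosed implementation.

```python
from itertools import product


def shifted_response(h, position, frame_length):
    """Synthesis output for a unit pulse at `position`, cut at the frame end."""
    out = [0.0] * frame_length
    for k, h_k in enumerate(h):
        if position + k < frame_length:
            out[position + k] = h_k
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def temporary_gains(target, h, candidates):
    """Gain that would be optimal if one pulse sat alone at each candidate."""
    gains = {}
    for m in candidates:
        s = shifted_response(h, m, len(target))
        energy = dot(s, s)
        gains[m] = dot(target, s) / energy if energy > 0.0 else 0.0
    return gains


def search_positions(target, h, tracks):
    """Choose one position per track, using temporary gains as amplitudes."""
    gains = temporary_gains(target, h, sorted({m for t in tracks for m in t}))
    best, best_error = None, float("inf")
    for combo in product(*tracks):
        synth = [0.0] * len(target)
        for m in combo:
            s = shifted_response(h, m, len(target))
            synth = [a + gains[m] * b for a, b in zip(synth, s)]
        error = sum((t - y) ** 2 for t, y in zip(target, synth))
        if error < best_error:
            best, best_error = combo, error
    return best, gains


h = [1.0, 0.5]                      # hypothetical impulse response
target = [0.0] * 16                 # pulses of 2.0 at sample 3, -1.0 at 10
target[3], target[4], target[10], target[11] = 2.0, 1.0, -1.0, -0.5
positions, gains = search_positions(target, h, [[0, 3, 6], [8, 10, 12]])
```

In this sketch the search evaluates each combination with amplitudes that differ per position instead of a fixed amplitude. A gain coding stage would afterwards calculate and encode the final gains for the chosen positions; that stage is not shown.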