1. Field of the Invention
The present invention relates to a speech coding apparatus for compressing a digital speech signal to an equivalent signal having a smaller amount of information, and a speech decoding apparatus for decoding speech code generated by the speech coding apparatus or the like to reconstruct a digital speech signal.
2. Description of the Prior Art
Prior art speech coding apparatuses separate an input speech into spectral envelope information and an excitation source and encode them on a frame-by-frame basis, where each frame has a certain length, so as to generate speech code, and prior art speech decoding apparatuses decode the speech code and generate decoded speech by combining the spectral envelope information and the excitation source using a synthesis filter. Typical prior art speech coding apparatuses and speech decoding apparatuses employ a code-excited linear prediction (CELP) coding technique.
Referring now to FIG. 14, there is illustrated a block diagram showing the structure of a prior art CELP speech coding apparatus. FIG. 15 is a block diagram showing the structure of a prior art CELP speech decoding apparatus. In FIG. 14, reference numeral 1 denotes an input speech, numeral 2 denotes a linear prediction analyzer, numeral 3 denotes a linear prediction coefficient coding unit, numeral 4 denotes an adaptive excitation source coding unit, numeral 5 denotes a driving excitation source coding unit, numeral 6 denotes a gain coding unit, numeral 7 denotes a multiplexer, and numeral 8 denotes speech code. In FIG. 15, reference numeral 9 denotes a separator, numeral 10 denotes a linear prediction coefficient decoding unit, numeral 11 denotes an adaptive excitation source decoding unit, numeral 12 denotes a driving excitation source decoding unit, numeral 13 denotes a gain decoding unit, numeral 14 denotes a synthesis filter, and numeral 15 denotes output speech.
In operation, the prior art speech coding apparatus performs its coding operation on a frame-by-frame basis, where each frame has a duration ranging from 5 to 50 msec. Similarly, the prior art speech decoding apparatus performs its decoding operation on a frame-by-frame basis. In the speech coding apparatus of FIG. 14, the input speech 1 is applied to the linear prediction analyzer 2, the adaptive excitation source coding unit 4, and the gain coding unit 6. The linear prediction analyzer 2 analyzes the input speech 1 so as to extract a linear prediction coefficient that is the spectral envelope information of the input speech 1. The linear prediction coefficient coding unit 3 then encodes the linear prediction coefficient and furnishes the coded result to the multiplexer 7. The linear prediction coefficient coding unit 3 also quantizes the linear prediction coefficient and furnishes the quantized linear prediction coefficient to the adaptive excitation source coding unit 4, the driving excitation source coding unit 5, and the gain coding unit 6 for coding an excitation source separated from the input speech 1.
The adaptive excitation source coding unit 4 stores a past excitation source (or signal) of a certain length as an adaptive excitation source code book (i.e., adaptive code book) and generates a plurality of adaptive excitation source codes each of which is a multiple-bit binary value. For each of the plurality of adaptive excitation source codes, the adaptive excitation source coding unit 4 also generates a time-series vector that is a series of pitch-cycles each of which includes the past excitation source. The adaptive excitation source coding unit 4 then multiplies each of the plurality of time-series vectors by an appropriate gain and allows the multiplication result to pass through a synthesis filter (not shown) using the quantized linear prediction coefficient from the linear prediction coefficient coding unit 3 so as to generate a temporary synthesized speech. The adaptive excitation source coding unit 4 calculates and examines the distance between the temporary synthesized speech and the input speech 1 and selects, from the plurality of adaptive excitation source codes, the one adaptive excitation source code which minimizes the distance. The adaptive excitation source coding unit 4 then delivers the selected adaptive excitation source code to the multiplexer 7. The adaptive excitation source coding unit 4 also furnishes the time-series vector associated with the selected adaptive excitation source code as an adaptive excitation source to the driving excitation source coding unit 5 and the gain coding unit 6. The adaptive excitation source coding unit 4 further delivers either the input speech 1 or a signal obtained by subtracting synthesized speech generated from the adaptive excitation source from the input speech 1, as a signal to be coded, to the driving excitation source coding unit 5.
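The generation of a time-series vector from the adaptive code book can be illustrated with a minimal Python sketch. It assumes that each adaptive excitation source code maps to an integer repetition period (here called `lag`) and that the vector is formed by repeating the most recent `lag` samples of the past excitation source; the function name and the integer-only period are assumptions made for illustration, not details taken from the apparatus of FIG. 14.

```python
def adaptive_vector(past_excitation, lag, frame_len):
    """Build one frame by cyclically repeating the last `lag` samples.

    past_excitation: list of past excitation samples (the adaptive code book)
    lag:             assumed integer repetition period, in samples
    frame_len:       number of samples in the frame to generate
    """
    if not 0 < lag <= len(past_excitation):
        raise ValueError("lag must be between 1 and the code book length")
    cycle = past_excitation[-lag:]          # most recent pitch-cycle
    return [cycle[n % lag] for n in range(frame_len)]


# A period shorter than the frame repeats the cycle within the frame.
print(adaptive_vector([1, 2, 3, 4, 5], lag=2, frame_len=5))
```

In a coder, one such vector would be generated per candidate code and the candidate whose synthesized speech is closest to the input speech would be kept.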
The driving excitation source coding unit 5 contains a driving excitation source code book and generates a plurality of driving excitation source codes each of which is a multiple-bit binary value. For each of the plurality of driving excitation source codes, the driving excitation source coding unit 5 also reads a time-series vector from the driving excitation source code book. The driving excitation source coding unit 5 then multiplies both each of the plurality of time-series vectors and the adaptive excitation source output from the adaptive excitation source coding unit 4 by respective appropriate gains, calculates the sum of them, and allows the sum to pass through a synthesis filter (not shown) using the quantized linear prediction coefficient from the linear prediction coefficient coding unit 3 so as to generate a temporary synthesized speech. The driving excitation source coding unit 5 calculates and examines the distance between the temporary synthesized speech and the signal to be coded, which is either the input speech 1 or the signal obtained by subtracting the synthesized speech generated from the adaptive excitation source from the input speech 1, and selects, from the plurality of driving excitation source codes, the one driving excitation source code which minimizes the distance. The driving excitation source coding unit 5 then delivers the selected driving excitation source code to the multiplexer 7. The driving excitation source coding unit 5 also furnishes the time-series vector associated with the selected driving excitation source code as a driving excitation source to the gain coding unit 6.
The gain coding unit 6 stores a gain code book therein and generates a plurality of gain codes, each of which is a multiple-bit binary value. For each of the plurality of gain codes, the gain coding unit 6 also reads a gain vector sequentially from the gain code book. The gain coding unit 6 then multiplies both the adaptive excitation source output from the adaptive excitation source coding unit 4 and the driving excitation source output from the driving excitation source coding unit 5 by two elements of the gain vector, respectively, and calculates the sum of them so as to generate an excitation source and allows the excitation source to pass through a synthesis filter (not shown) using the quantized linear prediction coefficient from the linear prediction coefficient coding unit 3 so as to generate a temporary synthesized speech. The gain coding unit 6 calculates and examines the distance between the temporary synthesized speech and the input speech 1, and selects one gain code which minimizes the distance from the plurality of gain codes. The gain coding unit 6 then delivers the selected gain code to the multiplexer 7. The gain coding unit 6 also furnishes the generated excitation source corresponding to the selected gain code to the adaptive excitation source coding unit 4.
Finally, the adaptive excitation source coding unit 4 updates the adaptive code book located therein using the excitation source corresponding to the gain code selected by the gain coding unit 6.
The multiplexer 7 multiplexes the linear prediction coefficient code from the linear prediction coefficient coding unit 3, the adaptive excitation source code from the adaptive excitation source coding unit 4, the driving excitation source code from the driving excitation source coding unit 5, and the gain code from the gain coding unit 6 into a speech code 8, and outputs the speech code 8.
In the speech decoding apparatus of FIG. 15, the separator 9 separates the speech code 8 from the speech coding apparatus into the linear prediction coefficient code, the adaptive excitation source code, the driving excitation source code, and the gain code. The separator 9 then furnishes them to the linear prediction coefficient decoding unit 10, the adaptive excitation source decoding unit 11, the driving excitation source decoding unit 12, and the gain decoding unit 13, respectively. The linear prediction coefficient decoding unit 10 decodes the linear prediction coefficient code from the separator 9 so as to reconstruct the linear prediction coefficient. The linear prediction coefficient decoding unit 10 then sets and outputs the linear prediction coefficient as a filter coefficient for the synthesis filter 14.
The adaptive excitation source decoding unit 11 stores a past excitation source as an adaptive excitation source code book. The adaptive excitation source decoding unit 11 also generates a time-series vector that is a series of pitch-cycles each of which includes the past excitation source, as an adaptive excitation source, the time-series vector being associated with the adaptive excitation source code separated by the separator 9. The driving excitation source decoding unit 12 generates a time-series vector as a driving excitation source, the time-series vector being associated with the driving excitation source code separated by the separator 9. The gain decoding unit 13 also generates a gain vector associated with the gain code separated by the separator 9. The speech decoding apparatus then multiplies both the first and second time-series vectors from the adaptive excitation source decoding unit and the driving excitation source decoding unit by two elements of the gain vector from the gain decoding unit, respectively, so as to generate an excitation source and allows the excitation source to pass through the synthesis filter 14 so as to generate output speech 15. Finally, the adaptive excitation source decoding unit 11 updates the adaptive excitation source code book located therein using the generated excitation source.
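The decoder-side combination of the two time-series vectors and the synthesis filtering can be sketched as follows. This is a simplified Python illustration under stated assumptions: the synthesis filter is taken to be the all-pole filter 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p (the sign convention is an assumption), the filter starts from a zero state, and the function names are hypothetical.

```python
def synthesize(excitation, lpc):
    """All-pole synthesis filter: y[n] = x[n] - sum_j lpc[j-1] * y[n-j]."""
    history = [0.0] * len(lpc)              # past outputs, most recent first
    out = []
    for x in excitation:
        y = x - sum(a * h for a, h in zip(lpc, history))
        out.append(y)
        history = ([y] + history)[:len(lpc)]
    return out


def decode_frame(adaptive, driving, gain_vector, lpc):
    """Scale both vectors by the two gain elements, sum, and filter."""
    g_adaptive, g_driving = gain_vector
    excitation = [g_adaptive * a + g_driving * d
                  for a, d in zip(adaptive, driving)]
    return excitation, synthesize(excitation, lpc)


excitation, speech = decode_frame([1, 0, 0], [0, 1, 0], (2.0, 3.0), [-0.5])
print(excitation, speech)
```

The returned excitation is what would be written back into the adaptive excitation source code book for the next frame.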
Next, a description will be made as to an improvement in the prior art CELP speech coding and decoding apparatuses mentioned above. “Basic algorithm of conjugate-structure algebraic CELP (CS-ACELP) speech coder” by A. Kataoka et al., NTT R&D, Vol. 45, April 1996, which will be referred to as Reference 1, discloses a CELP speech coding apparatus and a CELP speech decoding apparatus that use excitation source pulses for coding a driving excitation source with the aim of reducing the amount of calculations and the amount of memory. In this prior art arrangement, the driving excitation source is represented only by information about the locations of a plurality of pulses and information about the polarities of the plurality of pulses. Such an excitation source is called an algebraic excitation source, and provides a good coding performance considering that it has a simple structure. Recently-developed standard coding techniques adopt the algebraic excitation source.
Referring next to FIG. 16, there is illustrated a table listing candidates for the locations of the excitation source pulses employed by the CELP speech coding and decoding apparatuses disclosed in Reference 1. Such a table can be located in both the driving excitation source coding unit 5 of the speech coding apparatus as shown in FIG. 14 and the driving excitation source decoding unit 12 of the speech decoding apparatus as shown in FIG. 15. In Reference 1, the length of frames to be coded when coding excitation sources is 40 samples, and the driving excitation source consists of four pulses. Each of the three pulses numbered 1 to 3 has 8 possible locations as shown in FIG. 16. Therefore, the location of each of the three pulses can be coded in three bits. The remaining pulse numbered 4 has 16 possible locations as shown in FIG. 16. Therefore, the location of the fourth pulse can be coded in four bits. The number of candidates for the location of each of the four excitation source pulses is limited in this way, and the number of bits used for coding the driving excitation source and the number of combinations of the locations of those excitation source pulses are therefore reduced. This results in a reduction in the amount of arithmetic operations without reducing the coding performance.
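The bit counts above follow directly from the numbers of candidate locations. A short Python sketch of the arithmetic, using only the candidate counts stated above (8, 8, 8 and 16); the variable names are illustrative, and polarity bits are not counted here:

```python
from math import prod

# Number of candidate locations for pulses 1 to 4, as stated in the text.
candidates = [8, 8, 8, 16]

# Bits needed to index each pulse location: ceil(log2(n)).
location_bits = [(n - 1).bit_length() for n in candidates]

total_location_bits = sum(location_bits)
combinations = prod(candidates)     # location combinations to be searched

print(location_bits, total_location_bits, combinations)
```

The product of the candidate counts is the number of location combinations that an exhaustive search has to evaluate per frame.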
In accordance with the coding technique as disclosed in Reference 1, the driving excitation source coding unit 5 of the speech coding apparatus of FIG. 14 calculates a correlation between an impulse response (i.e., a synthesized speech generated by a single excitation source pulse) and a signal to be coded, and a cross-correlation between impulse responses (i.e., synthesized speeches respectively generated by single excitation source pulses), stores them as a pre-table therein, and calculates the distance (or coding distortion) by simply calculating sums of the stored values. The driving excitation source coding unit 5 then searches for the pulse locations and polarities that minimize the distance.
The concrete searching method as disclosed in Reference 1 will be described hereinafter. The minimization of the distance is equivalent to the maximization of an evaluation value D given by the following equation:

$$D = \frac{C^2}{E} \qquad (1)$$

where C and E are given by:

$$C = \sum_{k} g(k)\, d(m_k) \qquad (2)$$

$$E = \sum_{k} \sum_{i} g(k)\, g(i)\, \phi(m_k, m_i) \qquad (3)$$

where $m_k$ is the location of the kth pulse, g(k) is the magnitude of the kth pulse, d(x) is the correlation between an impulse response generated when an impulse is placed at the pulse location x and the signal to be coded, and φ(x,y) is the cross-correlation between an impulse response generated when an impulse is placed at the pulse location x and an impulse response generated when an impulse is placed at the pulse location y.
The searching process is carried out by the calculation of the evaluation value D for all combinations of the possible locations of all excitation source pulses.
In addition, simplifying the above equations (2) and (3) by assuming that g(k) has the same sign as $d(m_k)$ and has an absolute value of 1 yields the following equations (4) and (5):

$$C = \sum_{k} d'(m_k) \qquad (4)$$

$$E = \sum_{k} \sum_{i} \phi'(m_k, m_i) \qquad (5)$$

where

$$d'(m_k) = |d(m_k)| \qquad (6)$$

$$\phi'(m_k, m_i) = \operatorname{sign}[d(m_k)]\, \operatorname{sign}[d(m_i)]\, \phi(m_k, m_i) \qquad (7)$$

Thus, d′ and φ′ need only be calculated once, in advance of the search. The evaluation value D for each combination of the locations of the excitation source pulses is then obtained by the simple summations of equations (4) and (5), thereby reducing the amount of arithmetic operations.
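The search of equations (1) and (4) to (7) can be sketched in Python as an exhaustive loop over all location combinations. This is a minimal illustration, not the implementation of Reference 1: the function names, the representation of the location table as a list of candidate lists (`tracks`), and the treatment of a zero correlation as positive are assumptions.

```python
from itertools import product


def sign(v):
    """Sign of v, treating zero as positive (an assumption)."""
    return 1.0 if v >= 0 else -1.0


def search_pulses(d, phi, tracks):
    """Return (locations, polarities, D) maximizing D = C^2 / E.

    d:      d[x], correlation between the response at x and the target
    phi:    phi[x][y], cross-correlation between responses at x and y
    tracks: one list of candidate locations per pulse
    """
    n = len(d)
    d_abs = [abs(v) for v in d]                                  # eq. (6)
    phi_s = [[sign(d[x]) * sign(d[y]) * phi[x][y]                # eq. (7)
              for y in range(n)] for x in range(n)]
    best = None
    for combo in product(*tracks):
        c = sum(d_abs[m] for m in combo)                         # eq. (4)
        e = sum(phi_s[mk][mi] for mk in combo for mi in combo)   # eq. (5)
        if e <= 0:
            continue                        # skip degenerate energies
        score = c * c / e                                        # eq. (1)
        if best is None or score > best[0]:
            best = (score, combo)
    if best is None:
        raise ValueError("no combination with positive energy")
    score, combo = best
    return list(combo), [sign(d[m]) for m in combo], score


# Toy example: uncorrelated responses (phi is the identity matrix).
identity = [[1.0 if x == y else 0.0 for y in range(4)] for x in range(4)]
print(search_pulses([1.0, -3.0, 2.0, 0.5], identity, [[0, 1], [2, 3]]))
```

Because the polarities are fixed by the sign of d before the search starts, the inner loop contains only table look-ups and additions.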
Japanese patent application publications (TOKKAIHEI) No. 10-232696 and No. 10-312198, and “Improvements in ACELP speech coding based on adaptive pulse locations”, by Tsuchiya et al., Nihon Onkyo Gakkai (The Acoustical Society of Japan) 1999 Shunki Kenkyuu Happyokai Kouen Ronbunshuu vol.I, pp. 213–214, 1999, which will be referred to as Reference 2, disclose configurations for improving the quality of the algebraic excitation source mentioned above.
Japanese patent application publication No. 10-232696 discloses a method of providing a plurality of fixed waveforms and generating a driving excitation source by placing the plurality of fixed waveforms at a plurality of locations coded algebraically, respectively, thereby yielding an output speech with a high quality. Reference 2 studies an arrangement in which a pitch filter is contained in a generating unit for generating a driving excitation source (in Reference 2, an ACELP excitation source). Either the arrangement of the plurality of fixed waveforms or the pitch-filtering process to generate a pitch-filtered driving excitation source can improve the quality of the output speech without increasing the amount of searching operations if it is carried out at the same time that the calculation of impulse responses is done.
Japanese patent application publication No. 10-312198 discloses an arrangement in which the locations of excitation source pulses are searched for while the driving excitation source is made to be orthogonal to the adaptive excitation source when the pitch gain is greater than or equal to a predetermined value.
Referring next to FIG. 17, there is illustrated a block diagram showing in detail the structure of a driving excitation source coding unit 5 of an improved CELP speech coding apparatus disclosed in Japanese patent application publication No. 10-232696 and Reference 2. In the figure, reference numeral 16 denotes a perceptual weighting filter coefficient calculating unit, numerals 17 and 19 denote perceptual weighting filters, numeral 18 denotes a basic response generating unit, numeral 20 denotes a pre-table calculating unit, numeral 21 denotes a searching unit, and numeral 22 denotes an excitation source location table.
Next, the operation of the driving excitation source coding unit 5 will be described. A quantized linear prediction coefficient from a linear prediction coefficient coding unit 3 disposed within the speech coding apparatus as shown in FIG. 14 is applied to the perceptual weighting filter coefficient calculating unit 16 and the basic response generating unit 18. An adaptive excitation source coding unit 4 furnishes, to the perceptual weighting filter 17, a signal to be coded that is either an input speech 1 or a signal obtained by subtracting synthesized speech generated from an adaptive excitation source from the input speech 1. The adaptive excitation source coding unit 4 also delivers the repetition period of the adaptive excitation source converted from an adaptive excitation source code to the basic response generating unit 18.
The perceptual weighting filter coefficient calculating unit 16 then calculates a perceptual weighting filter coefficient using the quantized linear prediction coefficient and sets the calculated perceptual weighting filter coefficient as a filter coefficient intended for the perceptual weighting filters 17 and 19. The perceptual weighting filter 17 performs a filtering process on the input signal to be coded using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.
The basic response generating unit 18 performs pitch filtering on a unit impulse or a fixed waveform using the repetition period of the adaptive excitation source so as to generate a series of cycles each of which includes the unit impulse or the fixed waveform, the repetition period of the series of cycles being equal to that of the adaptive excitation source. The basic response generating unit 18 then allows the generated signal, as an excitation source, to pass through a synthesis filter formed using the quantized linear prediction coefficient to generate synthesized speech, and outputs the synthesized speech as a basic response. The perceptual weighting filter 19 performs a filtering process on the basic response using the filter coefficient set by the perceptual weighting filter coefficient calculating unit 16.
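The pitch filtering of a unit impulse can be illustrated with a short Python sketch. It assumes a one-tap recursive pitch filter x[n] = δ[n] + g·x[n − T], where T is the repetition period in samples and g is a filter gain; the gain parameter, the integer period, and the function name are assumptions for illustration and are not specified in the description above.

```python
def pitch_filtered_impulse(frame_len, period, gain=1.0):
    """Repeat a unit impulse every `period` samples within one frame.

    Implements x[n] = delta[n] + gain * x[n - period] for 0 <= n < frame_len.
    """
    if frame_len < 1 or period < 1:
        raise ValueError("frame_len and period must be positive")
    x = [0.0] * frame_len
    x[0] = 1.0                              # the unit impulse
    for n in range(period, frame_len):
        x[n] += gain * x[n - period]        # copy from one period earlier
    return x


print(pitch_filtered_impulse(7, 3))
```

Passing this signal through the synthesis filter and then the perceptual weighting filter would give the weighted basic response used to build the pre-table.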
The pre-table calculating unit 20 calculates the correlation d(x) between the perceptual weighted signal to be coded and the perceptual weighted basic response when placing the impulse at the location x, and calculates the cross-correlation φ(x,y) between the perceptual weighted basic response when placing the impulse at the location x and the perceptual weighted basic response when placing the impulse at the location y. The pre-table calculating unit 20 then obtains d′(x) and φ′(x,y) according to equations (6) and (7) and stores them as a pre-table.
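The correlation step of the pre-table calculation can be sketched in Python as follows. The sketch assumes that the perceptual weighted basic response for an impulse at location x is the response for location 0 delayed by x samples and truncated at the frame boundary; this shift model and the function names are illustrative assumptions.

```python
def shifted_response(h, x, frame_len):
    """Response h delayed by x samples and truncated to the frame."""
    return [h[n - x] if 0 <= n - x < len(h) else 0.0
            for n in range(frame_len)]


def build_correlations(target, h):
    """Return (d, phi) for every impulse location in the frame.

    target: perceptual weighted signal to be coded (one frame)
    h:      perceptual weighted basic response for an impulse at location 0
    """
    n = len(target)
    resp = [shifted_response(h, x, n) for x in range(n)]
    d = [sum(t * r for t, r in zip(target, resp[x])) for x in range(n)]
    phi = [[sum(a * b for a, b in zip(resp[x], resp[y]))
            for y in range(n)] for x in range(n)]
    return d, phi


d, phi = build_correlations([1.0, -2.0, 0.0], [1.0, 0.5])
print(d)
print(phi)
```

Since φ(x,y) = φ(y,x), a practical implementation would compute and store only one triangle of the matrix; the full matrix is kept here for readability.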
The excitation source location table 22 stores a plurality of candidates for the locations of excitation source pulses, which are similar to those as shown in FIG. 16. The searching unit 21 sequentially reads each of all combinations of the possible locations of the excitation source pulses from the excitation source location table 22 and calculates an evaluation value D for each combination of the possible locations of the excitation source pulses using the pre-table calculated by the pre-table calculating unit 20 according to above-mentioned equations (1), (4) and (5). The searching unit 21 also searches for one combination of the possible locations of the excitation source pulses which maximizes the evaluation value D and furnishes excitation source location code (i.e., indexes of the excitation source location table) indicating the combination of the possible locations of the excitation source pulses and polarity code indicating the polarities of them, as driving excitation source code, to a multiplexer 7 as shown in FIG. 14. The searching unit 21 further delivers one time-series vector associated with the driving excitation source code to a gain coding unit 6 as shown in FIG. 14.
In Japanese patent application publication No. 10-312198, the method of making the driving excitation source orthogonal to the adaptive excitation source is implemented by making the perceptual weighted signal to be coded which is input to the pre-table calculating unit 20 orthogonal to the adaptive excitation source, and contributions associated with the correlation between the adaptive excitation source and each driving excitation source pulse are subtracted from E given by equation (5) in the searching unit 21.
A problem encountered with prior art speech coding apparatuses and prior art speech decoding apparatuses constructed as above is that while the pitch-filtering process to generate a pitch-filtered driving excitation source can improve the coding performance without increasing the amount of searching operations, the use of the repetition period of an adaptive excitation source as the repetition period intended for the pitch-filtering process can degrade the quality of speech code generated when the pitch-period of an input speech is different from the repetition period of the adaptive excitation source.
FIG. 18 shows a relationship between a signal to be coded and the locations of pulses included in each pitch-cycle of a pitch-filtered driving excitation source, when the repetition period of the adaptive excitation source is two times the pitch-period of an input speech, in accordance with a prior art speech coding apparatus and a prior art speech decoding apparatus. FIG. 19 shows a relationship between a signal to be coded and the locations of pulses included in each pitch-cycle of a pitch-filtered driving excitation source, when the repetition period of the adaptive excitation source is one-half the pitch-period of an input speech, in accordance with a prior art speech coding apparatus and a prior art speech decoding apparatus.
The repetition period of the adaptive excitation source is determined such that the coding distortion between a synthesized speech generated based on the adaptive excitation source and the signal to be coded is minimized. Therefore the repetition period of the adaptive excitation source is frequently different from the pitch-period of the input speech that is the period of vibrations of the speaker's vocal cords. In this case, the repetition period of the adaptive excitation source is approximately an integral multiple or submultiple of the pitch-period of the input speech. In many cases, the repetition period of the adaptive excitation source is about two times or one-half the pitch-period.
In FIG. 18, since the speaker's vocal cords vibrate in the same way every other pitch-cycle, it is determined that the repetition period of the adaptive excitation source is about two times as large as the pitch-period of the input speech. When the driving excitation source is coded using the repetition period of the adaptive excitation source, most excitation source pulses are concentrated in the first half of the period of each pitch-cycle. The pitch-filtered driving excitation source that is the series of pitch-cycles thus obtained in the current frame using the repetition period of the adaptive excitation source is as shown in FIG. 18. The use of the excitation source pitch-filtered using a repetition period different from the pitch-period of the input speech can cause a change in the tone quality of the frame and hence instability in the synthesized speech. This disadvantage becomes non-negligible as the bit rate decreases and the amount of information about the driving excitation source therefore decreases. Frames in which the magnitude of the adaptive excitation source is less than that of the driving excitation source have noticeable degradation of the sound quality.
In FIG. 19, since there is a predominance of low-frequency components in the input speech signal and the waveform of the first half of each pitch-cycle of the input speech is similar to that of the second half of each pitch-cycle, it is determined that the repetition period of the adaptive excitation source is about one-half the pitch-period of the input speech. As in the case of FIG. 18, the use of the excitation source pitch-filtered using a repetition period different from the pitch-period of the input speech can cause a change in the tone quality of the frame and hence instability in the synthesized speech.
When the bit rate decreases and the amount of information about the driving excitation source therefore decreases, there is a tendency that the driving excitation source determined such that the waveform distortion (or coding distortion) is minimized has a large error in a band of low magnitudes and the synthesized speech therefore has a large spectral distortion. Such a spectral distortion can be detected as degradation of the sound quality. Although a perceptual weighting process is provided in order to eliminate degradation of the sound quality due to spectral distortions, an enhancement of the perceptual weighting process can cause an increase in the waveform distortion and hence degradation of the sound quality showing a ragged sound. The enhancement of the perceptual weighting process is therefore controlled such that the adverse effect on the sound quality by the waveform distortion has the same level as that by the spectral distortion. However, the spectral distortion is increased when the input speech is a female one, and the perceptual weighting process cannot be controlled so that it is optimized for both male and female speeches.
In prior art configurations, a constant magnitude is provided for a plurality of excitation sources, such as pulses, placed at respective locations within each pitch-cycle included in each frame. There is no use in equalizing the magnitudes of the plurality of excitation sources regardless of the difference in the number of candidates for the location of each of the plurality of excitation sources. In the excitation source location table as shown in FIG. 16, three bits are used for each of the excitation source locations numbered 1 to 3 and four bits are used for the remaining excitation source location numbered 4. It is easily expected by examining a maximum of a correlation between each of the plurality of excitation sources placed at a possible location and the signal to be coded that the excitation source number 4 having the largest number of possible locations has a higher probability of providing the largest correlation. Assume an extreme case where no bit is provided for an excitation source number. In the case where no bit is provided for an excitation source number, i.e., one excitation source is fixed at a certain location, the correlation between the excitation source and the signal to be coded is small while the polarity is provided independently. This means that it is not appropriate to provide a larger magnitude for one excitation source as compared with those provided for other excitation sources. The problem with prior art configurations is thus that the magnitudes of the plurality of excitation sources are not optimized.
Although a prior art configuration is disclosed for providing an individual magnitude for each of the plurality of excitation sources through vector quantization during the gain quantization process, the amount of gain-quantized information increases and the gain quantization process increases in complexity.
The above-mentioned technique of making the driving excitation source orthogonal to the adaptive excitation source causes an increase in the amount of searching operations. Therefore, an increase in the number of combinations of algebraic excitation sources puts an enormous load on the coding process. In particular, when the technique of making the driving excitation source orthogonal to the adaptive excitation source is used in a prior art configuration that generates a driving excitation source by placing a plurality of fixed waveforms or performs a pitch-filtering process to generate a pitch-filtered driving excitation source, the amount of arithmetic operations increases greatly.