The present invention relates to a speech coding and decoding method for encoding and decoding a speech signal at a low bit rate, and to a speech coding and decoding apparatus capable of encoding and decoding a speech signal at a low bit rate.
Conventionally known low bit rate speech coding systems include 2.4 kbps LPC (i.e., Linear Predictive Coding) and 2.4 kbps MELP (i.e., Mixed Excitation Linear Prediction). Both are speech coding systems in compliance with the United States Federal Standard. The former is already standardized as FS-1015. The latter was selected in 1997 and standardized as a version of FS-1015 with improved sound quality.
The following references relate to at least one of the 2.4 kbps LPC system and the 2.4 kbps MELP system.
[1] FEDERAL STANDARD 1015, "ANALOG TO DIGITAL CONVERSION OF VOICE BY 2,400 BIT/SECOND LINEAR PREDICTIVE CODING," Nov. 28, 1984
[2] Federal Information Processing Standards publication, "Analog to Digital Conversion of Voice by 2,400 Bit/Second Mixed Excitation Linear Prediction," May 28, 1998 Draft
[3] L. Supplee, R. Cohn, J. Collura and A. McCree, "MELP: The new federal standard at 2,400 bps," Proc. ICASSP, pp. 1591-1594, 1997
[4] A. McCree and T. Barnwell III, "A Mixed Excitation LPC Vocoder Model for Low Bit Rate Speech Coding," IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 3, No. 4, pp. 242-250, July 1995
[5] D. Thomson and D. Prezas, "SELECTIVE MODELING OF THE LPC RESIDUAL DURING UNVOICED FRAMES: WHITE NOISE OR PULSE EXCITATION," Proc. ICASSP, pp. 3087-3090, 1986
[6] Seishi Sasaki and Masayasu Miyake, "Decoder for a Linear Predictive Analysis/synthesis System," Japanese Patent No. 2,711,737 corresponding to the first Japanese Patent Publication No. 03-123,400 published on May 27, 1991.
First, the principle of the 2.4 kbps LPC system will be explained with reference to FIGS. 18 and 19 (details of the processing can be found in the above reference [1]).
FIG. 18 is a block diagram showing the circuit arrangement of an LPC type speech encoder. A framing unit 11 is a buffer which stores an input speech sample a1 that has been bandpass-limited to the frequency range of 100-3,600 Hz, sampled at the frequency of 8 kHz, and then quantized to the accuracy of at least 12 bits. The framing unit 11 fetches the speech samples (180 samples) for every single speech coding frame (22.5 ms), and sends an output b1 to a speech coding processing section.
Hereinafter, the processing performed for every single speech coding frame will be explained.
A pre-emphasis unit 12 processes the output b1 of the framing unit 11 to emphasize the high-frequency band thereof, and produces a high-frequency band emphasized signal c1. A linear prediction analyzer 13 performs the linear predictive analysis on the received high-frequency band emphasized signal c1 by using the Durbin-Levinson method. The linear prediction analyzer 13 outputs a 10th order reflection coefficient d1 which serves as spectral envelope information. A first quantizer 14 applies scalar quantization to the 10th order reflection coefficient d1 for each order. The first quantizer 14 sends the quantization result e1 of a total of 41 bits to an error correction coding/bit packing unit 19. Table 1 shows the bit allocation for the reflection coefficients of respective orders.
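The Durbin-Levinson recursion named above can be sketched as follows. This is a minimal illustration rather than the reference [1] implementation: the function name `levinson_durbin`, the autocorrelation input `r`, and the sign convention (prediction-error filter A(z) = 1 + a_1 z^-1 + ..., with each reflection coefficient equal to the last filter coefficient at its order) are assumptions made for the example. Other texts use the opposite sign for the reflection coefficients.

```python
def levinson_durbin(r, order):
    """Solve the normal equations from autocorrelation values r[0..order].

    Returns (a, refl): a[0..order] are the prediction-error filter
    coefficients with a[0] == 1.0, and refl holds one reflection
    coefficient per order. Assumes r[0] > 0 (hypothetical helper).
    """
    a = [1.0] + [0.0] * order
    err = r[0]                      # prediction error energy at order 0
    refl = []
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err              # reflection coefficient of order i
        refl.append(k)
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k          # error energy shrinks at each order
    return a, refl


# Autocorrelation of a first-order process with correlation 0.5:
# the second reflection coefficient comes out as zero.
coeffs, reflection = levinson_durbin([1.0, 0.5, 0.25], 2)
```

In the encoder described here, the ten values in `reflection` would correspond to the 10th order reflection coefficient d1 before quantization.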
An RMS (i.e., Root Mean Square) calculator 15 calculates an RMS value representing the level information of the high-frequency band emphasized signal c1 and outputs a calculated RMS value f1. A second quantizer 16 quantizes the RMS value f1 to 5 bits, and outputs a quantized result g1 to the error correction coding/bit packing unit 19.
A pitch detection/voicing unit 17 receives the output b1 of the framing unit 11 and outputs a pitch period h1 (ranging from 20 to 156 samples corresponding to 51-400 Hz) and voicing information i1 (i.e., information for discriminating voiced, unvoiced, and transitional periods). A third quantizer 18 quantizes the pitch period h1 and the voicing information i1 to 7 bits, and outputs a quantized result j1 to the error correction coding/bit packing unit 19. The quantization (i.e., allocation of the pitch information and the voicing information to the 7-bit codes, i.e., a total of 128 codewords) is performed in the following manner. The codeword having 0 in all of the 7 bits and seven codewords having 1 in only one of the 7 bits are allocated to the unvoiced state. The codeword having 1 in all of the 7 bits and seven codewords having 0 in only one of the 7 bits are allocated to the transitional state. Other codewords are used for the voiced state and allocated to the pitch period information.
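The codeword allocation just described depends only on how many of the 7 bits are set, which the following sketch makes explicit. The function name `classify_lpc_codeword` and the state labels are hypothetical; the mapping from the remaining codewords to individual pitch periods is not given here and is not modeled.

```python
def classify_lpc_codeword(code):
    """Classify a 7-bit pitch/voicing codeword by its number of 1 bits."""
    if not 0 <= code < 128:
        raise ValueError("codeword must fit in 7 bits")
    ones = bin(code).count("1")
    if ones <= 1:        # all zeros, or exactly one 1
        return "unvoiced"
    if ones >= 6:        # all ones, or exactly one 0
        return "transitional"
    return "voiced"      # remaining codewords carry the pitch period


# Count how many of the 128 codewords fall into each state.
counts = {}
for code in range(128):
    state = classify_lpc_codeword(code)
    counts[state] = counts.get(state, 0) + 1
```

Under this reading, 8 codewords indicate the unvoiced state, 8 indicate the transitional state, and 112 remain for the voiced state.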
The error correction coding/bit packing unit 19 packs the received information, i.e., all of the quantization result e1, the quantized result g1, and the quantized result j1, into 54 bits per frame to constitute a speech coding information frame. Thus, the error correction coding/bit packing unit 19 outputs a bit stream k1 consisting of 54 bits per frame. The produced speech information bit stream k1 is transmitted to a receiver via a modulator and a wireless device in the case of radio communication.
Table 1 shows the bit allocation per frame. As understood from this table, the error correction coding/bit packing unit 19 transmits the error correction code (20 bits) when the voicing of the current frame does not indicate the voiced state (i.e., when the voicing of the current frame indicates the unvoiced or transitional period), instead of transmitting the 5th to 10th order reflection coefficients. When the current frame is in the unvoiced or transitional period, the information to be error protected is the upper 4 bits of the RMS information and the 1st to 4th order reflection coefficient information. A sync bit of 1 bit is added to each frame.
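As a quick consistency check, the bit counts quoted above for a voiced frame can be added up and converted to a bit rate. The dictionary keys are descriptive labels chosen for this sketch; only the numbers come from the text.

```python
FRAME_SAMPLES = 180      # samples per speech coding frame
SAMPLE_RATE_HZ = 8000    # sampling frequency

# Bits per voiced frame as quoted in the text.
voiced_frame_bits = {
    "reflection coefficients (10 orders)": 41,
    "RMS": 5,
    "pitch/voicing": 7,
    "sync": 1,
}

total_bits = sum(voiced_frame_bits.values())
frame_ms = 1000.0 * FRAME_SAMPLES / SAMPLE_RATE_HZ
bit_rate = total_bits * SAMPLE_RATE_HZ / FRAME_SAMPLES
```

The sum is 54 bits per 22.5 ms frame, which corresponds to 2,400 bits per second.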
Next, a circuit arrangement of an LPC type speech decoder will be explained with reference to FIG. 19.
A bit separating/error correcting decoder 21 receives a speech information bit stream a2 consisting of 54 bits for each frame and separates it into respective parameters. When the current frame is unvoiced or in a voicing transition, the bit separating/error correcting decoder 21 applies the error correction decoding processing to the corresponding bits. As a result of the above processing, the bit separating/error correcting decoder 21 outputs a pitch/voicing information bit b2, a 10th order reflection coefficient information bit e2 and an RMS information bit g2.
A pitch/voicing information decoder 22 decodes the pitch/voicing information bit b2, and outputs a pitch period c2 and voicing information d2. A reflection coefficient decoder 23 decodes the 10th order reflection coefficient information bit e2, and outputs a 10th order reflection coefficient f2. An RMS decoder 24 decodes the RMS information bit g2 and outputs RMS information h2.
A parameter interpolator 25 interpolates the parameters c2, d2, f2 and h2 to improve the reproduced speech quality, and outputs the interpolated result (i.e., interpolated pitch period i2, interpolated voicing information j2, interpolated 10th order reflection coefficient o2, and interpolated RMS information r2, respectively).
Next, an excitation signal m2 is produced in the following manner. A voicing switcher 28 selects a pulse excitation k2 generated from a pulse excitation generator 26 in synchronism with the interpolated pitch period i2 when the interpolated voicing information j2 indicates the voiced state. On the other hand, the voicing switcher 28 selects a white noise l2 generated from a noise generator 27 when the interpolated voicing information j2 indicates the unvoiced state. Meanwhile, when the interpolated voicing information j2 indicates the transitional state, the voicing switcher 28 selects the pulse excitation k2 for the voiced portion in this transitional frame and selects the white noise (i.e., pseudo-random excitation) l2 for the unvoiced portion in this transitional frame. In this case, the border between the voiced portion and the unvoiced portion in the same transitional frame is determined by the parameter interpolator 25. The pitch period information i2, used in this case for generating the pulse excitation k2, is the pitch period information of an adjacent voiced frame. An output of the voicing switcher 28 becomes the excitation signal m2.
An LPC synthesis filter 30 is an all-pole filter with a coefficient equal to the linear prediction coefficient p2. The LPC synthesis filter 30 adds the spectral envelope information to the excitation signal m2, and outputs a resulting signal n2. The linear prediction coefficient p2, serving as the spectral envelope information, is calculated by a linear prediction coefficient calculator 29 based on the interpolated reflection coefficient o2. For the voiced speech, the LPC synthesis filter 30 acts as a 10th order all-pole filter with the 10th order linear prediction coefficient p2. For the unvoiced speech, the LPC synthesis filter 30 acts as a 4th order all-pole filter with the 4th order linear prediction coefficient p2.
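The conversion performed by the linear prediction coefficient calculator 29 is commonly done with a step-up recursion, sketched below. The function name `reflection_to_lpc` and the sign convention (A(z) = 1 + a_1 z^-1 + ..., with the reflection coefficient of each order equal to the last coefficient at that order) are assumptions for illustration; the text does not specify the convention used.

```python
def reflection_to_lpc(refl):
    """Convert reflection coefficients to prediction-error filter coefficients.

    Returns a list a with a[0] == 1.0 and len(a) == len(refl) + 1.
    """
    a = [1.0]
    for k in refl:
        prev = a
        order = len(prev)            # index of the coefficient being added
        a = prev + [0.0]             # new list, prev is left unchanged
        for j in range(1, order):
            a[j] = prev[j] + k * prev[order - j]
        a[order] = k
    return a


# Example with ten hypothetical reflection coefficients.
refl10 = [0.5, 0.2, -0.1, 0.05, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
voiced_coeffs = reflection_to_lpc(refl10)        # 10th order, voiced speech
unvoiced_coeffs = reflection_to_lpc(refl10[:4])  # 4th order, unvoiced speech
```

Truncating the reflection coefficient list to its first four entries, as in the last line, yields the 4th order filter used for unvoiced speech without recomputing anything else.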
A gain adjuster 31 adjusts the gain of the output n2 of the LPC synthesis filter 30 by using the interpolated RMS information r2, and generates a gain-adjusted output q2. Finally, a de-emphasis unit 32 processes the gain-adjusted output q2 in a manner inverse to the processing of the previously described pre-emphasis unit 12 to output a reproduced speech s2.
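The inverse relationship between the pre-emphasis unit 12 and the de-emphasis unit 32 can be illustrated with a first-order filter pair. The filter form and the coefficient 0.9375 are assumptions for this sketch and are not stated in the text; the function names are hypothetical.

```python
def pre_emphasis(x, coeff=0.9375):
    """First-order high-frequency emphasis: y[n] = x[n] - coeff * x[n-1]."""
    y, prev = [], 0.0
    for sample in x:
        y.append(sample - coeff * prev)
        prev = sample
    return y


def de_emphasis(y, coeff=0.9375):
    """Inverse filter: x[n] = y[n] + coeff * x[n-1]."""
    x, prev = [], 0.0
    for sample in y:
        prev = sample + coeff * prev
        x.append(prev)
    return x


signal = [0.0, 1.0, 0.5, -0.25, 0.75, -1.0]
restored = de_emphasis(pre_emphasis(signal))
```

Applying the two filters in sequence returns the original samples, which is the sense in which the de-emphasis undoes the pre-emphasis.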
The above-described LPC system has the following problems (refer to the above reference [4]).
Problem A: The LPC system selectively assigns one of the voiced state, the unvoiced state and the transitional state to each frame over the entire frequency range. However, the excitation signal of natural speech comprises both voiced-natured bands and unvoiced-natured bands when carefully observed in respective small frequency bands. Accordingly, once a frame is identified as the voiced state in the LPC system, there is the possibility that a portion to be excited by the noise may be erroneously excited by the pulse. A buzz sound will be caused in this case. This is particularly noticeable in the higher frequency range.
Problem B: In the transitional period from the unvoiced state to the voiced state, the excitation signal may comprise an aperiodic pulse. However, according to the LPC system, it is impossible to express an aperiodic pulse excitation in the transitional period. A tone noise will be caused accordingly.
In this manner, the LPC system may produce the buzz sound and the tone noise, and therefore has the problem that the reproduced speech sounds mechanical and is hard to listen to.
To solve the above-described problems, the MELP system has been proposed as a system capable of improving the sound quality (refer to the above references [2] to [4]).
First, the sound quality improvement realized by the MELP system will be explained with reference to FIGS. 20A to 20C. As shown in FIG. 20A, the natural speech consists of a plurality of frequency band components when separated into smaller frequency bands on the frequency axis. Among them, a periodic pulse component is indicated by the white portion. A noise component is indicated by the black portion. When a large part of a concerned frequency band is occupied by the white portion (i.e., by the periodic pulse component), this band is in the voiced state. On the other hand, when a large part of a concerned frequency band is occupied by the black portion (i.e., by the noise component), this band is in the unvoiced state. The reason why the sound produced by the LPC vocoder becomes mechanical as described above is believed to be that, over the entire frequency range, the excitation of the voiced frame is expressed by the periodic pulse components while the excitation of the unvoiced frame is expressed by the noise components, as shown in FIG. 20B. In the case of the transitional frame, the frame is separated into a voiced state and an unvoiced state on the time axis. To solve this problem, the MELP system applies a mixed excitation by switching the voiced state and the unvoiced state for each sub-band, i.e., each of five consecutive frequency bands, in a single frame, as shown in FIG. 20C.
This method is effective in solving the above-described problem "A" caused in the LPC system and also in reducing the buzz sound involved in the reproduced speech.
Furthermore, to solve the above-described problem "B" caused in the LPC system, the MELP system obtains the aperiodic pulse information and transmits the obtained information to a decoder to produce an aperiodic pulse excitation.
Moreover, to improve the sound quality of the reproduced speech, the MELP system employs an adaptive spectral enhancement filter and a pulse dispersion filter and also utilizes the harmonics amplitude information. Table 2 summarizes the effects of the means employed in the MELP system.
Next, the arrangement of the 2.4 kbps MELP system will be explained with reference to FIGS. 21 and 22 (details of the processing can be found in the above reference [2]).
FIG. 21 is a block diagram showing the circuit arrangement of an MELP speech encoder.
A framing unit 41 is a buffer which stores an input speech sample a3 that has been bandpass-limited to the frequency range of 100-3,800 Hz, sampled at the frequency of 8 kHz, and then quantized to the accuracy of at least 12 bits. The framing unit 41 fetches the speech samples (180 samples) for every single speech coding frame (22.5 ms), and sends an output b3 to a speech coding processing section.
Hereinafter, the processing performed for every single speech coding frame will be explained.
A gain calculator 42 calculates a logarithm of the RMS value serving as the level information of the output b3, and outputs a resulting logarithmic RMS value c3. This processing is performed for each of the first half and the second half of every single frame. Namely, the gain calculator 42 produces two logarithmic RMS values per frame. A first quantizer 43 linearly quantizes the logarithmic RMS value c3 to 3 bits for the first half of the frame and to 5 bits for the second half of the frame. Then, the first quantizer 43 outputs the resulting quantized data d3 to an error-correction coding/bit packing unit 70.
A linear prediction analyzer 44 performs the linear prediction analysis on the output b3 of the framing unit 41 by using the Durbin-Levinson method, and outputs a 10th order linear prediction coefficient e3 which serves as spectral envelope information. An LSF coefficient calculator 45 converts the 10th order linear prediction coefficient e3 into a 10th order LSF (i.e., Line Spectrum Frequencies) coefficient f3. The LSF coefficient is a characteristic parameter equivalent to the linear prediction coefficient but superior in both quantization characteristics and interpolation characteristics. Hence, many recent speech coding systems employ the LSF coefficient. A second quantizer 46 quantizes the 10th order LSF coefficient f3 to 25 bits by using a multistage (four stages) vector quantization. The second quantizer 46 sends a resulting quantized LSF coefficient g3 to the error-correction coding/bit packing unit 70.
A pitch detector 54 obtains an integer pitch period from the signal components of 1 kHz or less contained in the output b3 of the framing unit 41. The output b3 of the framing unit 41 is entered into an LPF (i.e., low-pass filter) 55 to produce a bandpass-limited output q3 of 500 Hz or less. The pitch detector 54 obtains a fractional pitch period r3 based on the integer pitch period and the bandpass-limited output q3, and outputs the obtained fractional pitch period r3. The pitch period is given or defined as a delay amount which maximizes a normalized auto-correlation function. The pitch detector 54 outputs a maximum value o3 of the normalized auto-correlation function at this moment. The maximum value o3 of the normalized auto-correlation function serves as information representing the periodic strength of the input signal b3. This information is used in a later-described aperiodic flag generator 56. Furthermore, the maximum value o3 of the normalized auto-correlation function is corrected in a later-described correlation function corrector 53. Then, a corrected maximum value n3 of the normalized auto-correlation function is sent to the error-correction coding/bit packing unit 70 to make the voiced/unvoiced judgement of the entire frequency range. When the corrected maximum value n3 of the normalized auto-correlation function is equal to or smaller than a threshold (=0.6), it is judged that a current frame is an unvoiced state. Otherwise, it is judged that the current frame is a voiced state.
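The definition of the pitch period as the delay that maximizes a normalized auto-correlation function can be sketched as an integer-lag search. This is a simplified illustration only: the function names, the search range of 20 to 160 samples, and the exact normalization are assumptions, and the low-pass filtering and fractional refinement performed by the pitch detector 54 are omitted.

```python
import math


def normalized_autocorr(s, lag):
    """Normalized auto-correlation of s at an integer lag (value in [-1, 1])."""
    n = len(s) - lag
    cross = sum(s[i] * s[i + lag] for i in range(n))
    e0 = sum(s[i] * s[i] for i in range(n))
    e1 = sum(s[i + lag] * s[i + lag] for i in range(n))
    denom = math.sqrt(e0 * e1)
    return cross / denom if denom > 0.0 else 0.0


def find_pitch(s, min_lag=20, max_lag=160):
    """Return (lag, value) for the first lag that maximizes the correlation."""
    best_lag, best_val = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        val = normalized_autocorr(s, lag)
        if val > best_val:           # strict test keeps the smallest best lag
            best_lag, best_val = lag, val
    return best_lag, best_val


# A pulse train with a period of 50 samples.
pulses = [1.0 if i % 50 == 0 else 0.0 for i in range(400)]
lag, peak = find_pitch(pulses)
is_voiced = peak > 0.6               # threshold quoted in the text
```

For a strictly periodic input the maximum approaches 1.0, so the frame is judged voiced; for noise-like input the maximum stays low and the frame falls at or below the 0.6 threshold.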
A third quantizer 57 receives the fractional pitch period r3 produced from the pitch detector 54 to convert it into a logarithmic value, and then linearly quantizes the logarithmic value by using 99 levels. The resulting quantized data s3 is sent to the error-correction coding/bit packing unit 70.
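Quantizing the logarithm of the pitch period uniformly over 99 levels can be sketched as follows. The pitch range of 20 to 160 samples, the base-10 logarithm, and the function names are assumptions for illustration; only the 99-level uniform quantization of the logarithmic value comes from the text.

```python
import math

PITCH_MIN = 20.0     # assumed lower bound of the pitch period (samples)
PITCH_MAX = 160.0    # assumed upper bound of the pitch period (samples)
LEVELS = 99          # number of quantization levels stated in the text


def quantize_pitch(pitch):
    """Map a fractional pitch period to an index in 0..LEVELS-1."""
    pitch = min(max(pitch, PITCH_MIN), PITCH_MAX)
    lo, hi = math.log10(PITCH_MIN), math.log10(PITCH_MAX)
    return round((math.log10(pitch) - lo) / (hi - lo) * (LEVELS - 1))


def dequantize_pitch(index):
    """Map an index back to a pitch period on the logarithmic grid."""
    lo, hi = math.log10(PITCH_MIN), math.log10(PITCH_MAX)
    return 10.0 ** (lo + index * (hi - lo) / (LEVELS - 1))


index = quantize_pitch(57.5)
decoded = dequantize_pitch(index)
```

Because the grid is uniform in the logarithmic domain, the relative error of the decoded pitch period is roughly constant over the whole range, instead of growing for short pitch periods.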
A total of four BPFs (i.e., band pass filters) 58, 59, 60 and 61 are provided to produce bandpass-limited signals of different frequency ranges. More specifically, the first BPF 58 receives the output b3 of the framing unit 41 and produces a bandpass-limited output t3 in the frequency range of 500-1,000 Hz. The second BPF 59 receives the output b3 of the framing unit 41 and produces a bandpass-limited output u3 in the frequency range of 1,000-2,000 Hz. The third BPF 60 receives the output b3 of the framing unit 41 and produces a bandpass-limited output v3 in the frequency range of 2,000-3,000 Hz. And, the fourth BPF 61 receives the output b3 of the framing unit 41 and produces a bandpass-limited output w3 in the frequency range of 3,000-4,000 Hz. A total of four auto-correlation calculators 62, 63, 64 and 65 are provided to receive and process the output signals t3, u3, v3 and w3 of BPFs 58, 59, 60 and 61, respectively. More specifically, the first to fourth auto-correlation calculators 62, 63, 64 and 65 calculate normalized auto-correlation functions of the input signals t3, u3, v3 and w3, respectively, at a delay amount corresponding to the fractional pitch period r3, and output calculated values x3, y3, z3 and a4, respectively.
A total of four voiced/unvoiced flag generators 66, 67, 68 and 69 are provided to generate voiced/unvoiced flags based on the values x3, y3, z3 and a4 produced from the first to fourth auto-correlation calculators 62, 63, 64 and 65, respectively. More specifically, the voiced/unvoiced flag generators 66, 67, 68 and 69 compare the input values x3, y3, z3 and a4 with a threshold (=0.6). The first voiced/unvoiced flag generator 66 judges that the corresponding frequency band is the unvoiced state when the value x3 is equal to or smaller than the threshold and otherwise judges that the corresponding frequency band is the voiced state. Based on this judgement, the first voiced/unvoiced flag generator 66 sends a voiced/unvoiced flag b4 of 1 bit to the correlation function corrector 53. The second voiced/unvoiced flag generator 67 judges that the corresponding frequency band is the unvoiced state when the value y3 is equal to or smaller than the threshold and otherwise judges that the corresponding frequency band is the voiced state. Based on this judgement, the second voiced/unvoiced flag generator 67 sends a voiced/unvoiced flag c4 of 1 bit to the correlation function corrector 53. The third voiced/unvoiced flag generator 68 judges that the corresponding frequency band is the unvoiced state when the value z3 is equal to or smaller than the threshold and otherwise judges that the corresponding frequency band is the voiced state. Based on this judgement, the third voiced/unvoiced flag generator 68 sends a voiced/unvoiced flag d4 of 1 bit to the correlation function corrector 53. The fourth voiced/unvoiced flag generator 69 judges that the corresponding frequency band is the unvoiced state when the value a4 is equal to or smaller than the threshold and otherwise judges that the corresponding frequency band is the voiced state. 
Based on this judgement, the fourth voiced/unvoiced flag generator 69 sends a voiced/unvoiced flag e4 of 1 bit to the correlation function corrector 53. The produced voiced/unvoiced flags b4, c4, d4 and e4 of respective frequency bands are used in a decoder to produce a mixed excitation.
The aperiodic flag generator 56 receives the maximum value o3 of the normalized auto-correlation function, and outputs an aperiodic flag p3 of 1 bit to the error-correction coding/bit packing unit 70. More specifically, the aperiodic flag p3 is set to ON when the maximum value o3 of the normalized auto-correlation function is smaller than a threshold (=0.5), and is set to OFF otherwise. The aperiodic flag p3 is used in the decoder to produce an aperiodic pulse expressing the excitation of the transitional period and the unvoiced plosives.
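The threshold decisions made by the voiced/unvoiced flag generators 66 to 69 and the aperiodic flag generator 56 reduce to simple comparisons, sketched below. The function names and the list representation of the four band values (x3, y3, z3, a4) are assumptions for illustration.

```python
VOICING_THRESHOLD = 0.6      # band voicing threshold from the text
APERIODIC_THRESHOLD = 0.5    # aperiodic flag threshold from the text


def band_voicing_flags(band_correlations):
    """Return 1 (voiced) or 0 (unvoiced) for each band correlation value.

    A band is unvoiced when its value is equal to or smaller than the
    threshold, and voiced otherwise.
    """
    return [1 if c > VOICING_THRESHOLD else 0 for c in band_correlations]


def aperiodic_flag(max_correlation):
    """Return True (ON) when the maximum correlation is below the threshold."""
    return max_correlation < APERIODIC_THRESHOLD


flags = band_voicing_flags([0.9, 0.6, 0.61, 0.1])   # x3, y3, z3, a4
aperiodic = aperiodic_flag(0.42)
```

Note the different boundary handling: a band value exactly equal to 0.6 is unvoiced, whereas a maximum correlation exactly equal to 0.5 leaves the aperiodic flag OFF.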
A first LPC analysis filter 51 is an all-zero filter with a coefficient equal to the 10th order linear prediction coefficient e3, which removes the spectrum envelope information from the input speech b3 and outputs a residual signal l3.
A peakiness calculator 52 receives the residual signal l3 to calculate a peakiness value and outputs a calculated peakiness value m3. The peakiness value is a parameter representing the probability that a signal may contain a peak-like pulse component (i.e., spike). The above reference [5] defines the peakiness by the following formula.
$$\text{peakiness value } \rho = \frac{\sqrt{\dfrac{1}{N}\sum_{n=1}^{N} e_n^{2}}}{\dfrac{1}{N}\sum_{n=1}^{N} \left| e_n \right|} \qquad (1)$$
where N represents the total number of samples in a single frame, and e_n represents the residual signal.
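The peakiness measure can be computed directly from the residual samples. The sketch below assumes the usual form of the measure, in which the numerator is the root of the mean squared residual and the denominator is the mean absolute residual; the function name `peakiness` and the handling of an all-zero frame are choices made for this example.

```python
import math


def peakiness(residual):
    """Ratio of the RMS value to the mean absolute value of the residual."""
    n = len(residual)
    if n == 0:
        raise ValueError("residual must not be empty")
    rms = math.sqrt(sum(e * e for e in residual) / n)
    mean_abs = sum(abs(e) for e in residual) / n
    # An all-zero frame has no defined ratio; return 0.0 by convention here.
    return rms / mean_abs if mean_abs > 0.0 else 0.0


flat = [1.0] * 180                   # constant magnitude: no spike
spike = [0.0] * 179 + [1.0]          # a single isolated spike
flat_value = peakiness(flat)
spike_value = peakiness(spike)
```

A constant-magnitude frame gives a value of 1.0, while a frame whose energy is concentrated in one sample gives the square root of the frame length, far above the thresholds of 1.34 and 1.6 used by the correlation function corrector 53.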
The numerator of the formula (1) is influenced by a large sample value more strongly than its denominator. Thus, the peakiness value ρ becomes large when the residual signal includes a large spike. Accordingly, when a concerned frame has a large peakiness value, there is a large possibility that this frame is a voiced frame with a jitter, which is often found in the transitional period, or a frame of unvoiced plosives. In general, the frame having unvoiced plosives is a signal having a locally appearing spike (i.e., a sharp peak) with the remaining white noise-like portion.
The correlation function corrector 53 receives the peakiness value m3 from the peakiness calculator 52 and corrects the maximum value o3 of the normalized auto-correlation function and the voiced/unvoiced flags b4 and c4 based on the peakiness value m3. The correlation function corrector 53 sets the maximum value o3 of the normalized auto-correlation function to 1.0 (=voiced state) when the peakiness value m3 is larger than 1.34. Furthermore, the correlation function corrector 53 sets the maximum value o3 of the normalized auto-correlation function to 1.0 (=voiced state) and sets the voiced/unvoiced flags b4 and c4 to the value indicating the voiced state when the peakiness value m3 is larger than 1.6. Although the voiced/unvoiced flags d4 and e4 are also input to the correlation function corrector 53, no correction is performed for the voiced/unvoiced flags d4 and e4. The correlation function corrector 53 outputs the corrected result as a corrected maximum value n3 of the normalized auto-correlation function, and outputs the corrected voiced/unvoiced flags b4 and c4 and the non-corrected voiced/unvoiced flags d4 and e4 as respective frequency bands' voicing information f4.
As described above, the voiced frame with a jitter or unvoiced plosives has a locally appearing spike (i.e., a sharp peak) with the remaining white noise-like portion. Thus, there is a large possibility that its normalized auto-correlation function becomes a value smaller than 0.5. In this case, the aperiodic flag is set to ON. Hence, if a voiced frame with a jitter or unvoiced plosives is detected based on the peakiness value, the normalized auto-correlation function can be corrected to 1.0. The frame will later be judged to be in the voiced state in the voiced/unvoiced judgement of the entire frequency range performed in the error-correction coding/bit packing unit 70. In the decoding operation, the sound quality of the voiced frame with a jitter or unvoiced plosives can be improved by using the aperiodic pulse excitation.
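The two-threshold correction rule can be written compactly as below. The function name `correct_voicing` and the list layout of the flags (b4, c4, d4, e4 in that order, with 1 for voiced) are assumptions made for this sketch.

```python
def correct_voicing(max_corr, band_flags, peakiness_value):
    """Apply the peakiness-based correction.

    max_corr        -- maximum of the normalized auto-correlation (o3)
    band_flags      -- [b4, c4, d4, e4], 1 = voiced, 0 = unvoiced
    peakiness_value -- peakiness of the residual (m3)
    Returns (corrected maximum n3, band voicing information f4).
    """
    corrected = max_corr
    flags = list(band_flags)         # copy so the input is not modified
    if peakiness_value > 1.34:
        corrected = 1.0              # force the frame to the voiced state
    if peakiness_value > 1.6:
        flags[0] = 1                 # b4 forced to voiced
        flags[1] = 1                 # c4 forced to voiced
    return corrected, flags          # d4 and e4 are never corrected


n3, f4 = correct_voicing(0.3, [0, 0, 0, 0], 1.7)
```

With a peakiness of 1.7, a weakly correlated frame is promoted to the voiced state and its two lowest band flags are set, while the two highest band flags are passed through unchanged.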
Next, the detection of harmonics information will be explained.
A linear prediction coefficient calculator 47 converts the quantized LSF coefficient g3 produced from the second quantizer 46 into a linear prediction coefficient, and outputs a quantized linear prediction coefficient h3. A second LPC analysis filter 48 removes the spectral envelope component from the input signal b3 by using a coefficient equal to the quantized linear prediction coefficient h3, and outputs a residual signal i3. A harmonics detector 49 detects the amplitude of 10th order harmonics (i.e., harmonic component of the basic pitch frequency) in the residual signal i3, and outputs a detected amplitude j3 of the 10th order harmonics. A fourth quantizer 50 quantizes the amplitude j3 of the 10th order harmonics to 8 bits by using the vector quantization. The fourth quantizer 50 sends a resulting index k3 to the error-correction coding/bit packing unit 70.
The harmonics amplitude information corresponds to the spectral envelope information remaining in the residual signal i3. Accordingly, by transmitting the harmonics amplitude information to the decoder, it becomes possible to accurately express the spectrum of the input signal in the decoding operation. The quality of nasal sound, the capability of discriminating a speaker, and the quality of vowel included in the wide band noise can be enhanced by accurately expressing the spectrum (refer to Table 2-⑤).
As described previously, the error-correction coding/bit packing unit 70 sets the unvoiced frame when the corrected maximum value n3 of the normalized auto-correlation function is equal to or smaller than the threshold (=0.6) and sets the voiced frame otherwise. The error-correction coding/bit packing unit 70 constitutes a speech information bit stream g4 according to the bit allocation shown in Table 3. The speech information bit stream g4 consists of 54 bits per frame. The produced speech information bit stream g4 is transmitted to a receiver via a modulator and a wireless device in the case of radio communication.
In Table 3, the pitch and overall voiced/unvoiced information is quantized to 7 bits. The quantization is performed in the following manner.
Among 7-bit codes (i.e., a total of 128 codewords), the codeword having 0 in all of the 7 bits and seven codewords having 1 in only one of the 7 bits are allocated to the unvoiced state. The codewords having 1 in exactly 2 of the 7 bits are allocated to erasure. Other codewords are used for the voiced state and allocated to the pitch period information (i.e., the output s3 of the third quantizer 57). Regarding the voicing information of respective frequency bands, 1 is allocated for the voiced state and 0 is allocated for the unvoiced state in each of respective outputs b4, c4, d4 and e4. A total of four bits representing the voicing information of respective frequency bands constitute the voicing information f4 to be transmitted. Furthermore, as understood from Table 3, when the concerned frame is the unvoiced frame, the error-correction code of 13 bits is transmitted, instead of transmitting the harmonics amplitude k3, the respective frequency bands' voicing information f4, and the aperiodic flag p3. In this case, the error correction is applied to the specific bits having an important role in the acoustic sense. Furthermore, a sync bit of 1 bit is added to each frame.
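The allocation of the 7-bit pitch and overall voicing codewords can be checked by counting set bits. The sketch assumes that every codeword with exactly two 1 bits is reserved for erasure; the function name `classify_melp_codeword` and the state labels are hypothetical, and the mapping from voiced codewords to quantizer levels is not modeled.

```python
def classify_melp_codeword(code):
    """Classify a 7-bit pitch/overall-voicing codeword by its number of 1 bits."""
    if not 0 <= code < 128:
        raise ValueError("codeword must fit in 7 bits")
    ones = bin(code).count("1")
    if ones <= 1:
        return "unvoiced"    # all zeros, or exactly one 1
    if ones == 2:
        return "erasure"     # assumed: all weight-2 codewords
    return "voiced"          # remaining codewords carry the pitch index


melp_counts = {}
for code in range(128):
    state = classify_melp_codeword(code)
    melp_counts[state] = melp_counts.get(state, 0) + 1
```

Under this assumption 8 codewords are unvoiced, 21 are erasures, and 99 remain for the voiced state, which matches the 99 levels used by the third quantizer 57.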
Next, a circuit arrangement of a MELP type speech decoder will be explained with reference to FIG. 22.
A bit separating/error correcting decoder 81 receives a speech information bit stream a5 consisting of 54 bits for each frame and obtains the pitch and overall voiced/unvoiced information. When the received frame is the unvoiced frame, the bit separating/error correcting decoder 81 applies the error correction decoding processing to the error protection bits. Furthermore, when the pitch and overall voiced/unvoiced information indicates the erasure, each parameter is replaced by the corresponding value of the previous frame. Then, the bit separating/error correcting decoder 81 outputs the separated information bits: i.e., pitch and overall voiced/unvoiced information b5; aperiodic flag d5; harmonics amplitude index e5; respective frequency bands' voicing information g5; LSF parameter index j5; and gain information m5. The respective frequency bands' voicing information g5 is a 5-bit flag representing the voicing information of respective sub-bands 0-500 Hz, 500-1,000 Hz, 1,000-2,000 Hz, 2,000-3,000 Hz, 3,000-4,000 Hz. The voicing information for the sub-band 0-500 Hz is the overall voiced/unvoiced information obtained from the pitch and overall voiced/unvoiced information.
A pitch decoder 82 decodes the pitch period when the pitch and overall voiced/unvoiced information indicates the voiced state, and sets 50.0 as the pitch period when the pitch and overall voiced/unvoiced information indicates the unvoiced state. The pitch decoder 82 outputs a decoded pitch period c5.
A jitter setter 102 receives the aperiodic flag d5 and outputs a jitter value g6 which is set to 0.25 when the aperiodic flag is ON and to 0 when the aperiodic flag is OFF. The jitter setter 102 produces the jitter value g6 of 0.25 when the above voiced/unvoiced information indicates the unvoiced state.
A harmonics decoder 83 decodes the harmonics amplitude index e5 and outputs a decoded 10th order harmonics amplitude f5.
A pulse excitation filter coefficient calculator 84 receives the respective frequency bands' voicing information g5 and calculates and outputs an FIR filter coefficient h5 which assigns 1.0 to the gain of each voiced sub-band and 0 to the gain of each unvoiced sub-band. A noise excitation filter coefficient calculator 85 receives the respective frequency bands' voicing information g5 and calculates and outputs an FIR filter coefficient i5 which assigns 0 to the gain of each voiced sub-band and 1.0 to the gain of each unvoiced sub-band.
An LSF decoder 87 decodes the LSF parameter index j5 and outputs a decoded 10th order LSF coefficient k5. A tilt correction coefficient calculator 86 calculates a tilt correction coefficient l5 based on the 10th order LSF coefficient k5 sent from the LSF decoder 87.
A gain decoder 88 decodes the gain information m5 and outputs a decoded gain n5.
A parameter interpolator 89 linearly interpolates each of the input parameters, i.e., pitch period c5, jitter value g6, 10th order harmonics amplitude f5, FIR filter coefficient h5, FIR filter coefficient i5, tilt correction coefficient l5, 10th order LSF coefficient k5, and gain n5, in synchronism with the pitch period. The parameter interpolator 89 outputs the interpolated outputs o5, p5, r5, s5, t5, u5, v5 and w5 corresponding to the respective input parameters. The linear interpolation processing is performed in accordance with the following formula:
interpolated parameter = current frame's parameter × int + previous frame's parameter × (1.0 − int)
In this formula, the above input parameters c5, g6, f5, h5, i5, l5, k5, and n5 are the current frame's parameters. The above output parameters o5, p5, r5, s5, t5, u5, v5 and w5 are the interpolated parameters. The previous frame's parameters are the stored values of the parameters c5, g6, f5, h5, i5, l5, k5, and n5 from the previous frame. Furthermore, “int” is an interpolation coefficient which is defined by the following formula:
int=t0/180
where 180 is the number of samples per speech decoding frame interval (22.5 ms), while “t0” is the start point of each pitch period in the decoded frame and is renewed by adding the pitch period each time the reproduced speech of one pitch period is decoded. When “t0” exceeds 180, the decoding processing of the decoded frame is complete. Thus, “t0” is re-initialized by subtracting 180 from it upon completion of the decoding processing of each frame.
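The pitch-synchronous interpolation described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, a fixed pitch period is used for the whole frame (the decoder actually re-interpolates the pitch for every period), and the frame is treated as finished once t0 reaches 180 or more.

```python
FRAME_LEN = 180  # samples per 22.5 ms decoding frame, as stated in the text


def interpolate(current, previous, t0):
    """Interpolated parameter at pitch-period start point t0 (0 <= t0 < 180)."""
    w = t0 / FRAME_LEN  # interpolation coefficient "int" of the text
    return current * w + previous * (1.0 - w)


def pitch_period_starts(t0, pitch_period):
    """Return the start points of the pitch periods decoded in one frame and
    the t0 carried into the next frame (hypothetical helper, fixed pitch)."""
    if pitch_period <= 0:
        raise ValueError("pitch_period must be positive")
    starts = []
    while t0 < FRAME_LEN:          # assumption: frame is done once t0 >= 180
        starts.append(t0)
        t0 += pitch_period         # advance by one decoded pitch period
    return starts, t0 - FRAME_LEN  # subtract 180 to re-initialize t0
```

With a pitch period of 50 samples and t0 = 0, four pitch periods start inside the frame and t0 = 20 is carried over to the next frame.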
A pitch period calculator 90 receives the interpolated pitch period o5 and the interpolated jitter value p5 and calculates a pitch period q5 according to the following formula:
pitch period q5 = pitch period o5 × (1.0 − jitter value p5 × random number)
where the random number falls within a range from −1.0 to 1.0.
According to the above formula, a significant jitter is added to the unvoiced or aperiodic frame because the jitter value 0.25 is set to the unvoiced or aperiodic frame. On the other hand, no jitter is added to the periodic frame because the jitter value 0 is set to the periodic frame. However, as the jitter value is interpolated for each pitch, the jitter value may be a value somewhere in a range from 0 to 0.25. This means that intermediate pitch sections may exist.
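A minimal sketch of the jitter setting and the jittered pitch formula is given below. The function names are hypothetical, and a uniform distribution is assumed for the random number because the text only states its range.

```python
import random


def jitter_value(voiced, aperiodic):
    """Jitter value per frame: 0.25 for unvoiced or aperiodic frames, else 0."""
    return 0.25 if (not voiced or aperiodic) else 0.0


def jittered_pitch(pitch, jitter, rng=random):
    """pitch * (1.0 - jitter * r), with r drawn from [-1.0, 1.0].

    Assumption: r is uniformly distributed; the text gives only the range.
    """
    r = rng.uniform(-1.0, 1.0)
    return pitch * (1.0 - jitter * r)
```

With a jitter value of 0.25 the resulting pitch period stays within ±25% of the interpolated pitch period, and with a jitter value of 0 it is unchanged.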
In this manner, generating the aperiodic pitch (i.e., jitter-added pitch) based on the aperiodic flag makes it possible to express an irregular (i.e., aperiodic) glottal pulse caused in the transitional period or unvoiced plosives. Thus, the tone noise can be reduced as shown in Table 2-②.
The pitch period q5, after being converted into an integer value, is supplied to a 1-pitch waveform decoder 101. The 1-pitch waveform decoder 101 decodes and outputs a reproduced speech f6 for every pitch period q5. Accordingly, all of blocks included in the 1-pitch waveform decoder 101 operate in synchronism with the pitch period q5.
A pulse excitation generator 91 receives the interpolated harmonics amplitude r5 and generates a pulse excitation x5 with a single pulse to which the harmonics information is added. Only one pulse excitation x5 is generated during one pitch period q5. A pulse filter 92 is an FIR filter with a coefficient equal to the interpolated pulse filter coefficient s5. The pulse filter 92 applies a filtering operation to the pulse excitation x5 so as to make only the voiced sub bands effective, and outputs the filtered pulse excitation y5. A noise generator 94 generates the white noise a6. A noise filter 93 is an FIR filter with a coefficient equal to the interpolated noise filter coefficient t5. The noise filter 93 applies a filtering operation to the noise excitation a6 so as to make only the unvoiced sub bands effective, and outputs the filtered noise excitation z5.
A mixed excitation generator 95 sums the filtered pulse excitation y5 and the filtered noise excitation z5 to generate a mixed excitation b6. The mixed excitation makes it possible to reduce the buzz sound because the voiced/unvoiced judgement can be made for each of the frequency bands, as shown in Table 2-①.
A linear prediction coefficient calculator 98 calculates a linear prediction coefficient h6 based on the interpolated 10th order LSF coefficient v5. An adaptive spectral enhancement filter 96 is an adaptive pole/zero filter with a coefficient obtained by applying the bandwidth expansion processing to the linear prediction coefficient h6. As shown in Table 2-③, this enhances the naturalness of the reproduced speech by sharpening the formant resonance and also by improving the similarity to the formant of the natural speech.
Furthermore, the adaptive spectral enhancement filter 96 corrects the tilt of the spectrum based on the interpolated tilt correction coefficient u5 so as to reduce the lowpass muffling effect, and outputs a resulting excitation signal c6.
An LPC synthesis filter 97 is an all-pole filter with a coefficient equal to the linear prediction coefficient h6. The LPC synthesis filter 97 adds the spectral envelope information to the excitation signal c6 produced from the adaptive spectral enhancement filter 96, and outputs a resulting signal d6. A gain adjuster 99 applies the gain adjustment to the output signal d6 of the LPC synthesis filter 97 by using the gain information w5, and outputs a gain-adjusted signal e6. A pulse dispersion filter 100 is a filter for improving the similarity of the pulse excitation waveform with respect to the glottal pulse waveform of the natural speech. The pulse dispersion filter 100 filters the output signal e6 of the gain adjuster 99 and outputs the reproduced speech f6 having improved naturalness. The effect of the pulse dispersion filter 100 is shown in Table 2-④.
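The all-pole synthesis step can be illustrated by a direct-form recursion. The sketch below is not taken from the reference; it assumes the sign convention A(z) = 1 + a1·z⁻¹ + ... + ap·z⁻ᵖ, and the function name and the explicit state argument are hypothetical.

```python
def lpc_synthesis(excitation, a, state=None):
    """Filter `excitation` through 1/A(z), A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p.

    Returns (output samples, filter state). The state holds the last p output
    samples (most recent first) so that filtering can continue across
    consecutive pitch periods.
    """
    p = len(a)
    hist = list(state) if state is not None else [0.0] * p
    out = []
    for x in excitation:
        # y[n] = x[n] - sum_k a[k] * y[n-1-k]  (assumed sign convention)
        y = x - sum(a[k] * hist[k] for k in range(p))
        out.append(y)
        hist = ([y] + hist)[:p]
    return out, hist
```

For a first-order filter with a = [-0.5], a unit impulse produces the decaying sequence 1, 0.5, 0.25, 0.125, which is the expected impulse response of 1/(1 − 0.5·z⁻¹).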
As described above, when compared with the LPC system, the MELP system can provide a reproduced speech excellent in naturalness and also in intelligibility at the same bit rate (2.4 kbps).
Furthermore, to solve the above-described problem “A” of the LPC system, the above reference [6] proposes a decoder for a linear prediction analysis/synthesis system which does not require transmission of the voicing information of respective frequency bands used in the MELP system.
More specifically, the reference [6] proposes the decoder for a proposed linear prediction analysis/synthesis system which comprises a separating circuit which receives a digital speech signal having been analysis encoded by a linear prediction analysis/synthesis encoder. Furthermore, the separating circuit separates the parameters of linear prediction coefficient, voiced/unvoiced discrimination signal, excitation strength information, and pitch period information from the digital speech signal. A pitch pulse generator generates a pitch pulse controlled by the pitch period information. A noise generator generates the white noise. A synthesis filter outputs a speech signal decoded in accordance with the linear prediction coefficient using a mixed excitation of the pitch pulse generated from the pitch pulse generator and the white noise generated from the noise generator.
In this decoder for the linear prediction analysis/synthesis system, a processing control circuit is provided to receive the linear prediction coefficient, the voiced/unvoiced discrimination signal, and the excitation strength information from the separating circuit. The processing control circuit obtains a spectral envelope on the frequency axis based on formant synthesizing of the voiced sound, and then compares the obtained spectral envelope with a predetermined threshold. Then, the processing control circuit outputs a pitch component function signal representing the frequency region where the level of the spectral envelope is larger than the threshold and also outputs a noise component function signal representing the frequency region where the level of the spectral envelope is smaller than the threshold. Furthermore, a first output control circuit multiplies the pitch component function signal with the output of the pitch pulse generator to generate a pitch pulse of a frequency region larger than the threshold. A second output control circuit multiplies the noise component function signal with the white noise of the white noise generator to generate the white noise of a frequency region smaller than the threshold. An adder is provided to add the output of the first output control circuit and the output of the second output control circuit to generate an excitation signal for the synthesis filter.
However, the above-described decoder for the proposed linear prediction analysis/synthesis system causes a problem in that the reproduced speech has noise-like sound quality (the reason will be described later), although it can reduce the problem of buzz sound caused in the above-described LPC system.
The rapid spread of mobile communications is creating a strong demand for greater user capacity. In other words, more effective use of the limited frequency resource is a goal to be attained. In particular, lowering the bit rate of the speech coding system is a key technique for solving this problem.
Accordingly, the present invention has an object to provide a speech coding and decoding method and apparatus capable of solving the above-described problems “A” and “B” of the LPC system at a bit rate lower than 2.4 kbps.
Furthermore, the present invention has another object to provide a speech coding and decoding method and apparatus capable of providing effects comparable to those of the MELP system without transmitting the respective frequency bands' voicing information or the aperiodic flag.
To accomplish this and other related objects, the present invention provides a first speech decoding method for reproducing a speech signal from a speech information bit stream which is a coded output of the speech signal encoded by a linear prediction analysis and synthesis type speech encoder. The first speech decoding method comprises the steps of separating spectral envelope information, voiced/unvoiced discriminating information, pitch period information and gain information from the speech information bit stream and decoding each separated information, and generating a reproduced speech by summing the spectral envelope information and the gain information to a resultant excitation signal. When the voiced/unvoiced discriminating information indicates a voiced state, a spectral envelope value on a frequency axis is compared with a predetermined threshold to identify a voiced region which is a frequency region where the spectral envelope value is larger than or equal to the predetermined threshold and also to identify an unvoiced region which is a remaining frequency region. The spectral envelope value is calculated based on the spectral envelope information. A pitch pulse generated based on the pitch period information is used as a voiced regional excitation signal, and a mixed signal of the pitch pulse and a white noise mixed at a predetermined ratio is used as an unvoiced regional excitation signal. The above resultant excitation signal is formed by summing the voiced regional excitation signal and the unvoiced regional excitation signal. When the voiced/unvoiced discriminating information indicates an unvoiced state, the above resultant excitation signal is formed based on the white noise.
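The identification of voiced and unvoiced frequency regions from the spectral envelope can be sketched as follows. This is an illustration only: it assumes the envelope is evaluated as |1/A(e^{jω})| from linear prediction coefficients with A(z) = 1 + a1·z⁻¹ + ... + ap·z⁻ᵖ, and the function names, the number of frequency bins, and the threshold value are hypothetical.

```python
import cmath
import math


def spectral_envelope(a, n_bins):
    """|1/A(e^{jw})| at n_bins frequencies evenly spaced over [0, pi)."""
    env = []
    for i in range(n_bins):
        w = math.pi * i / n_bins
        A = 1.0 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
        env.append(1.0 / abs(A))
    return env


def voiced_region(env, threshold):
    """True where the envelope is larger than or equal to the threshold
    (voiced region); False marks the remaining (unvoiced) region."""
    return [e >= threshold for e in env]
```

For example, a one-pole envelope with a = [-0.5] is largest at low frequencies, so with a threshold of 1.0 the lower bins fall in the voiced region and the upper bins in the unvoiced region.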
With this method, it becomes possible to solve the above-described problem “A” of the LPC system without transmitting the additional information bits.
Furthermore, the present invention provides a second speech decoding method for reproducing a speech signal from a speech information bit stream which is a coded output of the speech signal encoded by a linear prediction analysis and synthesis type speech encoder. The second speech decoding method comprises a step of separating spectral envelope information, voiced/unvoiced discriminating information, pitch period information and gain information from the speech information bit stream and decoding each separated information, a step of setting voicing strength information to 1.0 when the voiced/unvoiced discriminating information indicates a voiced state and to 0 when the voiced/unvoiced discriminating information indicates an unvoiced state, a step of linearly interpolating the spectral envelope information, the pitch period information, the gain information, and the voicing strength information in synchronism with a pitch period, a step of forming a first mixed excitation signal by mixing a pitch pulse and a white noise at a ratio corresponding to the interpolated voicing strength information, the pitch pulse being produced based on the interpolated pitch period information, a step of comparing a spectral envelope value on a frequency axis with a predetermined threshold to identify a voiced region which is a frequency region where the spectral envelope value is larger than or equal to the predetermined threshold and also to identify an unvoiced region which is a remaining frequency region, the spectral envelope value being calculated based on the interpolated spectral envelope information, a step of using the first mixed excitation signal as a voiced regional excitation signal, and using a mixed signal of the first mixed excitation signal and a white noise mixed at a predetermined ratio as an unvoiced regional excitation signal, a step of forming a second mixed excitation signal by summing the voiced regional excitation signal and the unvoiced regional excitation 
signal, and a step of generating a reproduced speech by summing the interpolated spectral envelope information and the interpolated gain information to the second mixed excitation signal.
With this method, it becomes possible to solve the above-described problem “A” of the LPC system without transmitting the additional information bits.
Furthermore, the present invention provides a first speech coding method for obtaining voiced/unvoiced discriminating information, pitch period information and aperiodic pitch information from an input speech signal, the aperiodic flag indicating whether the pitch is a periodic pitch or an aperiodic pitch, and the input speech signal being a sampled signal divided into a speech coding frame having a predetermined time interval. The first speech coding method comprises a step of quantizing the pitch period information with a first predetermined level number to produce periodic pitch information in a speech coding frame where the aperiodic flag indicates a periodic pitch, a step of allocating a quantized level in accordance with each occurrence frequency with respect to respective pitch ranges and performing a quantization with a second predetermined level number to produce aperiodic pitch information in a speech coding frame where the aperiodic flag indicates an aperiodic pitch, a step of allocating a single codeword to a condition where the voiced/unvoiced discriminating information indicates an unvoiced state, a step of allocating a predetermined number of codewords corresponding to the first predetermined level number to the periodic pitch information while allocating a predetermined number of codewords corresponding to the second predetermined level number to the aperiodic pitch information in a condition where the voiced/unvoiced discriminating information indicates a voiced state, and a step of encoding the allocated single codeword or codewords into a codeword having a predetermined bit number.
Preferably, the predetermined bit number of the codeword is 7 bits. A codeword having 0 (or 1) in all of the 7 bits is allocated to the condition where the voiced/unvoiced discriminating information indicates an unvoiced state. A codeword having 0 (or 1) in 1 or 2 bits of the 7 bits is allocated to the aperiodic pitch information. And the periodic pitch information is allocated to other codewords.
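The preferred 7-bit codeword allocation can be illustrated by classifying codewords according to the number of set bits. The sketch assumes one of the two polarities permitted by the text (the all-zero codeword is unvoiced, and codewords with one or two bits set to 1 are aperiodic); the function name is hypothetical, and the mapping of individual codewords to quantized pitch levels is not shown because it is not specified here.

```python
def classify_codeword(code):
    """Classify a 7-bit codeword by its number of 1 bits (assumed polarity)."""
    if not 0 <= code < 128:
        raise ValueError("codeword must fit in 7 bits")
    ones = bin(code).count("1")
    if ones == 0:
        return "unvoiced"    # single all-zero codeword
    if ones <= 2:
        return "aperiodic"   # 1 or 2 bits set
    return "periodic"        # all remaining codewords


# Count how many codewords fall into each class.
counts = {"unvoiced": 0, "aperiodic": 0, "periodic": 0}
for c in range(128):
    counts[classify_codeword(c)] += 1
```

Under this reading there are 1 unvoiced codeword, 7 + 21 = 28 aperiodic codewords, and 128 − 1 − 28 = 99 periodic codewords, which bounds the second and first predetermined level numbers at 28 and 99 respectively.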
With this method, it becomes possible to solve the above-described problem “B” of the LPC system without transmitting the additional information bits.
Furthermore, it becomes possible to realize a low-bit rate speech coding.
Furthermore, the present invention provides a speech coding and decoding method comprising the above-described first speech coding method and either of the above-described first and second speech decoding methods.
With this method, it becomes possible to solve the above-described problems “A” and “B” of the LPC system without transmitting the additional information bits.
Furthermore, the present invention provides a first speech coding apparatus, according to which a framing unit receives a quantized speech sample which is sampled at a predetermined sampling frequency and outputs a predetermined number of speech samples for each speech coding frame having a predetermined time interval. A gain calculator calculates a logarithm of an RMS value and outputs a resulting logarithmic RMS value. The RMS value serves as level information for one frame of speech sample. A first quantizer linearly quantizes the logarithmic RMS value and outputs a resulting quantized logarithmic RMS value. A linear prediction analyzer applies a linear prediction analysis to the one frame of speech sample and outputs a linear prediction coefficient of a predetermined order which serves as spectral envelope information. An LSF coefficient calculator converts the linear prediction coefficient into an LSF (i.e., Line Spectrum Frequencies) coefficient and outputs the LSF coefficient. A second quantizer quantizes the LSF coefficient and outputs a resulting quantized value as an LSF parameter index. A low pass filter filters the one frame of speech sample with a predetermined cutoff frequency and outputs a bandpass-limited input signal. A pitch detector obtains a pitch period from the bandpass-limited input signal based on calculation of a normalized auto-correlation function and outputs the pitch period and a maximum value of the normalized auto-correlation function. A third quantizer linearly quantizes the pitch period, after having been converted into a logarithmic value, with a first predetermined level number and outputs a resulting quantized value as a pitch period index. An aperiodic flag generator receives the maximum value of the normalized auto-correlation function and outputs an aperiodic flag being set to ON when the maximum value is smaller than a predetermined value and being set to OFF otherwise. 
An LPC analysis filter removes the spectral envelope information from the one frame of speech sample by using a coefficient equal to the linear prediction coefficient, and outputs a filtered result as a residual signal. A peakiness calculator receives the residual signal, calculates a peakiness value based on the residual signal, and outputs the calculated peakiness value. A correlation function corrector corrects the maximum value of the normalized auto-correlation function based on the peakiness value of the peakiness calculator and outputs a corrected maximum value of the normalized auto-correlation function. A voiced/unvoiced identifier generates a voiced/unvoiced flag which represents an unvoiced state when the corrected maximum value of the normalized auto-correlation function is equal to or smaller than a predetermined value and represents a voiced state otherwise. An aperiodic pitch index generator applies a nonuniform quantization with a second predetermined level number to the pitch period of a frame being aperiodic according to the aperiodic flag, and outputs an aperiodic pitch index. A periodic/aperiodic pitch and voiced/unvoiced information code generator receives the voiced/unvoiced flag, the aperiodic flag, the pitch period index, and the aperiodic pitch index and outputs a periodic/aperiodic pitch and voiced/unvoiced information code of a predetermined bit number by coding the voiced/unvoiced flag, the aperiodic flag, the pitch period index, and the aperiodic pitch index. And, a bit packing unit receives the quantized logarithmic RMS value, the LSF parameter index, and the periodic/aperiodic pitch and voiced/unvoiced information code, and outputs a speech information bit stream by performing a bit packing for each frame.
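The pitch detection and aperiodic flag generation of the coding apparatus can be sketched as below. The exact normalization of the auto-correlation function, the lag search range, and the flag threshold are not given in this passage, so the energy-normalized form and all numeric values used here are assumptions, and the function names are hypothetical.

```python
import math


def normalized_autocorr(x, lag):
    """Energy-normalized auto-correlation of x at the given lag (assumed form)."""
    n = len(x) - lag
    num = sum(x[i] * x[i + lag] for i in range(n))
    e0 = sum(x[i] * x[i] for i in range(n))
    e1 = sum(x[i + lag] * x[i + lag] for i in range(n))
    den = math.sqrt(e0 * e1)
    return num / den if den > 0.0 else 0.0


def detect_pitch(x, min_lag, max_lag):
    """Return (pitch period, maximum normalized auto-correlation)."""
    best_lag, best_r = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        r = normalized_autocorr(x, lag)
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


def aperiodic_flag(max_corr, threshold):
    """ON (True) when the maximum is smaller than the threshold, else OFF."""
    return max_corr < threshold
```

For a sinusoid with a period of 40 samples and a search range of 20 to 70 samples, the detector returns a pitch period of 40 with a maximum close to 1.0, so the aperiodic flag stays OFF for any threshold below that maximum.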
Furthermore, the present invention provides a first speech decoding apparatus, according to which a bit separator separates the speech information bit stream of each frame produced by a speech coding apparatus in accordance with respective parameters, and outputs a periodic/aperiodic pitch and voiced/unvoiced information code, a quantized logarithmic RMS value, and an LSF parameter index. A voiced/unvoiced information and pitch period decoder receives the periodic/aperiodic pitch and voiced/unvoiced information code and outputs a pitch period and a voicing strength, in such a manner that the pitch period is set to a predetermined value and the voicing strength is set to 0 when a current frame is in an unvoiced state, while the pitch period is decoded in accordance with a coding regulation for the pitch period and the voicing strength is set to 1.0 when the current frame is in either a periodic state or aperiodic state. A jitter setter receives the periodic/aperiodic pitch and voiced/unvoiced information code and outputs a jitter value which is set to a predetermined value when the current frame is in the unvoiced state or in the aperiodic state and is set to 0 when the current frame is in the periodic state. An LSF decoder decodes the LSF coefficient of a predetermined order from the LSF parameter index and outputs a decoded LSF coefficient. A tilt correction coefficient calculator calculates a tilt correction coefficient from the decoded LSF coefficient, and outputs a calculated tilt correction coefficient. A gain decoder decodes the quantized logarithmic RMS value and outputs a gain. 
A parameter interpolator linearly interpolates each of the pitch period, the voicing strength, the jitter value, the LSF coefficient, the tilt correction coefficient, and the gain in synchronism with the pitch period, and outputs an interpolated pitch period, an interpolated voicing strength, an interpolated jitter value, an interpolated LSF coefficient, an interpolated tilt correction coefficient, and an interpolated gain. A pitch period calculator receives the interpolated pitch period and the interpolated jitter value to add jitter to the interpolated pitch period, and outputs a pitch period (hereinafter, referred to as integer pitch period) converted into an integer value. And, a 1-pitch waveform decoder decodes a reproduced speech corresponding to the integer pitch period in synchronism with the integer pitch period. According to this 1-pitch waveform decoder, a single pulse generator generates a single pulse signal within a duration of the integer pitch period. A noise generator generates a white noise having an interval equivalent to the integer pitch period. A first mixed excitation generator synthesizes the single pulse signal and the white noise based on the interpolated voicing strength to output a first mixed excitation signal. A linear prediction coefficient calculator calculates a linear prediction coefficient based on the interpolated LSF coefficient. A spectral envelope shape calculator obtains spectral envelope shape information of the reproduced speech based on the linear prediction coefficient, and outputs the obtained spectral envelope shape information. A mixed excitation filtering unit compares a value of the spectral envelope shape information with a predetermined threshold to identify a voiced region which is a frequency region where the value of the spectral envelope shape information is larger than or equal to the predetermined threshold and also to identify an unvoiced region which is a remaining frequency region. 
Then, the mixed excitation filtering unit outputs a first DFT coefficient string and a second DFT coefficient string. The first DFT coefficient string includes 0 values corresponding to the unvoiced region among DFT coefficients of the first mixed excitation information, while the second DFT coefficient string includes 0 values corresponding to the voiced region among the DFT coefficients of the first mixed excitation information. A noise excitation filtering unit outputs a DFT coefficient string including 0 values corresponding to the voiced region among DFT coefficients of the white noise. A second mixed excitation generator mixes the second DFT coefficient string of the mixed excitation filtering unit and the DFT coefficient string of the noise excitation filtering unit at a predetermined ratio, and outputs a resulting DFT coefficient string. A third mixed excitation generator sums the DFT coefficient string produced from the second mixed excitation generator and the first DFT coefficient string produced from the mixed excitation filtering unit, and applies an inverse Discrete Fourier transform to the summed-up DFT coefficient string to output an obtained result as a mixed excitation signal. A switcher receives the interpolated voicing strength to select the white noise when the interpolated voicing strength is 0 and also to select the mixed excitation signal produced from the third mixed excitation generator when the interpolated voicing strength is not 0, and outputs the selected one as a mixed excitation signal. An adaptive spectral enhancement filter outputs an excitation signal having an improved spectrum as a result of a filtering of the mixed excitation signal. 
The adaptive spectral enhancement filter is a cascade connection of an adaptive pole/zero filter with a coefficient obtained by applying the bandwidth expansion processing to the linear prediction coefficient and a spectral tilt correcting filter with a coefficient equal to the interpolated tilt correction coefficient. An LPC synthesis filter adds spectral envelope information to an excitation signal improved in the spectrum and outputs a signal accompanied with the spectral envelope information. The LPC synthesis filter is an all-pole filter using a coefficient equal to the linear prediction coefficient. A gain adjuster applies gain adjustment to the signal accompanied with the spectral envelope information by using the gain and outputs a reproduced speech signal. And, a pulse dispersion filter applies pulse dispersion processing to the reproduced speech signal, and outputs a pulse dispersion processed reproduced speech signal.
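The DFT-domain processing of the mixed excitation filtering unit, the noise excitation filtering unit, and the second and third mixed excitation generators can be sketched as follows. This is a simplified illustration: a direct O(N²) DFT is used, the voiced/unvoiced decision is given for bins 0 to N/2 and mirrored to keep the output real, and the weighting (1 − ratio, ratio) for the predetermined mixing ratio is an assumption. All names are hypothetical.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft_real(X):
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]


def mix_by_region(first_mixed, noise, voiced_half, noise_ratio):
    """voiced_half[k] is True for voiced bins k = 0 .. N/2 (mirrored above N/2)."""
    n = len(first_mixed)
    E, W = dft(first_mixed), dft(noise)
    v = [voiced_half[min(k, n - k)] for k in range(n)]
    first = [E[k] if v[k] else 0.0 for k in range(n)]      # zeros in unvoiced region
    second = [0.0 if v[k] else E[k] for k in range(n)]     # zeros in voiced region
    noise_u = [0.0 if v[k] else W[k] for k in range(n)]    # noise, unvoiced region only
    # Second mixed excitation generator (assumed weighting of the fixed ratio).
    mixed_u = [(1.0 - noise_ratio) * second[k] + noise_ratio * noise_u[k]
               for k in range(n)]
    # Third mixed excitation generator: sum and inverse DFT.
    return idft_real([first[k] + mixed_u[k] for k in range(n)])
```

When every bin is voiced the output equals the first mixed excitation, and when every bin is unvoiced with a ratio of 1.0 the output equals the white noise, which gives two simple consistency checks for the sketch.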
Moreover, the present invention provides a third speech decoding method for reproducing a speech signal from a speech information bit stream which is a coded output of the speech signal encoded by a linear prediction analysis and synthesis type speech encoder. The third speech decoding method comprises a step of separating spectral envelope information, voiced/unvoiced discriminating information, pitch period information and gain information from the speech information bit stream and decoding each separated information, a step of obtaining a spectral envelope amplitude from the spectral envelope information, and identifying a frequency band having a largest spectral envelope amplitude among a plurality of frequency bands divided on a frequency axis, a step of determining a mixing ratio for each of the plurality of frequency bands based on the identified frequency band and the voiced/unvoiced discriminating information, the mixing ratio being used in mixing a pitch pulse generated in response to the pitch period information and white noise, a step of producing a mixing signal for each of the plurality of frequency bands based on the determined mixing ratio, and then producing a mixed excitation signal by summing all of the mixing signals of the plurality of frequency bands, and a step of producing a reproduced speech by adding the spectral envelope information and the gain information to the mixed excitation signal.
With this method, it becomes possible to solve the above-described problem “A” of the LPC system without transmitting the additional information bits.
Furthermore, the present invention provides a fourth speech decoding method for reproducing a speech signal from a speech information bit stream, including spectral envelope information, low-frequency band voiced/unvoiced discriminating information, high-frequency band voiced/unvoiced discriminating information, pitch period information and gain information, which is a coded output of the speech signal encoded by a linear prediction analysis and synthesis type speech encoder. The fourth speech decoding method comprises a step of separating the spectral envelope information, low-frequency band voiced/unvoiced discriminating information, high-frequency band voiced/unvoiced discriminating information, pitch period information and gain information from the speech information bit stream and decoding each separated information, a step of determining a mixing ratio of the low-frequency band based on the low-frequency band voiced/unvoiced discriminating information, the mixing ratio being used in mixing a pitch pulse generated in response to the pitch period information and white noise for the low-frequency band, and producing a mixing signal for the low-frequency band, a step of obtaining a spectral envelope amplitude from the spectral envelope information, and identifying a frequency band having a largest spectral envelope amplitude among a plurality of high-frequency bands divided on a frequency axis, a step of determining a mixing ratio for each of the plurality of high-frequency bands based on the identified frequency band and the high-frequency band voiced/unvoiced discriminating information, the mixing ratio being used in mixing a pitch pulse generated in response to the pitch period information and white noise for each of the high-frequency bands, and producing a mixing signal of each of the plurality of high-frequency bands, and then producing a mixing signal for the high-frequency band corresponding to a summation of all of the mixing signals of the plurality of 
high-frequency bands, a step of producing a mixed excitation signal by summing the mixing signal for the low-frequency band and the mixing signal for the high-frequency band, and a step of producing a reproduced speech by adding the spectral envelope information and the gain information to the mixed excitation signal.
With this method, it becomes possible to solve the above-described problem “A” of the LPC system and improve the sound quality of the reproduced speech.
Furthermore, the present invention provides a fifth speech decoding method for reproducing a speech signal from a speech information bit stream, including spectral envelope information, low-frequency band voiced/unvoiced discriminating information, high-frequency band voiced/unvoiced discriminating information, pitch period information and gain information, which is a coded output of the speech signal encoded by a linear prediction analysis and synthesis type speech encoder. The fifth speech decoding method comprises:
a step of separating each of the spectral envelope information, the low-frequency band voiced/unvoiced discriminating information, the high-frequency band voiced/unvoiced discriminating information, the pitch period information and the gain information from the speech information bit stream and decoding each separated information;
a step of determining a mixing ratio of the low-frequency band based on the low-frequency band voiced/unvoiced discriminating information, the mixing ratio being used in mixing, for the low-frequency band, white noise and a pitch pulse generated in response to the pitch period information being linearly interpolated in synchronism with the pitch period;
a step of obtaining a spectral envelope amplitude from the spectral envelope information, and identifying a frequency band having a largest spectral envelope amplitude among a plurality of high-frequency bands divided on a frequency axis;
a step of determining a mixing ratio for each of the plurality of high-frequency bands based on the identified frequency band and the high-frequency band voiced/unvoiced discriminating information, the mixing ratio being used in mixing, for each of the plurality of high-frequency bands, white noise and a pitch pulse generated in response to the pitch period information being linearly interpolated in synchronism with the pitch period;
a step of linearly interpolating the spectral envelope information, the pitch period information, the gain information, the mixing ratio of the low-frequency band, and the mixing ratio of each of the plurality of high-frequency bands, in synchronism with the pitch period;
a step of producing a mixing signal for the low-frequency band by mixing the pitch pulse and the white noise with reference to the interpolated mixing ratio of the low-frequency band;
a step of producing a mixing signal of each of the plurality of high-frequency bands by mixing the pitch pulse and the white noise with reference to the interpolated mixing ratio for each of the plurality of high-frequency bands, and then producing a mixing signal for the high-frequency band corresponding to a summation of all of the mixing signals of the plurality of high-frequency bands;
a step of producing a mixed excitation signal by summing the mixing signal for the low-frequency band and the mixing signal for the high-frequency band; and
a step of producing a reproduced speech by adding the interpolated spectral envelope information and the interpolated gain information to the mixed excitation signal.
With this method, it becomes possible to solve the above-described problem "A" of the LPC system and improve the sound quality of the reproduced speech.
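The band identification and mixing steps of the fifth decoding method can be illustrated with a minimal Python sketch. It is not the patented implementation. It assumes that the pitch pulse and the white noise have already been band-limited for each band, and that a mixing ratio v weights the pulse by v and the noise by 1 - v. The function names peak_band, mix_band and mixed_excitation are hypothetical.

```python
def peak_band(band_amplitudes):
    """Index of the high-frequency band with the largest spectral envelope amplitude."""
    return max(range(len(band_amplitudes)), key=lambda i: band_amplitudes[i])


def mix_band(pulse, noise, voicing):
    """Mix a band-limited pitch pulse and band-limited white noise.

    Assumption: the mixing ratio `voicing` (0.0 to 1.0) weights the pulse,
    and the remainder (1 - voicing) weights the noise.
    """
    return [voicing * p + (1.0 - voicing) * n for p, n in zip(pulse, noise)]


def mixed_excitation(low_mix, high_mixes):
    """Sum the high-band mixing signals, then add the low-band mixing signal."""
    high_sum = [sum(samples) for samples in zip(*high_mixes)]
    return [lo + hi for lo, hi in zip(low_mix, high_sum)]
```

For example, with three high-frequency bands, mixed_excitation receives one low-band signal and a list of three high-band signals of the same length.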
Preferably, the plurality of high-frequency bands are separated into three frequency bands. When the high-frequency band voiced/unvoiced discriminating information indicates a voiced state, the mixing ratio of each of the three high-frequency bands is determined in the following manner: when the spectral envelope amplitude is maximized in the first or second lowest frequency band, the ratio of pitch pulse (hereinafter referred to as "voicing strength") decreases monotonically with increasing frequency across the three high-frequency bands; and when the spectral envelope amplitude is maximized in the highest frequency band, the voicing strength for the second lowest frequency band is smaller than the voicing strength for the first lowest frequency band, while the voicing strength for the highest frequency band is larger than the voicing strength for the second lowest frequency band.
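The ordering rule for the voiced state can be expressed as a small lookup table and a check. The following Python sketch is only an illustration. The numeric voicing strengths are hypothetical values chosen to satisfy the rule, not values taken from this specification, and the monotonic decrease is read here as a strict decrease, which is an assumption.

```python
# Hypothetical voicing strengths (ratio of pitch pulse) for the three
# high-frequency bands, ordered from lowest to highest frequency.
# Key: index of the band in which the spectral envelope amplitude peaks.
VOICED_TABLE = {
    0: (0.9, 0.6, 0.4),
    1: (0.8, 0.7, 0.4),
    2: (0.8, 0.5, 0.6),
}


def high_band_voicing(peak_band):
    """Return the voicing strengths of the three bands for a voiced frame."""
    return VOICED_TABLE[peak_band]


def follows_rule(table):
    """Check the ordering rule described in the text for a voiced-state table."""
    # Peak in the first or second lowest band: strength falls with frequency.
    for peak in (0, 1):
        b0, b1, b2 = table[peak]
        if not (b0 > b1 > b2):
            return False
    # Peak in the highest band: the middle band is the weakest of the three.
    b0, b1, b2 = table[2]
    return b1 < b0 and b2 > b1
```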
Preferably, the plurality of high-frequency bands are separated into three frequency bands. When the high-frequency band voiced/unvoiced discriminating information indicates a voiced state, the mixing ratio of each of the three high-frequency bands is determined in such a manner that the voicing strength of any one of the three frequency bands, in a case where the spectral envelope amplitude is maximized in that band, is larger than the voicing strength of the same band in a case where the spectral envelope amplitude is maximized in either of the other two frequency bands.
Preferably, the plurality of high-frequency bands are separated into three frequency bands. When the high-frequency band voiced/unvoiced discriminating information indicates an unvoiced state, the mixing ratio of each of the three high-frequency bands is determined in such a manner that the voicing strength of any one of the three frequency bands, in a case where the spectral envelope amplitude is maximized in that band, is smaller than the voicing strength of the same band in a case where the spectral envelope amplitude is maximized in either of the other two frequency bands.
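The two preferences above compare, for each band, the voicing strength when the spectral envelope peaks in that band against the strength when it peaks in another band. A minimal Python sketch of that comparison follows. The tables VOICED and UNVOICED hold hypothetical values chosen only to satisfy the stated conditions, and the function name own_band_dominates is likewise an assumption.

```python
# table[peak][band]: voicing strength of `band` when the spectral envelope
# amplitude is maximized in band `peak`. All values are hypothetical.
VOICED = [
    (0.9, 0.6, 0.4),
    (0.8, 0.7, 0.4),
    (0.8, 0.5, 0.6),
]
UNVOICED = [
    (0.1, 0.3, 0.3),
    (0.3, 0.1, 0.3),
    (0.3, 0.3, 0.1),
]


def own_band_dominates(table, larger=True):
    """Check the per-band condition for every band.

    larger=True  (voiced state): strength with the peak in the band itself
                 exceeds the strength with the peak in any other band.
    larger=False (unvoiced state): it is below the strength with the peak
                 in any other band.
    """
    bands = range(len(table))
    for band in bands:
        own = table[band][band]
        for peak in bands:
            if peak == band:
                continue
            other = table[peak][band]
            if larger and not own > other:
                return False
            if not larger and not own < other:
                return False
    return True
```

In the voiced table the band holding the spectral peak receives more pitch pulse, and in the unvoiced table it receives more noise.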
Furthermore, the present invention provides a second speech coding apparatus, according to which a framing unit receives a quantized speech sample which is sampled at a predetermined sampling frequency and outputs a predetermined number of speech samples for each speech coding frame having a predetermined time interval. A gain calculator calculates a logarithm of an RMS value and outputs a resulting logarithmic RMS value. The RMS value serves as level information for one frame of speech sample. A first quantizer linearly quantizes the logarithmic RMS value and outputs a resulting quantized logarithmic RMS value. A linear prediction analyzer applies a linear prediction analysis to the one frame of speech sample and outputs a linear prediction coefficient of a predetermined order which serves as spectral envelope information. An LSF coefficient calculator converts the linear prediction coefficient into an LSF (i.e., Line Spectrum Frequencies) coefficient and outputs the LSF coefficient. A second quantizer quantizes the LSF coefficient and outputs a resulting quantized value as an LSF parameter index. A low pass filter filters the one frame of speech sample with a predetermined cutoff frequency and outputs a low frequency band input signal. A pitch detector obtains a pitch period from the low frequency band input signal based on calculation of a normalized auto-correlation function and outputs the pitch period and a maximum value of the normalized auto-correlation function. A third quantizer linearly quantizes the pitch period, after having been converted into a logarithmic value, with a first predetermined level number and outputs a resulting quantized value as a pitch period index. An aperiodic flag generator receives the maximum value of the normalized auto-correlation function and outputs an aperiodic flag being set to ON when the maximum value is smaller than a predetermined value and being set to OFF otherwise. 
An LPC analysis filter removes the spectral envelope information from the one frame of speech sample by using a coefficient equal to the linear prediction coefficient, and outputs a filtered result as a residual signal. A peakiness calculator receives the residual signal, calculates a peakiness value based on the residual signal, and outputs the calculated peakiness value. A correlation function corrector corrects the maximum value of the normalized auto-correlation function based on the peakiness value of the peakiness calculator and outputs a corrected maximum value of the normalized auto-correlation function. A first voiced/unvoiced identifier generates a voiced/unvoiced flag which represents an unvoiced state when the corrected maximum value of the normalized auto-correlation function is equal to or smaller than a predetermined value and represents a voiced state otherwise. An aperiodic pitch index generator applies a nonuniform quantization with a second predetermined level number to the pitch period of a frame being aperiodic according to the aperiodic flag and outputs an aperiodic pitch index. A periodic/aperiodic pitch and voiced/unvoiced information code generator receives the voiced/unvoiced flag, the aperiodic flag, the pitch period index, and the aperiodic pitch index and outputs a periodic/aperiodic pitch and voiced/unvoiced information code of a predetermined bit number by coding the voiced/unvoiced flag, the aperiodic flag, the pitch period index, and the aperiodic pitch index. A high pass filter filters the one frame of speech sample with a predetermined cutoff frequency and outputs a high frequency band input signal. A correlation function calculator calculates a normalized auto-correlation function at a delay amount corresponding to the pitch period based on the high frequency band input signal. 
A second voiced/unvoiced identifier generates a high-frequency band voiced/unvoiced flag which represents an unvoiced state when a maximum value of the normalized auto-correlation function generated from the correlation function calculator is equal to or smaller than a predetermined value and represents a voiced state otherwise. Finally, a bit packing unit receives the quantized logarithmic RMS value, the LSF parameter index, the periodic/aperiodic pitch and voiced/unvoiced information code, and the high-frequency band voiced/unvoiced flag, and outputs a speech information bit stream by performing a bit packing for each frame.
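Several of the analysis blocks of the second speech coding apparatus can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the patented implementation: the logarithm base (10), the aperiodic threshold (0.5), the lag range and the level number are hypothetical. The peakiness is computed as the ratio of the root-mean-square value to the mean absolute value of the residual, which is the definition commonly used in MELP-type coders and is assumed here.

```python
import math


def log_rms(frame):
    """Logarithm (base 10 assumed) of the RMS value of one frame."""
    rms = math.sqrt(sum(x * x for x in frame) / len(frame))
    return math.log10(max(rms, 1e-10))  # floor avoids log(0) on silence


def detect_pitch(signal, min_lag, max_lag):
    """Return (lag, r_max) maximizing the normalized auto-correlation."""
    best_lag, best_r = min_lag, -1.0
    for lag in range(min_lag, max_lag + 1):
        a, b = signal[:-lag], signal[lag:]
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
        r = num / den if den > 0.0 else 0.0
        if r > best_r:
            best_lag, best_r = lag, r
    return best_lag, best_r


def aperiodic_flag(r_max, threshold=0.5):
    """ON (True) when the maximum correlation is below the threshold."""
    return r_max < threshold


def quantize_log_pitch(pitch, min_pitch, max_pitch, levels):
    """Uniformly quantize the logarithm of the pitch period to an index."""
    lo, hi = math.log10(min_pitch), math.log10(max_pitch)
    step = (hi - lo) / (levels - 1)
    index = round((math.log10(pitch) - lo) / step)
    return min(max(index, 0), levels - 1)


def peakiness(residual):
    """Ratio of RMS value to mean absolute value of the residual (assumed)."""
    n = len(residual)
    l2 = math.sqrt(sum(e * e for e in residual) / n)
    l1 = sum(abs(e) for e in residual) / n
    return l2 / l1 if l1 > 0.0 else 0.0
```

A residual containing a few isolated pulses yields a high peakiness value, which is why the value is useful for correcting the correlation of frames that have a weak periodicity but a pulse-like excitation.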
Furthermore, the present invention provides a second speech decoding apparatus for decoding the speech information bit stream of each frame encoded by a speech coding apparatus. The second speech decoding apparatus comprises a bit separator that separates the speech information bit stream into respective parameters, and outputs a periodic/aperiodic pitch and voiced/unvoiced information code, a quantized logarithmic RMS value, an LSF parameter index, and a high-frequency band voiced/unvoiced flag. A voiced/unvoiced information and pitch period decoder receives the periodic/aperiodic pitch and voiced/unvoiced information code and outputs a pitch period and a voiced/unvoiced flag, in such a manner that the pitch period is set to a predetermined value and the voiced/unvoiced flag is set to 0 when a current frame is in an unvoiced state, while the pitch period is decoded in accordance with a coding regulation for the pitch period and the voiced/unvoiced flag is set to 1.0 when the current frame is in either a periodic state or an aperiodic state. A jitter setter receives the periodic/aperiodic pitch and voiced/unvoiced information code and outputs a jitter value which is set to a predetermined value when the current frame is in the unvoiced state or the aperiodic state and is set to 0 when the current frame is in the periodic state. An LSF decoder decodes a predetermined order of LSF coefficient from the LSF parameter index and outputs a decoded LSF coefficient. A tilt correction coefficient calculator calculates a tilt correction coefficient from the decoded LSF coefficient, and outputs a calculated tilt correction coefficient. A gain decoder decodes the quantized logarithmic RMS value and outputs a decoded gain. A first linear prediction coefficient calculator converts the decoded LSF coefficient into a linear prediction coefficient and outputs the resulting linear prediction coefficient. 
A spectral envelope amplitude calculator calculates a spectral envelope amplitude based on the linear prediction coefficient produced from the first linear prediction coefficient calculator. A pulse excitation/noise excitation mixing ratio calculator receives the voiced/unvoiced flag, the high-frequency band voiced/unvoiced flag, and the spectral envelope amplitude, and outputs determined mixing ratio information used in mixing a pulse excitation and white noise for each of a plurality of frequency bands (hereinafter, referred to as sub-bands) divided on a frequency axis. A parameter interpolator linearly interpolates each of the pitch period, the mixing ratio information, the jitter value, the LSF coefficient, the tilt correction coefficient, and the gain in synchronism with the pitch period, and outputs an interpolated pitch period, an interpolated mixing ratio information, an interpolated jitter value, an interpolated LSF coefficient, an interpolated tilt correction coefficient, and an interpolated gain. A pitch period calculator receives the interpolated pitch period and the interpolated jitter value to add jitter to the interpolated pitch period, and outputs a pitch period (hereinafter, referred to as integer pitch period) converted into an integer value. And, a 1-pitch waveform decoder decodes a reproduced speech corresponding to the integer pitch period in synchronism with the integer pitch period. According to this 1-pitch waveform decoder, a single pulse generator generates a single pulse signal within a duration of the integer pitch period. A noise generator generates a white noise having an interval equivalent to the integer pitch period. A mixed excitation generator mixes the single pulse signal and the white noise for each sub-band based on the interpolated mixing ratio information, and then synthesizes a mixed excitation signal equivalent to a summation of all of the produced mixing signals of the sub-bands. 
A second linear prediction coefficient calculator calculates a linear prediction coefficient based on the interpolated LSF coefficient. An adaptive spectral enhancement filter outputs an excitation signal having an improved spectrum as a result of a filtering of the mixed excitation signal. The adaptive spectral enhancement filter is a cascade connection of an adaptive pole/zero filter with a coefficient obtained by applying the bandwidth expansion processing to the linear prediction coefficient and a spectral tilt correcting filter with a coefficient equal to the interpolated tilt correction coefficient. An LPC synthesis filter adds spectral envelope information to an excitation signal improved in the spectrum and outputs a signal accompanied with the spectral envelope information. The LPC synthesis filter is an all-pole filter with a coefficient equal to the linear prediction coefficient. A gain adjuster applies gain adjustment to the signal accompanied with the spectral envelope information by using the gain and outputs a reproduced speech signal. And, a pulse dispersion filter applies pulse dispersion processing to the reproduced speech signal and outputs a pulse dispersion processed reproduced speech signal.
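A few of the decoder blocks described above can be sketched in Python as follows. The sketch is illustrative only and rests on several assumptions that this section does not state: the interpolation factor is taken as the position of the current pitch cycle divided by the frame length, the jitter is applied as a uniformly distributed multiplicative perturbation, bandwidth expansion is taken as scaling the k-th coefficient by the k-th power of a factor gamma, and the prediction polynomial is written as A(z) = 1 + sum of a_k z^-k. All function and parameter names are hypothetical.

```python
import random


def interpolate(prev, cur, factor):
    """Linear interpolation between the previous and current frame values."""
    return prev + (cur - prev) * factor


def integer_pitch(prev_pitch, cur_pitch, prev_jitter, cur_jitter,
                  position, frame_len, rng):
    """Interpolate pitch and jitter, add jitter, and round to an integer."""
    factor = position / frame_len  # assumed pitch-synchronous weighting
    pitch = interpolate(prev_pitch, cur_pitch, factor)
    jitter = interpolate(prev_jitter, cur_jitter, factor)
    # Assumed jitter model: multiplicative, uniform in [-jitter, +jitter].
    jittered = pitch * (1.0 + jitter * rng.uniform(-1.0, 1.0))
    return max(1, round(jittered))


def bandwidth_expand(lpc, gamma):
    """Scale the k-th linear prediction coefficient by gamma**k (k from 1)."""
    return [a * gamma ** (k + 1) for k, a in enumerate(lpc)]


def lpc_synthesis(excitation, lpc):
    """All-pole filter 1/A(z), assuming A(z) = 1 + sum_k lpc[k-1] * z**-k."""
    out = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out
```

With a jitter value of 0, as set for a periodic frame, integer_pitch reduces to plain rounding of the interpolated pitch period.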