1. Field of the Invention
The present invention relates to a speech synthesizing apparatus, and more particularly to a post-filter for a speech synthesizing apparatus that is capable of reproducing sounds other than voice without deterioration.
2. Description of the Related Art
The inventors of the present invention know of a speech synthesizing apparatus for reproducing compressed or coded speech that utilizes a post-filter to enhance the quality of the synthesized speech. This post-filter shapes noise by exploiting the auditory masking characteristic of human hearing. The post-filter is normally used in a speech synthesizing apparatus that utilizes a coding method such as code-excited linear prediction (referred to as CELP).
Noise shaping refers to processing the spectral shape of the error signal between the synthesized speech and the original speech so that it resembles the spectral shape of the original speech. This widens the energy difference between the original speech and the noise in the valleys of the spectrum and, through the masking characteristic, narrows the range over which the noise is audible.
The post-filter is normally located immediately after a decoder provided in the speech synthesizing apparatus.
In general, the post-filter has a transfer function H(z) represented by the following expression:
$$H(z) = \frac{P'(z)}{P''(z)}$$
wherein 1/P(z) is the transfer function of the spectrum envelope synthesizing filter used in the decoder. The denominator P(z) is variously called a short-term filter, a spectrum envelope prediction filter or an inverse filter (herein referred to as a reverse filter). P(z) may be represented by the following expression:
$$P(z) = 1 - \sum_{i=1}^{p} \alpha_i z^{-i}$$
wherein $\alpha_i$ is the i-th order linear prediction coefficient, i being a positive integer, and the prediction order is represented by a positive integer p. Both P'(z) and P"(z) are versions of the reverse filter P(z) in which the bandwidth of each spectral peak (formant) is expanded. P'(z) has a more expanded bandwidth than P"(z).
The filter serves to emphasize the formants of the synthesized speech output from the decoder. Hence, the energy of the error spectrum is concentrated at the formants, so that the shape of the error spectrum comes closer to the shape of the spectrum of the original speech.
In general, P'(z) and P"(z) are represented by the following expressions, respectively:
$$P'(z) = P(z/\eta) = 1 - \sum_{i=1}^{p} \alpha_i \eta^{i} z^{-i}$$
$$P''(z) = P(z/v) = 1 - \sum_{i=1}^{p} \alpha_i v^{i} z^{-i}, \qquad 0 < \eta < v < 1$$
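The two bandwidth-expanded polynomials can be turned into a direct-form filter in a few lines. The following Python sketch is only an illustration of the expressions above, not the implementation of the cited paper; the function names and the default values eta=0.5 and v=0.8 are assumptions chosen for the example.

```python
def bandwidth_expand(alpha, factor):
    """Return alpha_i * factor**i for i = 1..p, i.e. the coefficients of P(z/factor)."""
    return [a * factor ** (i + 1) for i, a in enumerate(alpha)]


def post_filter(x, alpha, eta=0.5, v=0.8):
    """Filter x through H(z) = P(z/eta) / P(z/v), with P(z) = 1 - sum(alpha_i z^-i).

    eta and v are illustrative values only; the text requires 0 < eta < v < 1.
    """
    if not 0.0 < eta < v < 1.0:
        raise ValueError("need 0 < eta < v < 1")
    num = bandwidth_expand(alpha, eta)  # coefficients of P'(z)
    den = bandwidth_expand(alpha, v)    # coefficients of P"(z)
    p = len(alpha)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(1, p + 1):
            if n - i >= 0:
                acc -= num[i - 1] * x[n - i]  # numerator (feed-forward) part
                acc += den[i - 1] * y[n - i]  # denominator (feedback) part
        y.append(acc)
    return y


if __name__ == "__main__":
    # Impulse response for a hypothetical first-order predictor alpha_1 = 0.9.
    print(post_filter([1.0, 0.0, 0.0, 0.0], [0.9]))
```

Because the filter state is not carried between calls, this sketch processes one block in isolation; a decoder would keep the past input and output samples across subframes.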
These relational expressions are described in J. H. Chen, A. Gersho, "Real-Time Vector APC Speech Coding at 4800 bps with Adaptive Postfiltering", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 51.3.1-51.3.4, April, 1987.
The decoding method implemented in the speech synthesizing apparatus having the post-filter receives linear prediction coefficients at fixed time intervals (each interval normally being referred to as a frame). In some cases, it interpolates the received linear prediction coefficients for each of the sections into which a frame is divided (referred to as subframes) and synthesizes the speech by using the interpolated linear prediction coefficients.
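The per-subframe interpolation can be pictured with a short Python sketch. The text does not say which interpolation rule is used, so plain linear interpolation between the previous and the current frame is assumed here, and the function name `interpolate_lpc` is hypothetical. Interpolating prediction coefficients directly does not by itself guarantee a stable synthesis filter, so this is an illustration of the idea only.

```python
def interpolate_lpc(prev_alpha, curr_alpha, num_subframes):
    """Return one coefficient set per subframe, moving linearly from the
    previous frame's coefficients to the current frame's coefficients.

    Assumption: linear weights k / num_subframes for k = 1..num_subframes,
    so the last subframe uses the current frame's coefficients unchanged.
    """
    if len(prev_alpha) != len(curr_alpha):
        raise ValueError("coefficient sets must have the same order")
    if num_subframes < 1:
        raise ValueError("num_subframes must be at least 1")
    sets = []
    for k in range(1, num_subframes + 1):
        w = k / num_subframes  # weight given to the current frame
        sets.append([(1.0 - w) * a0 + w * a1
                     for a0, a1 in zip(prev_alpha, curr_alpha)])
    return sets


if __name__ == "__main__":
    for subframe in interpolate_lpc([0.0, 1.0], [1.0, 0.0], 4):
        print(subframe)
```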
The coefficients of the post-filter are derived from the interpolated linear prediction coefficients, so the gain of the post-filter changes depending on the linear prediction coefficients.
The foregoing post-filter includes an automatic gain control function that restores the energy of the synthesized speech, which is amplified or attenuated by this gain, to the energy the synthesized speech had before it passed through the post-filter. The automatic gain control function will be referred to as an AGC function.
Next, a method of implementing the AGC function will be described. This method is described in I. A. Gerson, M. A. Jasiuk, "Vector Sum Excited Linear Prediction (VSELP) Speech Coding at 8 kbps", Proc. IEEE Int. Conf. on Acoustics, Speech and Signal Processing, pp. 461-464, April, 1990.
This method derives a scaling factor S and multiplies the signal immediately after the post-filter by the scaling factor S. First, the energy of the signal before the post-filter and the energy of the signal after the post-filter are obtained in the subframe or the frame. Then, a temporary scaling factor S' is obtained as the square root of the ratio of the energy before the post-filter to the energy after the post-filter in the subframe (frame).
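The computation of the temporary scaling factor can be sketched as follows. This is a minimal Python illustration that assumes S' is the square root of the energy ratio, since that is the amplitude gain that returns the post-filtered signal to its pre-filter energy; the function names and the handling of an all-zero output are assumptions, not details from the cited reference.

```python
import math


def energy(samples):
    """Sum of squared samples over one subframe (or frame)."""
    return sum(s * s for s in samples)


def temporary_scaling_factor(before, after):
    """Return S' = sqrt(E_before / E_after) for one subframe.

    `before` is the signal entering the post-filter, `after` the signal
    leaving it. Assumption: if the output is silent, return 1.0 so the
    signal is left unscaled.
    """
    e_after = energy(after)
    if e_after == 0.0:
        return 1.0
    return math.sqrt(energy(before) / e_after)


if __name__ == "__main__":
    # The post-filter doubled the amplitude, so S' should halve it again.
    print(temporary_scaling_factor([1.0, -1.0, 1.0, -1.0],
                                   [2.0, -2.0, 2.0, -2.0]))
```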
If the temporary scaling factor S' is used directly in the AGC, it may vary greatly from one subframe (frame) to the next. The synthesized speech then becomes discontinuous at the border between adjacent subframes (frames), and the discontinuity produces noise at the break in the synthesized speech. To avoid this shortcoming, the temporary scaling factor S' is passed through a first-order low-pass filter so that the scaling factor changes gradually. This relation is represented by the following expression:
$$S(n) = \zeta S(n-1) + (1 - \zeta) S', \qquad 0 < \zeta < 1, \quad n = 0, 1, \ldots, N-1$$
wherein n (a non-negative integer) represents a sampling time point within a subframe (frame), N (a positive integer) represents the number of samples within a subframe (frame), and S(-1) on the right side, which is needed when S(0) is obtained, is S(N-1) of the previous subframe (previous frame). To suppress abrupt variation of the scaling factor S(n), the constant $\zeta$ normally takes a value close to 1.
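The smoothing step can be sketched in Python as a first-order recursion, S(n) = zeta * S(n-1) + (1 - zeta) * S', applied sample by sample. The function names and the default zeta=0.99 are assumptions made for illustration; the text only states that the constant is close to 1.

```python
def smooth_scaling_factor(s_prev, s_temp, num_samples, zeta=0.99):
    """Return [S(0), ..., S(N-1)] for one subframe.

    s_prev is S(-1), i.e. S(N-1) of the previous subframe; s_temp is the
    temporary scaling factor S'. zeta=0.99 is an illustrative value.
    """
    if not 0.0 < zeta < 1.0:
        raise ValueError("need 0 < zeta < 1")
    s = s_prev
    gains = []
    for _ in range(num_samples):
        s = zeta * s + (1.0 - zeta) * s_temp  # first-order low-pass update
        gains.append(s)
    return gains


def apply_agc(filtered, s_prev, s_temp, zeta=0.99):
    """Scale the post-filtered subframe by S(n) and return (output, S(N-1))."""
    gains = smooth_scaling_factor(s_prev, s_temp, len(filtered), zeta)
    scaled = [g * y for g, y in zip(gains, filtered)]
    last = gains[-1] if gains else s_prev
    return scaled, last


if __name__ == "__main__":
    # S' drops from 1.0 to 0.25; watch how slowly S(n) follows over 40 samples.
    gains = smooth_scaling_factor(1.0, 0.25, 40)
    print(gains[0], gains[-1])
```

The second return value of `apply_agc` is meant to be passed as `s_prev` for the next subframe, which mirrors the role of S(-1) in the expression.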
In various kinds of telephone services, a melody is played onto the phone line while a call is on hold, and a dual tone multi-frequency signal (referred to as DTMF) is used when dialing. When a phone includes a speech synthesizing apparatus implemented according to the VSELP coding method and provided, on the reproducing side, with a post-filter having the AGC function, a tone signal such as a melody is reproduced as well as speech.
The foregoing speech synthesizing apparatus, however, may yield greatly varying linear prediction coefficients at a change point of a tone or at a leading edge after silence, so that the gain of the post-filter changes greatly. In such a case, the post-filter may increase the amplitude of the tone signal from the start point of the subframe (frame), in which case the temporary scaling factor S' becomes far smaller than that of the previous subframe (frame). While n is small, however, the actual scaling factor S(n) still differs greatly from the temporary scaling factor S'. Hence, the scaling factor S(n) cannot sufficiently suppress the increased amplitude of the tone signal.
The above-described shortcoming will be more concretely described with reference to FIGS. 1a to 1d.
FIG. 1a shows a synthesized tone signal immediately before it passes through the post-filter of the speech synthesizing apparatus. FIGS. 1b and 1c show the synthesized tone signal immediately after it passes through the post-filter: the waveform of FIG. 1b is the signal before the AGC is applied, and the waveform of FIG. 1c is the signal after the AGC is applied. FIG. 1d shows the scaling factor S(n) and the temporary scaling factor S' of the AGC in FIG. 1c. When the post-filter abruptly increases the amplitude of the synthesized tone signal, as shown in FIG. 1b in comparison with FIG. 1a, the temporary scaling factor S' differs greatly from the scaling factor S(0) at the starting point n=0 of the subframe or the frame, as shown in FIG. 1d, so that the scaling factor S(n) needs a considerably long time to approach the temporary scaling factor S'. The AGC, therefore, cannot suppress the increased amplitude shown in FIG. 1b, and the amplitude changes greatly as shown in FIG. 1c.
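The slow approach of S(n) to S' can be quantified with a short derivation. The following is a sketch that assumes the first-order smoothing recursion S(n) = ζS(n-1) + (1-ζ)S', with S(-1) carried over from the previous subframe; the numerical value ζ = 0.99 is purely illustrative and is not taken from the cited references.

```latex
\begin{align*}
S(n) - S' &= \zeta\,\bigl(S(n-1) - S'\bigr) \\
          &= \zeta^{\,n+1}\,\bigl(S(-1) - S'\bigr), \qquad n = 0, 1, \ldots, N-1 .
\end{align*}
% Illustrative assumption: \zeta = 0.99.
% The gap between S(n) and S' is halved once \zeta^{n+1} \le 1/2, i.e.
\[
n + 1 \;\ge\; \frac{\ln 2}{-\ln 0.99} \approx 69 .
\]
```

Under this assumption, about 69 samples pass before the gap between S(n) and S' is even halved. The closer ζ is to 1, the longer S(n) stays near the value inherited from the previous subframe, which is the behavior illustrated in FIG. 1d.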
The increased amplitude of the synthesized signal may exceed the range in which the amplitude value can be D/A converted. When it exceeds the range, a loud "pop" sound is produced. Even if it stays within the range, the waveform of the synthesized signal differs greatly from that of the original sound, which degrades the quality of the synthesized signal.