Research on low bit rate coding is applied primarily in commercial satellite communication and secure military communication. Three major speech coding standards, FS1015 LPC-10e, INMARSAT-M MBE, and FS1016 CELP, have recently been set at bit rates of 2400, 4150, and 4800 bps, respectively.
The Sinusoidal Transform Coder (STC) was proposed by Quatieri and McAulay, researchers at MIT. Because the speech waveform is periodic and the speech spectrum has a high density of peaks, the STC synthesizes the speech signal from multiple sine-wave excitation filters and compares the synthesized signal with the original input signal to determine the frequency, amplitude, and phase of each individual sine wave. Further details can be found in T. F. Quatieri, R. J. McAulay, "Speech Transformations Based on a Sinusoidal Representation", IEEE Trans. on Acoust., Speech, and Signal Process., 1986.
A low bit rate vocoder cannot be achieved by directly quantizing the parameters of the sine waves. Instead, the frequencies of the sine waves are regarded as a set of harmonic frequencies. To maintain phase continuity between frames, the phase parameters are derived from the phase response of the vocal tract filter, obtained under the minimum-phase assumption, together with the synchronized onset time of the excitation. Further, the sine-wave amplitudes are modeled with a cepstral or all-pole model to reduce the number of parameters. This reduces the parameter bits while still allowing the synthesized signal to approximate the original speech signal, so coding at a low bit rate of 2.4 kbps can be achieved.
The sine wave amplitude coding is represented by the following formula (1): ##EQU1## where $A_s$ denotes the amplitude, $\omega_s$ the frequency, and $\phi_s$ the phase.
The basic sine-wave analysis-by-synthesis framework is described as follows. The analysis of the STC is based on the speech production model shown in FIG. 1. Further details can be found in L. Rabiner, "Digital Processing of Speech", Prentice-Hall, Englewood Cliffs, N.J., 1978. In FIG. 1, the oscillation of the excitation can be represented by ##EQU2## Let $H_g(\omega)$ and $H_v(\omega)$ denote the glottal and vocal tract responses, respectively. The system function $H_s(\omega)$ is then given by equation (2):
$$H_s(\omega) = H_g(\omega) H_v(\omega) = A_s(\omega) \exp[j \phi_s(\omega)] \qquad (2)$$
Consequently, the speech waveform of each analysis frame can be denoted by ##EQU3## The speech signal can thus be decomposed into a plurality of sine waves. Conversely, sine waves with these frequencies, phases, and amplitudes can be combined to approximate the original speech signal.
FIG. 2 shows the sinusoidal analysis-synthesis module. First, the speech is passed through a Hamming window 200 to obtain a frame for analysis. The frame is then transformed from the time domain to the frequency domain by a discrete Fourier transform (DFT) 210, which is well suited to short-time frequency analysis. Next, the frequencies and amplitudes are found at the peaks of the speech amplitude response by peak picking on the absolute value of the DFT output. The phases are obtained by taking the arc tangent ($\tan^{-1}$) 220 of the output of the DFT 210 at each peak. In synthesis, the phases and frequencies undergo frame-to-frame phase unwrapping and interpolation together with frame-to-frame birth-death matching of frequency peaks and interpolation 250, which yields the phase $\theta(n)$ of the frame. The amplitudes undergo frame-to-frame linear interpolation 255, which maintains continuity between neighboring frames and yields the amplitude $A(n)$. The phase $\theta(n)$ and the amplitude $A(n)$ are then fed to a sine wave generator 260, and all of the sine waves are summed 280 to form the synthesized speech output for each frame.
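The analysis half of FIG. 2 can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used by the STC: the function names, the naive DFT, the amplitude scaling, and the plain local-maximum test used for peak picking are all choices made here for brevity.

```python
import cmath
import math

def hamming(n_samples):
    """Hamming window of length n_samples."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_samples - 1))
            for n in range(n_samples)]

def dft(frame):
    """Naive DFT returning bins 0..N/2, which is enough for a real frame."""
    n_samples = len(frame)
    return [sum(x * cmath.exp(-2j * math.pi * k * n / n_samples)
                for n, x in enumerate(frame))
            for k in range(n_samples // 2 + 1)]

def pick_peaks(frame, sample_rate):
    """Return (frequency_hz, amplitude, phase) at each local maximum of |DFT|."""
    window = hamming(len(frame))
    spectrum = dft([x * w for x, w in zip(frame, window)])
    magnitude = [abs(c) for c in spectrum]
    # Assumed scaling: undo the window gain so a unit sine gives amplitude ~1.
    scale = 2.0 / sum(window)
    peaks = []
    for k in range(1, len(magnitude) - 1):
        if magnitude[k] > magnitude[k - 1] and magnitude[k] >= magnitude[k + 1]:
            freq = k * sample_rate / len(frame)
            # Phase from the arc tangent of imaginary over real part.
            phase = math.atan2(spectrum[k].imag, spectrum[k].real)
            peaks.append((freq, magnitude[k] * scale, phase))
    return peaks
```

Because every local maximum is returned, window sidelobes also show up as small peaks; a practical coder would presumably apply an amplitude threshold or keep only the strongest peaks.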
However, directly coding the amplitude, phase, and frequency of each sine wave cannot meet the demands of low bit rate coding. What is required is a model of the phase, amplitude, and frequency that uses fewer parameters for coding.
The model for the sine-wave phase is described below. The STC constructs a sine-wave phase model in order to reduce the number of bits used to code the phase. The phase is divided into an excitation phase and a glottal and vocal tract phase response. Further, the phase residual of the voicing-dependent model is adjusted in accordance with the voicing probability.
The excitation phase can be obtained from the onset time of the excitation, which can be estimated from the pitch. The glottal and vocal tract phases can be calculated from the cepstral parameters under the minimum-phase assumption. Thus, only the voicing probability ($P_v$), which must be known to obtain the phase residual, needs to be coded. The voicing probability occupies about 3 bits.
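The two deterministic phase components can be sketched in Python as follows. This is a minimal sketch under assumptions not stated in the text: the cepstral coefficients c[0..M] are taken to satisfy log|H(w)| = c[0] + 2 * sum(c[m] * cos(m*w)), for which the standard minimum-phase result gives the phase as -2 * sum(c[m] * sin(m*w)); the excitation phase is taken to be linear in frequency with slope set by the onset time; and the voicing-dependent residual is omitted. The function and parameter names are hypothetical.

```python
import math

def minimum_phase_response(cepstrum, omega):
    """Log-magnitude and phase at radian frequency omega.

    Assumes real cepstral coefficients c[0..M] with
    log|H(w)| = c[0] + 2 * sum_{m>=1} c[m] * cos(m * w).
    """
    log_mag = cepstrum[0] + 2.0 * sum(
        c * math.cos(m * omega) for m, c in enumerate(cepstrum) if m > 0)
    phase = -2.0 * sum(
        c * math.sin(m * omega) for m, c in enumerate(cepstrum) if m > 0)
    return log_mag, phase

def harmonic_phase(k, omega0, onset_time, cepstrum):
    """Excitation phase plus system phase for harmonic k (residual omitted)."""
    omega = k * omega0
    _, system_phase = minimum_phase_response(cepstrum, omega)
    # Assumed excitation model: a pulse at onset_time (in samples)
    # contributes a phase that is linear in frequency.
    return -onset_time * omega + system_phase
```

For a one-pole minimum-phase filter 1/(1 - a z^-1) with |a| < 1, these assumptions give c[m] = a^m / (2m), which provides a convenient check of both outputs against the filter's directly computed response.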
In the model for the sine-wave frequency, all of the sine-wave frequencies are regarded as harmonics of the fundamental frequency $\omega_0$, and the sine waves can be represented as follows: ##EQU4##
Thus, all of the frequencies of the sine waves can be obtained by coding only the pitch. The pitch occupies about 7 bits.
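As an illustration of how a single pitch value determines every sine-wave frequency, the following Python sketch lists the harmonics up to the Nyquist frequency and sums them into one frame. The function names, the sample rate argument, and the use of constant amplitudes and phases within the frame are assumptions made for this sketch; frame-to-frame interpolation is left out.

```python
import math

def harmonic_frequencies(pitch_hz, sample_rate):
    """All integer multiples of the pitch up to the Nyquist frequency, in Hz."""
    count = int((sample_rate / 2) // pitch_hz)
    return [k * pitch_hz for k in range(1, count + 1)]

def synthesize_frame(pitch_hz, amplitudes, phases, sample_rate, n_samples):
    """Sum of harmonics: s(n) = sum_k A_k * cos(2*pi*k*f0*n/fs + phi_k)."""
    freqs = harmonic_frequencies(pitch_hz, sample_rate)
    frame = []
    for n in range(n_samples):
        # zip stops at the shortest list, so fewer amplitudes than
        # harmonics simply means the upper harmonics are not synthesized.
        frame.append(sum(a * math.cos(2 * math.pi * f * n / sample_rate + p)
                         for f, a, p in zip(freqs, amplitudes, phases)))
    return frame
```

For example, a 100 Hz pitch at an assumed 8000 Hz sample rate yields 40 harmonic frequencies, from 100 Hz to 4000 Hz, none of which has to be transmitted individually.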
If the speech signal is synthesized directly from the fundamental frequency and its harmonics, the synthesized signal sounds disharmonious. One prior art approach to this issue is described in R. J. McAulay, T. F. Quatieri, "Pitch Estimation and Voicing Detection Based on a Sinusoidal Model", Proc. of IEEE Intl. Conf. on Acoust., Speech, and Signal Processing, Albuquerque, pp. 249-252, 1990. The method is summarized as follows.
step 1. Define the cutoff frequency ($\omega_c$) in accordance with the voicing probability ($P_v$): $\omega_c(P_v) = \pi P_v$.
step 2. Define the maximum sampling interval ($\omega_u$) for the noise; $\omega_u$ is about 100 Hz.
step 3. Sampling:
A. If $\omega_0$ is lower than $\omega_u$, the entire frequency spectrum is sampled at intervals of $\omega_0$.
B. Otherwise, the voiced region below $\omega_c$ is sampled at intervals of $\omega_0$, and the noise region above $\omega_c$ is sampled at intervals of $\omega_u$. ##EQU5## where $k^*$ is the largest integer satisfying $k^* \omega_0 \leq \omega_c(P_v)$.
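The three steps above can be sketched in Python as follows. Since the sampling formula itself is not reproduced in the text, the sketch assumes that above the cutoff the frequencies continue from the last voiced harmonic in steps of the noise interval. The function name, the 8000 Hz sample rate, and the use of Hz rather than radians (so that the cutoff is the voicing probability times the Nyquist frequency) are further assumptions.

```python
def sine_wave_frequencies(pitch_hz, voicing_prob,
                          sample_rate=8000.0, noise_spacing_hz=100.0):
    """Sine-wave frequencies in Hz for one frame (illustrative sketch only)."""
    nyquist = sample_rate / 2.0
    # Step 3A: a pitch below the noise interval samples the whole band.
    if pitch_hz < noise_spacing_hz:
        return [k * pitch_hz for k in range(1, int(nyquist // pitch_hz) + 1)]
    # Step 1: cutoff omega_c = pi * P_v, expressed here in Hz.
    cutoff_hz = voicing_prob * nyquist
    # k_star is the largest integer with k_star * pitch <= cutoff.
    k_star = int(cutoff_hz // pitch_hz)
    voiced = [k * pitch_hz for k in range(1, k_star + 1)]
    # Step 3B (assumed form): continue above the last voiced harmonic
    # in steps of the noise interval up to the Nyquist frequency.
    base = k_star * pitch_hz
    n_noise = int((nyquist - base) // noise_spacing_hz)
    noise = [base + j * noise_spacing_hz for j in range(1, n_noise + 1)]
    return voiced + noise
```

With a 200 Hz pitch and a voicing probability of 0.5, this gives ten harmonics from 200 Hz to 2000 Hz followed by twenty noise-region frequencies from 2100 Hz to 4000 Hz.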
There are various methods for dealing with the fact that the number of sine waves in each frame is not constant. One prior art approach uses a coding method based on the cepstral representation; see R. J. McAulay, T. F. Quatieri, "Sine-Wave Amplitude Coding Using High-order Allpole Models", Proc. of EURSIP-94, pp. 395-398, 1991. Another method uses an all-pole model for coding, which yields a fixed number of amplitudes in each frame; see T. F. Quatieri, R. J. McAulay, "Speech Transformations Based on a Sinusoidal Representation", IEEE Trans. on Acoust., Speech, and Signal Process., ASSP-34:1449-1464, 1986, and A. M. Kondoz, "INMARSAT-M: Quantization of Transform Components for Speech Coding at 1200 bps", IEEE Publication CD-ROM, 1991. Lupini used vector quantization of the harmonic magnitudes for speech coding; see P. Lupini, V. Cuperman, "Vector Quantization of Harmonic Magnitudes for Low-rate Speech Coders", Proc. of IEEE Globecom, San Francisco, pp. 165-208, 1992.
McAulay proposed using the cepstrum to represent the amplitude parameters in the sine-wave transform coder. This representation lends itself to the minimum-phase model and does not involve calculating the phase response of filters.
FIG. 3 is a schematic of the 2.4 kbps STC vocoder according to McAulay. The speech is passed through a Hamming window 300 to obtain the speech frame for analysis. After the speech frame is transformed by a fast Fourier transform (FFT) 310, it is processed by pitch estimation 320 and a pre-process 330 (spectral envelope estimation vocoder; SEEVOC) to obtain the sine-wave amplitude envelope. The cepstral coefficients 340 are then computed from the envelope and converted by a cosine transform into a group of channel gains that represent the amplitude. Next, the channel gains are fed to DPCM 360 for quantization, with scalar quantization applied in accordance with the voicing probability 365 and the pitch estimate.
In synthesis, the quantized channel gains are processed by inverse DPCM 360a and cosine transform 350a to recover the cepstral parameters. The cepstral parameters are then converted by the inverse cepstrum 340a into the spectral envelope 330a. The harmonic amplitudes 320a are obtained from the spectral envelope 330a and the harmonic frequencies of the pitch. The phase 315a of the synthesized signal is generated from three components. First, the phase component of the glottal and vocal tract system is obtained from the cepstrum. Second, the phase component of the excitation is obtained from the pitch. Third, the phase residual is calculated from the voicing probability. The resulting amplitudes, phases, and frequencies are passed through frame-to-frame matching 310a, which includes birth-death matching and linear interpolation, so that the signal remains continuous between neighboring speech frames. Finally, the synthesized speech is output after the synthesis step 305a.
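The DPCM stage 360 and its inverse 360a can be illustrated with a minimal Python sketch. The text does not specify the predictor, the step size, or whether the differences are taken across channels or across frames, so the sketch assumes a first-order predictor running across the channel gains of one frame and a uniform quantizer with a hypothetical step parameter; bit allocation and code clamping are omitted.

```python
def dpcm_encode(gains, step):
    """Quantize each gain as an integer multiple of step relative to the
    previously reconstructed gain (assumed first-order predictor)."""
    codes = []
    prediction = 0.0
    for gain in gains:
        code = round((gain - prediction) / step)
        codes.append(code)
        # Track the decoder's reconstruction so errors do not accumulate.
        prediction += code * step
    return codes

def dpcm_decode(codes, step):
    """Rebuild the gains by accumulating the quantized differences."""
    gains = []
    value = 0.0
    for code in codes:
        value += code * step
        gains.append(value)
    return gains
```

Because the encoder predicts from the reconstructed value rather than from the original one, each decoded gain stays within half a step of its input.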
FIG. 4 shows the amplitude coding method of McAulay used in FIG. 3. The speech signal of each speech frame is first transformed to the short-time spectral domain by the FFT 310. SEEVOC 330 is then applied to obtain the sine-wave amplitude envelope. Next, linear interpolation 400, spectral warping 410, low-pass filtering 420, and the cepstrum 340 are applied in turn to obtain the cepstral parameters for low bit rate quantization.
Subsequently, the cepstral parameters are converted by a cosine transform into the channel gains. The next step is quantization, for which DPCM or vector quantization can be used. The quality of the signal synthesized by this method is acceptable; however, the tone sounds low and heavy. McAulay added a post filter at the receiver to solve this problem. Decoding applies the inverse of the aforementioned steps: inverse DPCM 360a, cosine transform 350a, and inverse cepstrum 340a are applied, and the post filter 420a is then used to correct the low, heavy tone. The processed signal is subsequently fed to inverse spectral warping 410a and harmonic sampling 405. Finally, the synthesized speech is output after synthesis.
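The cepstral and cosine-transform steps of FIG. 4 can be sketched in Python as follows. This is only a sketch under stated assumptions: the log-amplitude envelope is taken to be sampled uniformly from 0 to pi after interpolation, the spectral warping and low-pass filtering stages are omitted because their exact form is not given, and the cepstral coefficients follow the convention log A(w) = c[0] + 2 * sum(c[m] * cos(m*w)). The function names and the choice of channel frequencies are hypothetical.

```python
import math

def cepstrum_from_log_envelope(log_env, order):
    """Cepstral coefficients c[0..order] from a log-amplitude envelope.

    log_env holds samples at omega = pi * k / K for k = 0..K, and the
    integral (1/pi) * int_0^pi log A(w) cos(m w) dw is approximated
    with the trapezoidal rule.
    """
    last = len(log_env) - 1
    coeffs = []
    for m in range(order + 1):
        total = 0.0
        for k, value in enumerate(log_env):
            weight = 0.5 if k in (0, last) else 1.0  # trapezoid end points
            total += weight * value * math.cos(m * math.pi * k / last)
        coeffs.append(total / last)
    return coeffs

def channel_gains(cepstrum, channel_omegas):
    """Cosine transform of the cepstrum: log-amplitude at each channel."""
    return [cepstrum[0] + 2.0 * sum(c * math.cos(m * w)
                                    for m, c in enumerate(cepstrum) if m > 0)
            for w in channel_omegas]
```

Truncating the cepstrum to a low order smooths the envelope, which is what lets a small, fixed number of channel gains stand in for a varying number of sine-wave amplitudes.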
Most of the quantization bits are used for amplitude quantization, so the quality of the synthesized speech depends primarily on the fidelity of the amplitude quantization. Although McAulay improved conventional sine-wave coding by using frequency warping, the issue of the sound pressure level has not yet been adequately addressed.