1. Field of the Invention
The present invention relates to a speech coding apparatus and a pitch prediction method in speech coding, and particularly to a speech coding apparatus using a pitch prediction method in which pitch information concerning an input excitation waveform for speech coding is obtained with as few computations as possible, and to a pitch prediction method for an input speech signal.
2. Description of the Related Art
A speech coding method represented by the CELP (Code Excited Linear Prediction) system is performed by modeling the speech information using a speech waveform and an excitation waveform, and separately coding the spectral envelope information corresponding to the speech waveform and the pitch information corresponding to the excitation waveform, both of which are extracted from input speech information divided into frames.
As a method for performing such speech coding at a low bit rate, G.723.1 was recently recommended by the ITU-T. Coding according to G.723.1 is based on the principles of linear prediction analysis-by-synthesis and attempts to minimize a perceptually weighted error signal. The search for pitch information in this case exploits the characteristic that a speech waveform changes periodically in a vowel range in correspondence with the vibration of the vocal cords; this is called pitch prediction.
A pitch prediction method applied in a conventional speech coding apparatus is explained with reference to FIG. 1. FIG. 1 is a block diagram of a pitch prediction section in a conventional speech coding apparatus.
An input speech signal is divided into frames and sub-frames. An excitation pulse sequence X[n] generated in the immediately preceding sub-frame is input to pitch reproduction processing section 1, where pitch emphasis processing is applied to it for the current target sub-frame.
Linear predictive synthesis filter 2 applies, at multiplier 3, system filter processing such as formant processing and harmonic shaping processing to the output data Y[n] from pitch reproduction processing section 1.
The coefficients of linear predictive synthesis filter 2 are set using a linear predictive coefficient A'(z) normalized by the LSP (line spectrum pair) quantization of a linear predictive coefficient A(z) obtained by linear predictive analysis of an input speech signal y[n], a perceptual weighting coefficient W(z) used in the perceptual weighting of the input speech signal y[n], and a coefficient P(z) of a harmonic noise filter that shapes the waveform of the perceptually weighted signal.
Pitch predictive filter 4 is a five-tap filter that applies, at multiplier 5, filter processing with predetermined coefficients to the output data t'[n] from multiplier 3. The coefficients are set by sequentially reading out codewords from adaptive codebook 6, in which a codeword of an adaptive vector corresponding to each pitch period is stored. Further, when coded speech data are decoded, pitch predictive filter 4 serves to generate a pitch period that sounds more natural and closer to human speech when a current excitation pulse sequence is generated from a previous excitation pulse sequence.
Further, adder 7 outputs an error signal r[n]. The error signal r[n] is the error between the output data p[n] from multiplier 5, which is the signal processed by pitch predictive filtering, and the pitch residual signal t[n] of the current sub-frame (the residual signal of the formant processing and the harmonic shaping processing). An index in adaptive codebook 6 and a pitch length are obtained as the optimal pitch information such that the error signal r[n] is minimized by the least squares method.
The calculation processing in a pitch prediction method described above is performed in the following way.
First, the pitch reproduction calculation performed in pitch reproduction processing section 1 is explained briefly using FIG. 1.
The excitation pulse sequence X[n] of a certain pitch is sequentially input to a buffer that can hold 145 samples, and the pitch reproduced excitation sequence Y[n] of 64 samples is then obtained according to equations (1) and (2) below, where Lag indicates a pitch period.
Y(n) = X(145 - Lag - 2 + n), n = 0, 1 (1)
Y(n) = X(145 - Lag + (n - 2) % Lag), n = 2, ..., 63 (2)
That is, equations (1) and (2) indicate that the current pitch information (vocal cord vibration) is imitated using a previous excitation pulse sequence.
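Equations (1) and (2) can be illustrated with a minimal sketch. The function name `reproduce_pitch` and the test data are hypothetical, and the bound check on the lag is an assumption added only to keep every index inside the 145-sample buffer; it is not stated in the text.

```python
BUFFER_LEN = 145   # excitation history length, from the text
OUT_LEN = 64       # 60-sample sub-frame plus 4 extra samples


def reproduce_pitch(x, lag):
    """Build Y[0..63] from the excitation history x per equations (1) and (2)."""
    if len(x) != BUFFER_LEN:
        raise ValueError("x must hold exactly 145 samples")
    # Assumed bound: keeps 145 - lag - 2 non-negative and the modulo well defined.
    if not 1 <= lag <= BUFFER_LEN - 2:
        raise ValueError("lag must satisfy 1 <= lag <= 143")
    y = [0.0] * OUT_LEN
    for n in range(2):                      # equation (1)
        y[n] = x[BUFFER_LEN - lag - 2 + n]
    for n in range(2, OUT_LEN):             # equation (2): repeat one pitch period
        y[n] = x[BUFFER_LEN - lag + (n - 2) % lag]
    return y


x = list(range(BUFFER_LEN))                 # made-up history: x[i] == i
y = reproduce_pitch(x, lag=40)
print(y[:4], y[40:44])                      # [103, 104, 105, 106] [143, 144, 105, 106]
```

With the made-up history, the output wraps back to sample 105 at n = 42, which shows the last Lag samples of the buffer being repeated periodically.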
Further, the convolution data (filtered data) t'[n] is obtained by the convolution of this pitch reproduced excitation sequence Y[n] and an output from linear predictive synthesis filter 2 according to equation (3) below. ##EQU1##
Since the pitch prediction processing is performed using a fifth-order FIR (finite impulse response) pitch predictive filter, five sets of convolution data t'[n] are necessary, from Lag-2 up to Lag+2, as shown in equation (4) below, where Lag is the current pitch period.
Because of this processing, as shown in FIG. 2, the pitch reproduced excitation data Y[n] requires 64 samples, which is 4 samples more than the 60 samples forming a sub-frame (the span from Lag-2 up to Lag+2 accounts for the 4 extra samples), ##EQU2##
where l is the first index of the two-dimensional array t'(l)(n), which indicates that the processing is repeated five times.
However, as a method to reduce calculations in a DSP or the like, the convolution data t'(4)(n) is obtained using equation (3) when l = 4, and the remaining data are obtained using equation (5) below when l = 0 to 3.
t'(l)(n) = I(l)·Y(n) + t'(l+1)(n-1), 0 ≤ l ≤ 3, 0 ≤ n ≤ 59 (5)
By using equation (5), 60 convolution operations suffice, whereas 1,830 convolution operations are required without equation (5).
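The two operation counts quoted above can be checked with a short sketch. It assumes that equation (3), which is not reproduced in this text, is a causal convolution truncated to the sub-frame, i.e. out[n] = sum over j = 0..n of h[j]·y[n-j]; the function name and the sample data are hypothetical.

```python
SUBFRAME_LEN = 60


def truncated_convolution(h, y, n_out=SUBFRAME_LEN):
    """Direct convolution out[n] = sum_{j=0..n} h[j] * y[n - j].

    Returns the output samples and the number of multiplications used.
    """
    out = []
    multiplies = 0
    for n in range(n_out):
        acc = 0.0
        for j in range(n + 1):
            acc += h[j] * y[n - j]
            multiplies += 1
        out.append(acc)
    return out, multiplies


h = [0.9 ** j for j in range(SUBFRAME_LEN)]   # made-up impulse response
y = [1.0] * SUBFRAME_LEN                      # made-up excitation
_, direct_multiplies = truncated_convolution(h, y)

# Equation (5) uses one multiplication per output sample n = 0..59.
recursive_multiplies = SUBFRAME_LEN

print(direct_multiplies, recursive_multiplies)   # 1830 60
```

Under that assumption the direct form costs 1 + 2 + ... + 60 = 60·61/2 = 1,830 multiplications for one 60-sample sequence, while the recursive form costs 60.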
Further, the optimal value of the convolution data p(n) in pitch predictive filter 4 is obtained using the pitch residual signal t(n) so that the error signal r(n) is minimized. In other words, the error signal r(n) shown in equation (6) below is minimized by searching codebook 6 for the adaptive codebook data of a pitch corresponding to the five filter coefficients of the fifth-order FIR pitch predictive filter 4.
r(n) = t(n) - p(n) (6)
The estimation of error is obtained using the least squares method according to equation (7) below. ##EQU3##
Accordingly, equation (8) below is given. ##EQU4##
Further, equation (9) below is given. ##EQU5##
By substituting equation (9) into equation (8), the adaptive codebook data of a pitch that minimizes the error, in other words, the index of that adaptive codebook data, is obtained.
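The search described here can be sketched as a brute-force minimization of the squared error from equation (6). This is not the expanded form of equations (7) to (9), which are not reproduced in this text; it assumes that p(n) is the sum over the five taps of gain b(l) times t'(l)(n), and the names `error_energy` and `search_codebook` as well as the tiny codebook are hypothetical.

```python
def error_energy(t, t_filt, gains):
    """Sum of r(n)^2 with r(n) = t(n) - p(n), p(n) = sum_l gains[l] * t_filt[l][n]."""
    total = 0.0
    for n in range(len(t)):
        p_n = sum(g * seq[n] for g, seq in zip(gains, t_filt))
        r_n = t[n] - p_n
        total += r_n * r_n
    return total


def search_codebook(t, t_filt, codebook):
    """Return (index, energy) of the gain vector with the smallest error energy."""
    best_index, best_energy = -1, float("inf")
    for index, gains in enumerate(codebook):
        energy = error_energy(t, t_filt, gains)
        if energy < best_energy:
            best_index, best_energy = index, energy
    return best_index, best_energy


# Made-up data: five filtered sequences of 4 samples and a 3-entry codebook.
t_filt = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0, 1.0],
]
codebook = [
    [0.0, 0.0, 1.0, 0.0, 0.0],
    [0.5, 0.0, 0.0, 0.0, 0.25],
    [0.0, 0.0, 0.0, 0.0, 1.0],
]
t = [0.75, 0.25, 0.25, 0.25]       # target built from codebook entry 1
print(search_codebook(t, t_filt, codebook))   # (1, 0.0)
```

A practical coder would precompute correlation terms rather than evaluate the error sample by sample for every codeword; the direct form is used here only because it maps one-to-one onto equation (6).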
Further, the closed-loop pitch information and the index of the adaptive codebook data of a pitch are obtained by repeating the above operation for Lag-1 up to Lag+1 as a re-search, so that the pitch period information at this time is obtained correctly. The number of re-searches is determined by the setting of the k parameter. When pitch prediction is repeated in the order Lag-1, Lag, and Lag+1, k is set to 2 (iterations 0, 1, and 2), so the number of repetitions is 3.
The processing is further applied to each sub-frame. The re-search range of a pitch period for an even-numbered sub-frame is from Lag-1 to Lag+1, which sets k=2 (the number of repetitions is 3). The re-search range of a pitch period for an odd-numbered sub-frame is from Lag-1 to Lag+2, which sets k=3 (the number of repetitions is 4). The pitch search processing is performed according to the ranges described above, and since one frame is composed of four sub-frames, the same processing is repeated four times in one frame.
However, in the prior-art configuration described above, the convolution processing shown in equation (4) is necessary each time the pitch reproduction processing is performed, so the number of convolution passes required in one frame is 14 (3+4+3+4), the total given by the k parameter. This raises the problem that the amount of computation increases when the processing is performed in a DSP (CPU).
In addition, the pitch reproduction processing must be repeated the number of times corresponding to the k parameter, which likewise increases the amount of computation when the processing is performed in a DSP (CPU).
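The per-frame count of 14 follows directly from the re-search ranges, as the following sketch shows. It assumes the four sub-frames are numbered 0 to 3, so that sub-frames 0 and 2 are the even-numbered ones; the function name `research_lags` and the example lag of 40 are hypothetical.

```python
SUBFRAMES_PER_FRAME = 4


def research_lags(lag, subframe_index):
    """Candidate pitch periods for one sub-frame, following the ranges in the text.

    Even-numbered sub-frames use k = 2 (Lag-1 .. Lag+1);
    odd-numbered sub-frames use k = 3 (Lag-1 .. Lag+2).
    """
    k = 2 if subframe_index % 2 == 0 else 3
    return [lag - 1 + i for i in range(k + 1)]


passes = [len(research_lags(40, sf)) for sf in range(SUBFRAMES_PER_FRAME)]
print(research_lags(40, 1))   # [39, 40, 41, 42]
print(passes, sum(passes))    # [3, 4, 3, 4] 14
```

Each candidate lag requires its own pitch reproduction and its own set of convolutions, which is why the total number of passes per frame equals the sum of k+1 over the four sub-frames.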