This invention relates to a multi-pulse type vocoder.
There is known a type of vocoder which analyzes an input speech signal to extract, at the analysis side, spectrum envelope information and excitation source information, and reproduces the input speech signal, on the synthesis side, on the basis of this speech information transmitted through a transmission line.
The spectrum envelope information represents spectrum distribution information of the vocal tract and is normally expressed by an LPC coefficient such as the .alpha. parameter and K parameter. The excitation source information indicates a microstructure of the spectrum envelope and is known as the residual signal obtained by removing the spectrum distribution information from the input speech signal; it includes the strength of the excitation source, the pitch period and the voiced-unvoiced information of the input speech signal. The spectrum envelope information and the excitation source information are utilized as a coefficient and an excitation source, respectively, for the LPC synthesizer based on an all-pole type digital filter.
A conventional LPC vocoder is capable of synthesizing speech even at a low bit rate of about 4 kb/s or below. However, high quality speech synthesis is hard to attain even at high bit rates, for the following reason. In the conventional vocoder, a voiced sound is approximated by a single impulse train corresponding to the pitch period extracted on the analysis side, and an unvoiced sound is approximated by white noise with a random period. Therefore, the excitation source information of the input speech signal is not extracted faithfully; that is, the waveform information of the input speech signal is practically not extracted.
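The two-state excitation of the conventional vocoder described above can be pictured with the following minimal Python sketch. The function name `conventional_excitation`, its parameters, the use of Gaussian noise for the unvoiced case and the frame length of 160 samples (20 msec at an assumed 8 kHz sampling rate) are illustrative assumptions and are not taken from the source.

```python
import random


def conventional_excitation(n_samples, voiced, pitch_period=None, gain=1.0, seed=0):
    """Return one frame of a classical two-state LPC excitation (illustrative only).

    voiced   -> a single impulse train with one pulse every `pitch_period` samples
    unvoiced -> white noise (Gaussian here, which is an assumption)
    """
    if voiced:
        if not pitch_period or pitch_period < 1:
            raise ValueError("a voiced frame needs a positive pitch_period")
        # One impulse per pitch period, zero everywhere else.
        return [gain if n % pitch_period == 0 else 0.0 for n in range(n_samples)]
    # A fixed seed keeps the sketch reproducible.
    rng = random.Random(seed)
    return [gain * rng.gauss(0.0, 1.0) for _ in range(n_samples)]


# Hypothetical usage: a 160-sample frame with a pitch period of 40 samples.
voiced_frame = conventional_excitation(160, True, pitch_period=40)
unvoiced_frame = conventional_excitation(160, False)
```

Whatever the actual residual waveform looked like, only the pitch period, the gain and the voiced-unvoiced decision survive in such an excitation, which is the loss of waveform information noted above.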
The recently developed multi-pulse type vocoder carries out an analysis and a synthesis based on waveform information in order to eliminate the above problem. For more information on the multi-pulse type vocoder, reference is made to the report by Bishnu S. Atal and Joel R. Remde, "A NEW MODEL OF LPC EXCITATION FOR PRODUCING NATURAL-SOUNDING SPEECH AT LOW BIT RATES", PROC. ICASSP 82, pp. 614 to 617 (1982).
In this vocoder, an excitation source series is expressed by a multi-pulse excitation source consisting of a plurality of impulse series (multi-pulse). The multi-pulse is developed through the so-called A-b-S (Analysis-by-Synthesis) procedure which will be briefly described hereinafter.
The LPC coefficient of an input speech signal X(n) obtainable at each of the analysis frames is supplied as the filter coefficient of the LPC synthesizer (digital filter). An excitation source series V(n) consisting of a plurality of impulse series, namely a multi-pulse, is supplied to the LPC synthesizer as the excitation source. Then, the difference between a synthesized signal X̂(n) obtained in the LPC synthesizer and the input speech signal X(n), i.e. an error signal e(n), is obtained using a subtracter. Thereafter an aural weighting factor is applied to the error signal in an aural weighter. Next, the excitation source series V(n) is determined in a square error minimizer so that the cumulative square sum (square error) of the weighted error signal in the frame is minimized. Such a multi-pulse determination according to the A-b-S procedure is repeated for each pulse, thus determining the optimum position and amplitude of each pulse of the multi-pulse.
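The pulse-by-pulse A-b-S search can be sketched roughly as follows in Python. This is a simplified illustration under stated assumptions, not the procedure of the cited report: the aural weighting is omitted (equivalently, the inputs are assumed to be already weighted), the synthesis filter is represented by a truncated impulse response `h`, indices are 0-based, and the name `abs_pulse_search` is hypothetical.

```python
def abs_pulse_search(x, h, num_pulses):
    """Greedy analysis-by-synthesis search for pulse positions and amplitudes.

    x: target samples of one frame (assumed already weighted)
    h: truncated impulse response of the synthesis filter (same assumption)
    Returns a list of (position, amplitude) pairs, one per pulse.
    """
    n_frame = len(x)
    residual = list(x)  # target minus the contribution of the pulses found so far
    pulses = []
    for _ in range(num_pulses):
        total = sum(e * e for e in residual)
        best = None  # (squared error, position, amplitude)
        for m in range(n_frame):
            resp = h[: n_frame - m]  # filter response to a unit pulse placed at m
            energy = sum(v * v for v in resp)
            if energy == 0.0:
                continue
            corr = sum(residual[m + j] * v for j, v in enumerate(resp))
            g = corr / energy        # least-squares amplitude for this position
            err = total - g * corr   # squared error left after adding this pulse
            if best is None or err < best[0]:
                best = (err, m, g)
        if best is None:
            break  # degenerate case: the impulse response is all zeros
        _, m, g = best
        for j, v in enumerate(h[: n_frame - m]):
            residual[m + j] -= g * v  # remove the new pulse's synthesized contribution
        pulses.append((m, g))
    return pulses
```

Every candidate position is evaluated against the current error for every pulse, which illustrates why the operation count of a direct search grows quickly with the frame length and the number of pulses.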
The multi-pulse type vocoder described above may realize high quality speech synthesis at a low transmission bit rate. However, the number of arithmetic operations is unavoidably huge due to the A-b-S procedure.
In view of the above situation, a procedure for efficiently calculating an optimum multi-pulse according to a correlation operation has been proposed. Reference is made to a report by K. Ozawa, T. Araseki and S. Ono, "EXAMINATION ON MULTI-PULSE DRIVING SPEECH CODING PROCEDURE", Meeting for Study on Communication System, Institute of Electronics and Communication Engineers of Japan, Mar. 23, 1983, CAS82-202, CS82-161. Further, the technique is disclosed in U.S. patent application Ser. No. 565,804 filed Dec. 27, 1983 by Kazumori Ozawa et al, assignors to the present assignee. An algorithm of this procedure is as follows:
Assume that k excitation source pulses are present in one analysis frame, and that the i-th pulse is at a time position m.sub.i from the frame end and has an amplitude g.sub.i. Then an excitation source d(n) of the LPC synthesis filter is given by the following expression (1): ##EQU1## where .delta..sub.n, m.sub.i is Kronecker's delta function, with .delta..sub.n, m.sub.i =1 (n=m.sub.i) and .delta..sub.n, m.sub.i =0 (n.noteq.m.sub.i).
The LPC synthesis filter is driven by the excitation source d(n) and outputs a synthesis signal x̂(n). For example, an all-pole digital filter may be used as the LPC synthesis filter, and when its transfer function is expressed by an impulse response h(n) (1.ltoreq.n.ltoreq.N.sub.h), where N.sub.h is a predetermined number, the synthesis signal x̂(n) can be given by the following expression (2). ##EQU2## where N denotes the last sample number in the analysis frame, and d(l) denotes the l-th sample of d(n) in the expression (1).
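As an illustration of the two steps just described, the following Python sketch builds an excitation d(n) from pulse positions and amplitudes and then forms the synthesis signal by convolving d(n) with a truncated impulse response. The function names `build_excitation` and `synthesize` are hypothetical, indices are 0-based (the text counts from 1), and output samples falling beyond the frame are simply dropped, which is an assumption.

```python
def build_excitation(positions, amplitudes, frame_len):
    """d(n): amplitude g_i at each pulse position m_i, zero elsewhere."""
    d = [0.0] * frame_len
    for m, g in zip(positions, amplitudes):
        d[m] += g
    return d


def synthesize(d, h):
    """Synthesis signal as the sum over l of d(l) * h(n - l), kept within the frame."""
    n_frame = len(d)
    x_hat = [0.0] * n_frame
    for l, d_l in enumerate(d):
        if d_l == 0.0:
            continue  # only the pulse positions contribute
        for j, h_j in enumerate(h):
            n = l + j
            if n >= n_frame:
                break
            x_hat[n] += d_l * h_j
    return x_hat


# Hypothetical usage: two pulses and a three-tap impulse response.
d = build_excitation([1, 4], [2.0, -1.0], 6)
x_hat = synthesize(d, [1.0, 0.5, 0.25])
```

Because d(n) is nonzero only at the k pulse positions, the synthesis signal is a sum of k shifted and scaled copies of the impulse response.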
Next, a weighted error e.sub.w (n), obtained by applying the aural weighting to the error between the signals x(n) and x̂(n), is given by the expression (3). EQU e.sub.w (n) ={x(n)-x̂(n)}w(n) (3)
Further, the square error can be indicated by the expression (4) by using the expression (3). ##EQU3##
The multi-pulse as an optimum excitation source pulse series is obtained by finding the g.sub.i which minimizes the expression (4), and g.sub.i is derived as the following expression (5) from the above expressions (1), (2) and (4). ##EQU4## where x.sub.w (n) indicates x(n) x w(n), and h.sub.w (n) indicates h(n)x w(n). The first term of the numerator on the right side of the expression (5) indicates a cross-correlation function .phi..sub.hx (m.sub.i) at time lag m.sub.i between x.sub.w (n) and h.sub.w (n), and ##EQU5## of the second term indicates a covariance function .phi..sub.hh (m.sub.l, m.sub.i) (1.ltoreq.m.sub.l, m.sub.i .ltoreq.N) of h.sub.w (n). The covariance function .phi..sub.hh (m.sub.l, m.sub.i) is equal to an autocorrelation function R.sub.hh (.vertline.m.sub.l -m.sub.i .vertline.). Therefore, expression (5) can be represented by the following expression (6). ##EQU6##
According to the expression (6), the i-th pulse of the multi-pulse is determined from the maximum value of g.sub.i (m.sub.i) and the time position at which that maximum occurs.
According to this algorithm, the multi-pulse can be developed through the calculation of the cross-correlation function and the autocorrelation function. Therefore, the procedure is substantially simplified, and the number of arithmetic operations is decreased sharply.
Be that as it may, this improved multi-pulse type vocoder is still not free from the following problems.
In this algorithm, where the cross-correlation function .phi..sub.hx (m.sub.i) and the autocorrelation function R.sub.hh differ largely in form at the time point m.sub.i, .phi..sub.hx (m.sub.i) does not necessarily decrease optimally; in consequence the number of pulses increases unnecessarily and the coding efficiency deteriorates.
According to the above-described algorithm, the time position and amplitude of each pulse of the multi-pulse are determined through the following procedure. First, the cross-correlation function .phi..sub.hx (m.sub.i) between the input signal and the impulse response, and the autocorrelation function R.sub.hh of the impulse response, are developed. The first pulse constituting the multi-pulse is placed at the time position m.sub.1 at which the absolute value of the waveform .phi..sub.hx (m.sub.i) thus obtained is maximized, and its amplitude is determined as the value .phi..sub.hx (m.sub.1) of .phi..sub.hx (m.sub.i) at the time position m.sub.1. Next, the component due to the first pulse is removed from the waveform of .phi..sub.hx (m.sub.i). That is, the waveform of R.sub.hh (normalized) is multiplied by .phi..sub.hx (m.sub.1), centered on the time position m.sub.1, and then subtracted from the waveform of .phi..sub.hx (m.sub.i). After the waveform of the correlation function from which the component due to the first pulse has been removed is thus obtained, the position and amplitude of the second pulse are determined from that waveform by the same procedure. The positions and amplitudes of the third, fourth, ..., l-th pulses are obtained by repeating this operation.
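The procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the exact algorithm of the cited application: the inputs `x_w` and `h_w` are taken to be the already weighted signal and impulse response, indices are 0-based, correlations are truncated at the frame end, and the amplitude is normalized by R.sub.hh (0), which corresponds to using the normalized autocorrelation waveform. The function names are hypothetical.

```python
def cross_correlation(x_w, h_w):
    """phi_hx(m): sum over n of x_w(n) * h_w(n - m), truncated at the frame end."""
    n_frame = len(x_w)
    phi = []
    for m in range(n_frame):
        acc = 0.0
        for j, h_j in enumerate(h_w):
            if m + j >= n_frame:
                break
            acc += x_w[m + j] * h_j
        phi.append(acc)
    return phi


def autocorrelation(h_w):
    """R_hh(k) for lags k = 0 .. len(h_w) - 1."""
    n_h = len(h_w)
    return [sum(h_w[j] * h_w[j + k] for j in range(n_h - k)) for k in range(n_h)]


def search_pulses(x_w, h_w, num_pulses):
    """Pick pulses one at a time from the cross-correlation waveform."""
    phi = cross_correlation(x_w, h_w)
    r = autocorrelation(h_w)
    pulses = []
    for _ in range(num_pulses):
        # Position: where the absolute value of the remaining waveform is largest.
        m_best = max(range(len(phi)), key=lambda m: abs(phi[m]))
        g = phi[m_best] / r[0]  # amplitude, normalized by R_hh(0) (assumption)
        pulses.append((m_best, g))
        # Remove this pulse's component: subtract the scaled autocorrelation
        # waveform centered on m_best.
        for m in range(len(phi)):
            lag = abs(m - m_best)
            if lag < len(r):
                phi[m] -= g * r[lag]
    return pulses
```

Each further pulse costs only one maximum search and one scaled subtraction, since the two correlation functions are computed once per frame.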
As described, according to the above correlation operation the influence of each previously obtained pulse is removed by subtracting the autocorrelation function waveform R.sub.hh from the cross-correlation function waveform .phi..sub.hx. However, the waveform of .phi..sub.hx (m.sub.i) and the waveform of R.sub.hh at the time position of each pulse are not necessarily analogous to each other, so that the subtraction may affect other portions of the waveform of .phi..sub.hx (m.sub.i). Therefore, an unnecessary pulse may be determined as one of the pulses of the multi-pulse, thus preventing optimum information compression.
In a conventional vocoder, the number of pulses of the multi-pulse in one frame is predetermined to be between 4 and 16 on the basis of the bit rate. However, the pitch period of a female voice or an infant voice is relatively short, for example 2.5 msec. In this case, when the frame period is 20 msec, one frame contains eight pitch periods, so the number of pulses to be set in one frame must be at least eight. If the number of pulses to be generated in the analysis frame is instead set at four, the synthesized speech includes a double pitch error, which may deteriorate the synthesized tone quality considerably. That is to say, the synthesis in this case cannot be regarded as faithfully based on the waveform information. The tone quality of the synthesized speech therefore deteriorates in accordance with the shortfall in pulse number.
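The pulse count in the example above follows from simple arithmetic, shown here as a small Python sketch. The function name `min_pulses_per_frame` is hypothetical, and the rule of one pulse per pitch period is the minimum implied by the text, not a complete allocation scheme.

```python
import math


def min_pulses_per_frame(frame_ms, pitch_period_ms):
    """Smallest pulse count that allows at least one pulse per pitch period."""
    # Note: floating-point division may round awkwardly for some inputs.
    return math.ceil(frame_ms / pitch_period_ms)


# The case from the text: a 20 msec frame and a 2.5 msec pitch period.
needed = min_pulses_per_frame(20.0, 2.5)
```

With only four pulses available for such a frame, at most every second pitch period can receive a pulse, which corresponds to the double pitch error mentioned above.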