The present invention relates to a pattern matching vocoder and, more particularly, to an LSP pattern matching vocoder.
An LSP (Line Spectrum Pairs) pattern matching vocoder is a typical example of a pattern matching vocoder. Such a vocoder compares reference voice patterns with the distribution pattern of the spectral envelope of input speech. The analyzer unit sends to the synthesizer unit the best matching reference pattern (i.e., the label of the reference pattern with minimum spectral distortion) as spectral envelope data, together with excitation source data. The synthesizer unit then synthesizes speech by decoding the spectral envelope data into speech synthesis filter coefficients according to the label of the reference pattern.
In a conventional pattern matching vocoder, the label of the best matching reference pattern is sent in place of the spectral envelope data, which greatly reduces the amount of transmitted data. In order to minimize the spectral distortion generated as a matching error, a weighting coefficient is applied to each vector element when a reference pattern is matched with the input speech.
In a conventional basic LSP pattern matching vocoder, matching between the input speech and a reference pattern is performed for each analysis frame using as a matching measure a spectral distance $D_{ij}$ given in equation (1) below: ##EQU1## where $S_i(\omega)$ and $S_j(\omega)$ are the logarithmic spectra of frames i and j, $P_k^{(i)}$ and $P_k^{(j)}$ are Mth-order LSP coefficients, and $W_k$ is a weighting coefficient applied to each of the first- to Mth-order LSP coefficients and is generally given by the spectral sensitivity.
The approximation in equation (1) is normally used which requires a smaller number of calculations. In this case, the number of vector elements is M.
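Equation (1) itself is not reproduced in this text. Judging only from the symbol definitions above and from the verbal description of the matching computation, the approximate form is presumably a weighted squared distance over the M LSP coefficients. The following is a reconstruction under that assumption, not the original equation:

```latex
D_{ij} \approx \sum_{k=1}^{M} W_k \left( P_k^{(i)} - P_k^{(j)} \right)^2
```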
Pattern matching normally selects the reference pattern that minimizes $D_{ij}$, i.e., the spectral distortion obtained by taking the difference between each pair of corresponding vector elements of the input speech and a reference pattern, squaring each difference, multiplying it by its weighting coefficient, and summing the weighted squared differences. A different weighting coefficient is applied to each vector element in order to minimize the spectral distortion.
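The matching step described above can be sketched as follows. This is a minimal illustration, not the vocoder's actual implementation: the function names, the three-element vectors, and the weight values are all hypothetical, chosen only to show the difference, square, weight, and sum sequence and the selection of the minimum.

```python
def weighted_distortion(input_lsp, reference_lsp, weights):
    """Weighted sum of squared element differences between two vectors."""
    return sum(
        w * (p - q) ** 2
        for w, p, q in zip(weights, input_lsp, reference_lsp)
    )


def best_matching_label(input_lsp, codebook, weights):
    """Return the index (label) of the reference pattern with minimum distortion."""
    return min(
        range(len(codebook)),
        key=lambda j: weighted_distortion(input_lsp, codebook[j], weights),
    )


# Hypothetical data: three reference patterns with three elements each.
# A real vocoder would use M LSP coefficients per analysis frame.
codebook = [
    [0.10, 0.30, 0.60],
    [0.20, 0.40, 0.70],
    [0.15, 0.35, 0.80],
]
weights = [2.0, 1.0, 0.5]   # assumed per-element weighting coefficients
frame = [0.19, 0.41, 0.72]  # assumed LSP vector of one input frame

label = best_matching_label(frame, codebook, weights)
print(label)
```

Only `label` would need to be transmitted in place of the full vector; the synthesizer unit would look up the same entry in its own copy of the reference patterns.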
The conventional LSP pattern matching vocoder has the following drawbacks.
(1) The reference vector patterns in the analyzer unit and the synthesizer unit in the LSP pattern matching vocoder are patterns clustered by a spectral equidistance. The input speech signal is synthesized by matching these reference vector patterns with LSP coefficient vector patterns extracted from the input speech.
However, the frequency of occurrence of the conventional reference vector patterns does not correspond linearly to that of the LSP coefficient vectors in the vector space. When reference vector pattern groups clustered at equal spectral distances are matched with the LSP patterns without regard to this mismatch, the differences between them cannot be reduced substantially. In other words, the quantization distortion in pattern matching has a lower limit.
(2) In a conventional pattern matching vocoder, a weighted sum of the squares of the differences between vector elements of the reference pattern and the input speech is used as a matching measure. The spectral sensitivity used as the weighting coefficient represents the spectral change corresponding to a small change in the spectral envelope and is preset in advance on the basis of speech information.
Weighting based on such spectral sensitivity amounts to a scheme that changes the spectral envelope uniformly in proportion to the weights. Therefore, the pole conditions (i.e., center frequency and bandwidth), which are strongly associated with hearing, are not separated out but are processed together. A "pole" is a root of $A_p(Z^{-1}) = 0$ in transfer function (2) of a vocal tract filter realized by an all-pole digital filter:

$$H(Z^{-1}) = \frac{1}{A_p(Z^{-1})} \qquad (2)$$

for

$$A_p(Z^{-1}) = 1 + \alpha_1 Z^{-1} + \alpha_2 Z^{-2} + \cdots + \alpha_p Z^{-p}$$

where $Z = \exp(j\lambda)$, $\lambda = 2\pi \Delta T f$, $\Delta T$ is the sampling period, $f$ is the frequency, $p$ is the order of the digital filter, and $\alpha_1$ to $\alpha_p$ are the pth-order LPC coefficients serving as control parameters of the all-pole digital filter.
However, hearing is more sensitive to a change in the center frequency of a pole than to a change in its bandwidth. Therefore, a scheme that evaluates and weights spectral distortion uniformly using the spectral sensitivity is not sound in principle.
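To make the two pole conditions concrete, the sketch below handles the simplest case of a second-order filter (p = 2) with one complex-conjugate pole pair. It assumes the commonly used relations between a pole at radius r and angle theta and its resonance, namely theta = 2π·f·ΔT and r = exp(−π·B·ΔT), where f is the center frequency and B the bandwidth. These relations and all function names are assumptions for illustration and are not stated in the text.

```python
import cmath
import math


def pole_to_lpc(center_hz, bandwidth_hz, sample_rate_hz):
    """Return (alpha_1, alpha_2) of 1 + a1*Z^-1 + a2*Z^-2 for one pole pair."""
    dt = 1.0 / sample_rate_hz                      # sampling period
    radius = math.exp(-math.pi * bandwidth_hz * dt)
    theta = 2.0 * math.pi * center_hz * dt
    return -2.0 * radius * math.cos(theta), radius * radius


def lpc_to_pole(alpha_1, alpha_2, sample_rate_hz):
    """Recover (center frequency, bandwidth) in Hz from second-order coefficients."""
    dt = 1.0 / sample_rate_hz
    # Multiplying A_p by Z^2 gives Z^2 + a1*Z + a2 = 0; take the root
    # in the upper half plane (the other root is its complex conjugate).
    root = (-alpha_1 + cmath.sqrt(alpha_1 * alpha_1 - 4.0 * alpha_2)) / 2.0
    center_hz = cmath.phase(root) / (2.0 * math.pi * dt)
    bandwidth_hz = -math.log(abs(root)) / (math.pi * dt)
    return center_hz, bandwidth_hz


# Hypothetical resonance: 500 Hz center, 80 Hz bandwidth, 8 kHz sampling.
a1, a2 = pole_to_lpc(500.0, 80.0, 8000.0)
center, bandwidth = lpc_to_pole(a1, a2, 8000.0)
print(round(center, 3), round(bandwidth, 3))
```

Because the center frequency depends only on the pole angle and the bandwidth only on the pole radius, the two quantities can in principle be evaluated separately, which a single uniform spectral weighting does not do.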
(3) A bandsplitting vocoder is known which performs LPC (Linear Prediction Coefficient) analysis on each of a plurality of ranges obtained by dividing the frequency band of an input speech signal. A vocoder of this type eliminates two drawbacks inherent to LSP analysis. First, the formant range is underestimated. Second, a higher-order formant with small energy, e.g., the third formant, is approximated less accurately than the first formant. Both drawbacks are thought to be caused by excessive concentration of poles in the frequency region where the energy of the first formant is concentrated. To prevent the poles from concentrating in a specific frequency region, the bandsplitting vocoder divides the frequency band into a plurality of frequency regions, each of which is subjected to LPC analysis, thereby eliminating the above two drawbacks. However, when the frequency band is divided into a large number of frequency regions, the respective regions tend to have uniform energy profiles, and no band compression of the input speech signal is achieved. In general, therefore, the frequency band is divided into two to four frequency regions. The split frequency regions need not be of equal width; they are determined at a logarithmic ratio such that the formants, as poles of the spectral envelope, fall within separate frequency regions. In a bandsplitting vocoder of this type, however, discontinuities occur in the spectrum between bands at the synthesizer unit, degrading the quality of the synthesized sound.
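One possible reading of "determined at a logarithmic ratio" is that adjacent band edges share a constant frequency ratio. The sketch below illustrates only that reading. The function name and the 250 Hz to 4000 Hz range are hypothetical, and a real design would place the edges so that each formant falls in its own band.

```python
def log_band_edges(low_hz, high_hz, num_bands):
    """Band edges whose adjacent frequencies share a constant ratio."""
    ratio = (high_hz / low_hz) ** (1.0 / num_bands)
    return [low_hz * ratio ** i for i in range(num_bands + 1)]


# Hypothetical example: four bands between 250 Hz and 4000 Hz.
edges = log_band_edges(250.0, 4000.0, 4)
bands = list(zip(edges[:-1], edges[1:]))  # (lower edge, upper edge) per band
print(bands)
```

With these values each band spans one octave, so the low bands are narrow and the high bands are wide, unlike a split at equal intervals.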
(4) Instead of matching reference patterns with the input speech vectors and sending the selected reference pattern for every analysis frame, L reference patterns corresponding to L representative analysis frames are selected for each section of K consecutive analysis frames. The analyzer unit sends these L reference patterns to the synthesizer unit together with, for each of them, the number of analysis frames it represents, i.e., a repeat bit. Thus, for each section, the optimal reference pattern labels of the representative analysis frames are sent. In other words, the designation code is sent together with the repeat bit to the synthesizer unit in the vocoder. The representative analysis frames for each section are obtained by approximating the spectral envelope parameter profile of all analysis frames in the section with an optimal approximation function. The approximation function can be rectangular, trapezoidal, or linear, depending on the application of the vocoder. Normally, the optimal function is selected by the DP (dynamic programming) method.
When the optimal approximation is performed using a rectangular approximation function, the contents of the K analysis frames of each section are expressed by the contents of the L analysis frames constituting the rectangular function and by the number of analysis frames that each of them represents.
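A rectangular approximation of this kind can be sketched with dynamic programming as follows. This is an illustrative sketch under stated assumptions, not the procedure of any particular vocoder: each segment is represented by one of its own frames, distortion is a plain squared distance, and the function names and sample data are hypothetical.

```python
def squared_distance(a, b):
    """Unweighted squared distance between two parameter vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def segment_cost(frames, start, end):
    """Return (distortion, representative index) for frames[start:end]."""
    best = None
    for rep in range(start, end):
        cost = sum(squared_distance(frames[t], frames[rep]) for t in range(start, end))
        if best is None or cost < best[0]:
            best = (cost, rep)
    return best


def rectangular_approximation(frames, num_segments):
    """Split K frames into L contiguous segments with minimum total distortion.

    Returns a list of (representative frame index, repeat count) pairs.
    """
    k = len(frames)
    inf = float("inf")
    # table[m][e] = (cost, segment start, representative) for frames[0:e] in m segments
    table = [[(inf, -1, -1)] * (k + 1) for _ in range(num_segments + 1)]
    table[0][0] = (0.0, -1, -1)
    for m in range(1, num_segments + 1):
        for end in range(m, k + 1):
            for start in range(m - 1, end):
                previous = table[m - 1][start][0]
                if previous == inf:
                    continue
                cost, rep = segment_cost(frames, start, end)
                if previous + cost < table[m][end][0]:
                    table[m][end] = (previous + cost, start, rep)
    # Trace back from the last frame to recover the segments.
    result, end = [], k
    for m in range(num_segments, 0, -1):
        _, start, rep = table[m][end]
        result.append((rep, end - start))
        end = start
    result.reverse()
    return result


# Hypothetical section of K = 6 one-dimensional frames reduced to L = 3.
section = [[0.0], [0.1], [0.0], [5.0], [5.1], [9.0]]
segments = rectangular_approximation(section, 3)
print(segments)
```

Each pair gives a representative frame and the number of consecutive frames it stands for, which corresponds to the contents and the repeat information described above.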
In a conventional variable frame length pattern matching vocoder of this type, the selection of the representative frames that constitute a variable length frame and the selection of reference patterns by pattern matching are performed independently. The spectral distortion generated during pattern matching (i.e., quantization distortion) and the so-called time distortion, which arises from the spectral distance incurred when frames are replaced by their representative frames, are therefore introduced independently of each other. Speech analysis and synthesis are performed in this state, which inevitably degrades the quality of the synthesized sound.