The present invention relates to a linear prediction process, and corresponding apparatus, for reducing the redundancy in the digital processing of speech. It is particularly directed to a speech processing system in which the speech signal is analysed to determine parameters relating to a model speech filter, pitch and volume.
Speech processing systems of this type, so-called LPC vocoders, afford a substantial reduction in redundancy in the digital transmission of voice signals. They are becoming increasingly popular and are the subject of numerous publications, representative examples of which include:
B. S. Atal and S. L. Hanauer, J. Acoust. Soc. Am., 50, pp. 637-655, 1971;
R. W. Schafer and L. R. Rabiner, Proc. IEEE, Vol. 63, No. 4, pp. 662-667, 1975;
L. R. Rabiner et al., Trans. Acoustics, Speech and Signal Proc., Vol. 24, No. 5, pp. 399-418, 1976;
B. Gold, IEEE, Vol. 65, No. 12, pp. 1636-1658, 1977;
A. Kurematsu et al., Proc. IEEE, ICASSP, Washington 1979, pp. 69-72;
S. Horwath, "LPC-Vocoders, State of Development and Outlook", Collected Volume of Symposium Papers "War in the Ether", No. XVII, Bern 1978;
U.S. Pat. Nos. 3,624,302; 3,361,520; 3,909,533; 4,230,905.
Presently known and available LPC vocoders do not operate in a fully satisfactory manner. Even though the speech that is synthesized after analysis is in most cases relatively comprehensible, it is distorted and sounds artificial. A principal cause of this condition, among others, is the difficulty of deciding with adequate reliability whether a voiced or an unvoiced speech section is present. Further causes are the inadequate determination of the pitch period and the inaccurate determination of the sound-forming filter parameters.
The present invention is primarily concerned with the first of these difficulties. Its object is to improve a digital speech synthesizing process and system of the previously described type so as to provide a correct and reliable voiced/unvoiced decision and thus an improvement in the quality of the synthesized speech.
A series of decision criteria are used for the voiced/unvoiced classification and are applied individually or partly in combination. Conventional criteria include, for example, the energy of the speech signal; the number of zero transitions (zero crossings) of the signal within a given period of time; the standardized (normalized) residual error energy, i.e. the ratio of the energy of the prediction error signal to that of the speech signal; and the magnitude of the second maximum of the autocorrelation function of the speech signal or of the prediction error signal. It is also customary to effect a transverse comparison with one or several adjacent speech sections. A clear and comparative representation of the most important classification criteria and methods can be found, for example, in the aforecited reference by L. R. Rabiner et al.
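The conventional criteria listed above can be illustrated with a minimal Python sketch. It is not taken from the cited references. The function names, the 8 kHz sampling rate, the 240-sample frame, the predictor order of 10 and the lag range of 20 to 160 samples are all assumptions chosen for illustration. The normalized residual error energy is obtained here from a Levinson-Durbin recursion on the autocorrelation values, and the "second maximum" is approximated by the largest normalized autocorrelation value within the assumed lag range.

```python
import math
import random


def energy(frame):
    """Energy of one speech section (sum of squared samples)."""
    return sum(s * s for s in frame)


def zero_crossings(frame):
    """Number of sign changes between consecutive samples."""
    return sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))


def autocorrelation(frame, max_lag):
    """Autocorrelation values R[0..max_lag] of the frame (no windowing)."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def normalized_residual_energy(r, order):
    """Ratio of prediction error energy to signal energy.

    Uses the Levinson-Durbin recursion on autocorrelation values r.
    """
    if r[0] == 0:
        return 1.0
    a = [0.0] * (order + 1)  # a[1..order]: predictor coefficients
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
        if err <= 0:  # numerically perfect prediction
            return 0.0
    return err / r[0]


def autocorrelation_peak(r, min_lag, max_lag):
    """Largest normalized autocorrelation value away from lag 0."""
    if r[0] == 0:
        return 0.0
    return max(r[min_lag:max_lag + 1]) / r[0]


def synthetic_frames(length=240, rate=8000, pitch_hz=100, seed=0):
    """Hypothetical test frames: a sine (voiced-like) and noise (unvoiced-like)."""
    rng = random.Random(seed)
    voiced = [math.sin(2 * math.pi * pitch_hz * n / rate) for n in range(length)]
    noise = [rng.uniform(-1.0, 1.0) for _ in range(length)]
    return voiced, noise


if __name__ == "__main__":
    for name, frame in zip(("voiced-like", "unvoiced-like"), synthetic_frames()):
        r = autocorrelation(frame, 160)
        print(name, energy(frame), zero_crossings(frame),
              normalized_residual_energy(r, 10), autocorrelation_peak(r, 20, 160))
```

On such synthetic frames the periodic signal typically shows few zero crossings, a small residual error energy and a strong autocorrelation peak, while the noise shows the opposite. Real speech sections are far less clear-cut.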
A common characteristic of all of these known methods and criteria is that bilateral decisions are always made, in the sense that the speech section is invariably and definitively classified as one or the other possibility depending on whether the pertinent criterion or criteria are satisfied. Even though a relatively high accuracy can be achieved in this manner with a suitable selection or combination of decision criteria, actual practice shows that erroneous decisions still occur with a relatively high frequency and that they affect the quality of the synthesized speech to a significant degree. A main cause of these errors is that speech signals in general are of a varying character in spite of all redundancy, so that it is simply not possible to establish decision thresholds for the criteria that permit a reliable statement in both directions. A certain degree of uncertainty remains and must be accepted.
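The limitation of a bilateral decision can be made concrete with a small Python sketch. The zero-crossing counts below are invented for illustration and are not measured data, and the function names are hypothetical. The sketch forces every section into one of the two classes with a single threshold and then searches for the threshold giving the fewest errors.

```python
def classify_bilateral(value, threshold):
    """Two-sided decision: every section is forced into one of the two classes."""
    return "voiced" if value < threshold else "unvoiced"


def count_errors(voiced_values, unvoiced_values, threshold):
    """Number of sections that the bilateral rule classifies wrongly."""
    wrong_voiced = sum(1 for v in voiced_values
                       if classify_bilateral(v, threshold) != "voiced")
    wrong_unvoiced = sum(1 for u in unvoiced_values
                         if classify_bilateral(u, threshold) != "unvoiced")
    return wrong_voiced + wrong_unvoiced


def best_threshold(voiced_values, unvoiced_values):
    """Try every candidate threshold and return (threshold, errors) with fewest errors."""
    values = list(voiced_values) + list(unvoiced_values)
    candidates = sorted(set(values)) + [max(values) + 1]
    return min(((t, count_errors(voiced_values, unvoiced_values, t))
                for t in candidates), key=lambda pair: pair[1])


# Hypothetical zero-crossing counts per section; the two ranges overlap.
VOICED_COUNTS = [8, 12, 15, 30, 41]
UNVOICED_COUNTS = [25, 38, 60, 75, 90]

if __name__ == "__main__":
    threshold, errors = best_threshold(VOICED_COUNTS, UNVOICED_COUNTS)
    print("best threshold:", threshold, "remaining errors:", errors)
```

Because the two value ranges overlap, no single threshold separates them without error, which is the situation the paragraph above describes for real speech signals.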