Our invention relates to the artificial production of speech or other complex waves and, more particularly, to the synthesis of speech signals from parametric description codes representative of selected speech characteristics.
The production of good-quality synthetic speech is of interest in speech research, in communication systems where conservation of bandwidth is important, and in computer and related systems where voice output is desired. One known speech synthesis technique is based on the fact that an applied speech wave at any instant of time is approximately a weighted sum of its past values, whereby speech parameter signals can be developed which specify the linearly predictable characteristics of a speech signal. The parameter signals are utilized to control a discrete linear time-varying filter which is excited by a suitable combination of quasi-periodic pulses and white noise. The quasi-periodic pulses result in voiced excitation, while the white noise results in unvoiced excitation. An excitation adjustment amplifier is interposed between the excitation source and the filter, and the gain of the amplifier is controlled in accordance with the mean squared value of each segment of the speech signal to provide natural-sounding speech.
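The synthesis technique described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited arrangements; the function names, the use of unit pulses for voiced excitation, and the Gaussian noise source for unvoiced excitation are assumptions made only for illustration.

```python
import random


def synthesize_segment(coeffs, excitation, gain, memory):
    """Pass one segment of excitation through an all-pole prediction filter.

    coeffs:     predictor weights a_1..a_p applied to past output samples
    excitation: excitation samples for this segment
    gain:       gain of the excitation adjustment amplifier
    memory:     the last p output samples, most recent first
    Returns (output samples, updated filter memory).
    """
    memory = list(memory)
    out = []
    for e in excitation:
        # Each sample is the scaled excitation plus a weighted sum of past samples.
        s = gain * e + sum(a * m for a, m in zip(coeffs, memory))
        out.append(s)
        memory = ([s] + memory)[:len(coeffs)]
    return out, memory


def voiced_excitation(length, pitch_period):
    """Quasi-periodic pulses (assumed here to be unit pulses) for voiced sounds."""
    return [1.0 if n % pitch_period == 0 else 0.0 for n in range(length)]


def unvoiced_excitation(length, seed=0):
    """White noise (assumed Gaussian here) for unvoiced sounds."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]
```

In this sketch the filter memory returned for one segment is passed in as the starting memory of the next, so energy from one segment carries over into the following one.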
The speech signal comprises a sequence of pitch period segmented speech samples. In any speech segment, the nth speech sample comprises a first component representative of the contribution of the memory of the prediction filter carried over from previous speech segments and a second component contributed by the excitation in the current speech segment. The gain of the excitation amplifier is adjusted to account for the presence of the overhang energy from the previous speech segment on the basis of the aforementioned first and second components and the mean squared value of the current speech segment. Thus, it is necessary to generate the excitation level adjustment factor of the current speech segment prior to the formation of the speech samples of the current speech segment. The information required for the generation of the overhang energy adjustment factor, however, includes the speech samples being formed in the current speech segment.
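One way to picture the gain adjustment is the following Python sketch. It assumes, for illustration only, that the gain is chosen so that the mean squared value of the segment matches a target value. Because the filter is linear, each sample can be written as q_n + g·u_n, where q_n is the memory (overhang) component, u_n is the response to the unscaled excitation, and g is the gain, so g follows from a quadratic equation. The function names and the fallback to zero gain when the overhang energy alone exceeds the target are hypothetical choices, not part of the described arrangements.

```python
import math


def filter_response(coeffs, excitation, memory):
    """All-pole prediction filter output for a given excitation and start memory."""
    memory = list(memory)
    out = []
    for e in excitation:
        s = e + sum(a * m for a, m in zip(coeffs, memory))
        out.append(s)
        memory = ([s] + memory)[:len(coeffs)]
    return out


def overhang_adjusted_gain(coeffs, excitation, memory, target_mean_square):
    """Gain g such that mean((q_n + g*u_n)^2) equals target_mean_square."""
    n = len(excitation)
    # First component: filter memory carried over, with no new excitation.
    q = filter_response(coeffs, [0.0] * n, memory)
    # Second component: unit-gain excitation, with the memory cleared.
    u = filter_response(coeffs, excitation, [0.0] * len(coeffs))
    a = sum(x * x for x in u)
    b = 2.0 * sum(x * y for x, y in zip(q, u))
    c = sum(x * x for x in q) - n * target_mean_square
    if a == 0.0:
        return 0.0  # no excitation energy to scale
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        return 0.0  # assumed fallback: no real gain reaches the target
    return max(0.0, (-b + math.sqrt(disc)) / (2.0 * a))
```

Note that the sketch runs the filter over the whole segment twice before any output sample can be scaled, which reflects the difficulty stated above: the adjustment factor depends on samples of the segment it is meant to control.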
In vocoder-type systems utilizing linear predictive coding, coded signals representative of the predictive parameters of a pitch period of an applied speech signal, the pitch period of the speech signal, and the overhang energy adjustment factor (excitation level adjustment signal) are produced in a speech analyzer responsive to the applied speech signal. The coded signals are then transmitted to a speech synthesizer of the aforementioned type to control the generation of speech samples. The resulting speech samples are applied to a low-pass filter from which a replica of the applied speech signal is obtained. Such an arrangement is disclosed in U.S. Pat. No. 3,624,302, issued to B. S. Atal on Nov. 30, 1971.
While the overhang energy adjustment factor in the aforementioned Atal patent is produced prior to synthesis, it is often preferred to generate the overhang energy adjustment factor in the speech synthesizer. One arrangement in which this is done is shown in U.S. Pat. No. 3,715,512, issued to J. M. Kelly on Dec. 20, 1971. The Kelly arrangement requires that a frequency-compressed auxiliary spectral envelope of an applied speech signal be generated via a linear predictive analyzer and synthesizer at a substantially reduced excitation rate to achieve economies in bandwidth. An overhang energy adjustment factor computer is included which is responsive to the RMS value of the pitch period speech signal, the prediction parameters of the pitch period, and a prescribed set of speech samples from the just-concluded pitch period to generate the excitation level (overhang energy) adjustment factor for use in the reduced-rate synthesizer. Since the adjustment factor computer is operative at the real-time sampling rate while the speech synthesizer operates at a substantially lower rate, the adjustment factor can be readily computed and made available to provide the necessary gain modification of the synthesizer excitation amplifier. In reconstructing the speech signal from the frequency-compressed auxiliary spectral envelope in a second speech synthesizer, however, the prediction parameters are available only at the substantially lower excitation rate, whereby the overhang energy adjustment factor can only be modified at the lower rate. Since the adjustment factor cannot be modified at the pitch period excitation rate, the adjustment factor is incorrect for a substantial number of pitch periods and the resulting speech signal replica is not accurate.
In vocoder systems such as in aforementioned U.S. Pat. No. 3,624,302, the speech synthesizer is operative to resynthesize a speech signal after linear prediction analysis so that the overhang energy adjustment factor may be readily formed in the vocoder speech analyzer. Where the speech synthesizer is operative at lower than real-time rates such as in aforementioned U.S. Pat. No. 3,715,512, it is relatively simple to compute the overhang energy adjustment factor prior to the formation of the low-rate speech samples. In some linear prediction synthesizers operative at real-time pitch period rates, however, the overhang energy adjustment factor must be formed during synthesis. This is the case, for example, in synthesis-by-rule systems in which an artificial speech signal is produced responsive to stored phonetic descriptive codes. While it is theoretically possible to generate the overhang energy adjustment factor for a pitch period prior to the formation of the first speech sample of said pitch period, it is generally impractical to do so at real-time pitch period rates because of the large number of processing steps required and the limited time available.
It is an object of the invention to provide speech synthesis on the basis of segmented parametric description codes at pitch period rates in an economical manner.