This invention relates to electronic processing of speech and of similar one-dimensional signals.
Processing of speech signals is a very large field. It includes encoding, decoding, filtering, interpolating, and synthesizing of speech signals, among other operations. In connection with speech signals, this invention relates primarily to processing that calls for time scaling, interpolating, and smoothing of speech signals.
It is well known that speech can be synthesized by concatenating speech units that are selected from a large store of speech units. The selection is made in accordance with various techniques and associated algorithms. Since the number of stored speech units that are available for selection is limited, synthesized speech that is derived from a concatenation of speech units typically requires some modification, such as smoothing, in order to sound continuous and natural. In various applications, time scaling of the entire synthesized speech segment, or of some of the speech units, is required. Time scaling and smoothing are also sometimes required when a speech signal is interpolated.
Simple and flexible time domain techniques have been proposed for time scaling of speech signals. See, for example, E. Moulines and W. Verhelst, "Time Domain and Frequency Domain Techniques for Prosodic Modification of Speech", in Speech Coding and Synthesis, pp. 519-555, Elsevier, 1995, and W. Verhelst and M. Roelands, "An overlap-add technique based on waveform similarity (WSOLA) for high quality time-scale modification of speech", Proc. IEEE ICASSP-93, pp. 554-557, 1993.
It has been found that the quality of the time-scaled signal is good for time-scaling factors close to one, but that a degradation of the signal is perceived when larger modification factors are required. The degradation is mostly perceived as tonalities and artifacts in the stretched signal. These tonalities do not occur everywhere in the signal. We found that the degradations are mostly localized in areas of speech transitions, often at the junctions of concatenated speech units.
We discovered that the aforementioned artifacts problem is related to the level of stationarity of the speech signal within a small interval, or window. In particular, we discovered that speech signal portions that are highly non-stationary cause artifacts when they are scaled and/or smoothed. We concluded, therefore, that the level of non-stationarity of the speech signal is a useful parameter to employ when performing time scaling of synthesized speech and that, in general, it is not desirable to modify or smooth highly non-stationary areas of speech, because doing so introduces artifacts in the resulting signal. To that end, a measure of the speech signal's non-stationarity must be developed.
A simple yet useful indicator of non-stationarity is provided by the transition rate of the RMS value of the speech signal. Another measure of non-stationarity that is useful for controlling time scaling of the speech signal is the transition rate of spectral parameters, normalized to lie between 0 and 1. An improved measure of non-stationarity that is useful for controlling time scaling of the speech signal is provided by a combination of the transition rates of the RMS value of the speech signal and of the line spectral frequencies (LSFs), normalized to lie between 0 and 1.
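As an illustration only, the RMS-based indicator might be sketched as follows. The text above does not give a formula for the transition rate, so the normalized difference used here (the absolute change in RMS between the frames on either side of the current frame, divided by their sum) is an assumption. The function names `frame_rms` and `rms_transition_rate`, the frame length, and the hop size are likewise hypothetical choices made for the example.

```python
import math


def frame_rms(signal, frame_len, hop):
    """Return the RMS value of each full frame of `signal`."""
    values = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        values.append(math.sqrt(sum(x * x for x in frame) / frame_len))
    return values


def rms_transition_rate(rms, eps=1e-12):
    """Assumed definition: |E[n+1] - E[n-1]| / (E[n+1] + E[n-1]).

    RMS values are non-negative, so each rate lies between 0 and 1.
    At the first and last frame the missing neighbour is replaced by
    the frame itself.
    """
    rates = []
    last = len(rms) - 1
    for n in range(len(rms)):
        prev = rms[max(n - 1, 0)]
        nxt = rms[min(n + 1, last)]
        rates.append(abs(nxt - prev) / (nxt + prev + eps))
    return rates


# Toy signal: a quiet sinusoid followed abruptly by a loud one.
quiet = [0.01 * math.sin(2 * math.pi * k / 20) for k in range(400)]
loud = [1.0 * math.sin(2 * math.pi * k / 20) for k in range(400)]
rms = frame_rms(quiet + loud, frame_len=100, hop=100)
rates = rms_transition_rate(rms)
```

In this toy signal the rate stays near 0 inside the quiet and loud stretches and approaches 1 for the two frames that straddle the jump in level. Under the assumption above, those are the frames that a time-scaling procedure would leave unmodified or modify only slightly.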