This invention relates to electronic processing of speech, and similar one-dimensional signals.
Processing of speech signals corresponds to a very large field. It includes encoding of speech signals, decoding of speech signals, filtering of speech signals, interpolating of speech signals, synthesizing of speech signals, etc. In connection with speech signals, this invention relates primarily to processing speech signals that call for time scaling, interpolating and smoothing of speech signals.
It is well known that speech can be synthesized by concatenating speech units that are selected from a large store of speech units. The selection is made in accordance with various techniques and associated algorithms. Since the number of stored speech units that are available for selection is limited, a synthesized speech that is derived from a catenation of speech units typically requires some modifications, such as smoothing, in order to achieve a speech that sounds continuous and natural. In various applications, time scaling of the entire synthesized speech segment or of some of the speech units is required. Time scaling and smoothing is also sometimes required when a speech signal is interpolated.
Simple and flexible time domain techniques have been proposed for time scaling of speech signals. See, for example, E. Moulines and W. Verhelst, "Time Domain and Frequency Domain Techniques for Prosodic Modification of Speech", in Speech Coding and Synthesis, pp. 519-555, Elsevier, 1995, and W. Verhelst and M Roelands, "An overlap-add techniques based on waveforn similarity (WSOLA) for high quality time-scale modification of speech", Proc. IEEE ICASSP-93, pp. 554-557, 1993.
What has been found is that the quality of time-scaled signal is good for time-scaling factors close to one, but a degradation of the signal is perceived when larger modification factors are required. The degradation is mostly perceived as tonalities and artifacts in the stretched signal. These tonalities do not occur everywhere in the signal. We found that the degradations are mostly localized in areas of transitions of speech, often at the junction of concatenation speech units.