1. Field of the Invention
The present invention relates to a method of and an apparatus for performing time-scale modification of a speech signal, whereby the time duration of the speech signal is changed without changing the fundamental frequency components of the speech signal.
2. Description of the Related Art
Conventionally, a speech time-scale modification apparatus has been used to play back a speech signal recorded on audio tape or the like at a higher or lower speed for listeners.
One such speech time-scale modification apparatus is disclosed in U.S. Pat. No. 3,786,195, "VARIABLE DELAY LINE SIGNAL PROCESSOR FOR SOUND REPRODUCTION." This speech time-scale modification apparatus includes a variable delay line, a ramp level and amplitude changer, a blanking circuit, a blanking pulse generator, and a ramp pulse-train generator.
The operation of the speech time-scale modification apparatus having the above configuration will be described below.
First, an input signal is written into the variable delay line. Next, the ramp pulse-train generator controls the ramp level and amplitude changer and the blanking pulse generator in accordance with the time-scale modification ratio. The ramp level and amplitude changer then reads the input signal from the variable delay line at a speed that differs from the writing speed in accordance with the time-scale modification ratio. Specifically, for playback of a speech signal at a higher speed, reading is done at a lower rate than writing, and for playback of a speech signal at a lower speed, reading is done at a higher rate than writing. At the discontinuous portions between blocks, the blanking circuit mutes the output of the variable delay line.
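The read-out described above can be illustrated with a minimal digital sketch. It is only an analogy, not the circuit of the cited patent: the block length `block`, the muting length `blank`, the use of linear interpolation, and both function names are assumptions made for illustration, and only the higher-speed playback case (reading slower than writing) is covered.

```python
def read_block(buf, rate):
    """Read len(buf) output samples from buf, advancing `rate` input
    samples per output sample (linear interpolation is an assumption)."""
    out = []
    last = len(buf) - 1
    for m in range(len(buf)):
        pos = m * rate                 # fractional read position
        i = int(pos)
        frac = pos - i
        nxt = buf[min(i + 1, last)]    # clamp at the end of the block
        out.append((1.0 - frac) * buf[i] + frac * nxt)
    return out


def slowed_read_with_blanking(x, rate, block, blank):
    """Hypothetical model: each written block is read back at `rate` (<= 1),
    so the tail of the block is never read; the first `blank` samples of
    every later block are muted to hide the jump between blocks."""
    if not 0.0 < rate <= 1.0:
        raise ValueError("this sketch only covers read rates in (0, 1]")
    out = []
    for start in range(0, len(x), block):
        segment = read_block(x[start:start + block], rate)
        if start > 0:
            for m in range(min(blank, len(segment))):
                segment[m] = 0.0       # muting at the block boundary
        out.extend(segment)
    return out
```

With `rate = 0.5`, only the first half of each written block reaches the output, and every block boundary after the first begins with `blank` muted samples.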
With the above configuration, however, problems arise when the speed is increased. The intelligibility of consonants and the like degrades because of data decimation. Furthermore, since muting is performed at the discontinuous portions between blocks, discontinuities are introduced in the signal amplitude, and the reproduced speech lacks naturalness.
Another technique of speech time-scale modification is disclosed in "Real-Time Implementation of Time Domain Harmonic Scaling of Speech for Rate Modification and Coding" by R. V. Cox et al., IEEE TRANSACTIONS ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL. ASSP-31, No. 1, pp. 258-272, February 1983.
This speech time-scale modification technique is called Time Domain Harmonic Scaling (TDHS). In TDHS, a pitch period $p$ is extracted from an input signal $S(n)$, and the input signal and the input signal shifted by one pitch period are weighted with a triangular window ($W_c(n)$ or $W_e(n)$) and added, so as to obtain an output signal ($S_c(n)$ or $S_e(n)$):
$$S_c(n) = W_c(n)\,S(n) + [1 - W_c(n)]\,S(n+p) \quad \text{(time-scale compression)}$$
$$S_e(n) = W_e(n)\,S(n) + [1 - W_e(n)]\,S(n-p) \quad \text{(time-scale expansion)}$$
Herein, the triangular window ($W_c(n)$ or $W_e(n)$) is obtained from the following equations:
$$W_c(n) = 1 - \frac{n}{B_c - 1}, \quad n = 0, 1, \ldots, B_c \quad \text{(time-scale compression)}$$
$$W_e(n) = 1 - \frac{n}{B_e - 1}, \quad n = 0, 1, \ldots, B_e \quad \text{(time-scale expansion)}$$
where the window length is determined by the following equations:
$$B_c = \frac{p}{1/\alpha - 1} \quad \text{(time-scale compression)}$$
$$B_e = \frac{\alpha p}{\alpha - 1} \quad \text{(time-scale expansion)}$$
where
$B_c$: window length (time-scale compression),
$B_e$: window length (time-scale expansion),
$p$: pitch period,
$\alpha$: time-scale modification ratio = (output time duration)/(input time duration).
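The TDHS equations above can be sketched for a single block as follows. This is a minimal illustration under stated assumptions, not the implementation of the cited paper: the function names, the `start` offset, the rounding of the window length to whole samples, and the window index range n = 0 to B-1 (chosen so that the weights stay within 0 to 1) are all assumptions.

```python
def window_length(p, alpha):
    """Window length in samples for pitch period p (samples) and ratio alpha.
    Rounding to an integer is an assumption of this sketch."""
    if alpha <= 0.0 or alpha == 1.0:
        raise ValueError("alpha must be positive and different from 1")
    if alpha < 1.0:
        return round(p / (1.0 / alpha - 1.0))   # B_c = p / (1/alpha - 1)
    return round(alpha * p / (alpha - 1.0))     # B_e = alpha * p / (alpha - 1)


def triangular_window(length):
    """W(n) = 1 - n / (length - 1), for n = 0 .. length-1 (assumed range)."""
    if length < 2:
        raise ValueError("window length must be at least 2 samples")
    return [1.0 - n / (length - 1) for n in range(length)]


def tdhs_compress_block(s, start, p, length):
    """S_c(n) = W_c(n) S(n) + [1 - W_c(n)] S(n + p), n counted from `start`."""
    w = triangular_window(length)
    return [w[n] * s[start + n] + (1.0 - w[n]) * s[start + n + p]
            for n in range(length)]


def tdhs_expand_block(s, start, p, length):
    """S_e(n) = W_e(n) S(n) + [1 - W_e(n)] S(n - p), n counted from `start`."""
    if start < p:
        raise ValueError("expansion needs p samples of history before start")
    w = triangular_window(length)
    return [w[n] * s[start + n] + (1.0 - w[n]) * s[start + n - p]
            for n in range(length)]
```

In this sketch a compression block reads `length + p` input samples and emits `length` output samples, while an expansion block emits `length` output samples as its window slides over `length - p` input samples. When the signal is exactly periodic with period `p`, both blocks reproduce the input waveform unchanged, which is the premise on which the technique relies.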
TDHS relies on the pitch period, but it is difficult to extract the pitch period accurately. In particular, it is extremely difficult to extract a pitch period from a music signal or from a signal with superposed noise. As a result, it is difficult to sample the input signal using the window length ($B_c$ or $B_e$), which is set in terms of the pitch period $p$, and an output signal of good quality cannot be obtained by overlapping or connecting input signals sampled on the basis of an incorrect pitch period.
Furthermore, TDHS processing presumes that the input signal sampled with a triangular window has a constant pitch period within that window. In reality, however, when the time-scale modification ratio $\alpha$ is in the neighborhood of 1, the window becomes long (for example, $B_c = 9p$ for $\alpha = 0.9$ and $B_e = 11p$ for $\alpha = 1.1$), and the pitch period of speech is unlikely to stay constant over such a long time segment. This results in further degradation of sound quality.
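The window lengths quoted above follow directly from the window-length equations, as the short calculation below shows. The additional values $\alpha = 0.99$ and $\alpha = 1.01$ are illustrative assumptions, not figures from the cited paper.

$$B_c = \frac{p}{1/\alpha - 1} = \frac{\alpha p}{1 - \alpha}: \qquad \alpha = 0.9 \Rightarrow B_c = \frac{0.9\,p}{0.1} = 9p, \qquad \alpha = 0.99 \Rightarrow B_c = \frac{0.99\,p}{0.01} = 99p$$
$$B_e = \frac{\alpha p}{\alpha - 1}: \qquad \alpha = 1.1 \Rightarrow B_e = \frac{1.1\,p}{0.1} = 11p, \qquad \alpha = 1.01 \Rightarrow B_e = \frac{1.01\,p}{0.01} = 101p$$

In both cases the denominator tends to zero as $\alpha$ approaches 1, so the window length grows without bound.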
Moreover, since the entire output signal is constructed from input signals sampled and weighted with triangular windows, the process involves a large number of processing steps, and sound quality degrades significantly as a result of the processing.