The OLA (OverLap and Add) method is a typical example of a speed conversion method that does not change pitch.
FIG. 1A shows an example of the operation of speed conversion in a related speaker speed conversion system, and shows the original waveform of speech before conversion. FIG. 1B shows an example of the operation of speed conversion in the same related system, and shows the waveform of speech after conversion. In FIGS. 1A and 1B, the horizontal axis is time (sec) and the vertical axis is output voltage (V).
When converting the speed of speech, simply changing the reproduction speed causes the pitch to change, and the speech is therefore not reproduced correctly. For this reason, in OLA, the reproduction time is expanded while the pitch is kept unchanged by lengthening the speech waveform through the following steps.
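The pitch change caused by simply changing the reproduction speed can be illustrated with a minimal sketch. The 200 Hz test tone, the 8000 Hz sampling rate, and the function names `sine`, `naive_stretch`, and `estimate_freq` are assumptions introduced only for this illustration, and counting upward zero crossings is used here as a crude pitch estimate that is adequate only for a clean tone.

```python
import math


def sine(freq_hz, sample_rate, n_samples):
    """Generate a sine tone as a list of floats (illustrative test signal)."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate)
            for n in range(n_samples)]


def naive_stretch(samples, ratio):
    """Make the signal `ratio` times longer by linear-interpolation resampling.

    This corresponds to simply slowing down reproduction: every period of
    the waveform is stretched, so the pitch drops by the same ratio.
    """
    out = []
    for i in range(int(len(samples) * ratio)):
        pos = i / ratio                      # position in the input signal
        k = int(pos)
        frac = pos - k
        nxt = samples[min(k + 1, len(samples) - 1)]
        out.append((1 - frac) * samples[k] + frac * nxt)
    return out


def estimate_freq(samples, sample_rate):
    """Crude frequency estimate: upward zero crossings per second."""
    ups = sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)
    return ups * sample_rate / len(samples)


SAMPLE_RATE = 8000                            # assumed sampling rate (Hz)
original = sine(200, SAMPLE_RATE, SAMPLE_RATE)  # one second of a 200 Hz tone
slowed = naive_stretch(original, 2.0)         # twice as long, same sample rate

f_original = estimate_freq(original, SAMPLE_RATE)
f_slowed = estimate_freq(slowed, SAMPLE_RATE)
print(f_original, f_slowed)
```

In this sketch the slowed signal is twice as long, but its estimated frequency is roughly half that of the original, which is the pitch change that OLA is intended to avoid.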
(1) The speech waveform is divided into frames at appropriate locations (such as zero-cross points), as shown in FIG. 1A. In FIG. 1A, for example, the waveform is divided into five frames at zero-cross points. Although one frame corresponds to one period in FIG. 1A, the method is not limited to this form, and one frame may span two or more periods.
(2) As shown in FIG. 1B, frames are repeated at an appropriate frequency according to a predetermined expansion ratio. In FIG. 1B, for example, frames 1, 3, and 4 are each repeated once.
(3) As shown in FIG. 1B, a cross-fade process is applied before and after each repeated portion so that the waveform is connected smoothly where frames are repeated. In FIG. 1B, for example, the cross-fade process is applied before and after the boundary between the two copies of frame 1, the boundary between the two copies of frame 3, and the boundary between the two copies of frame 4. The cross-fade process is not essential to the OLA method, but is typically carried out to improve sound quality.
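Steps (1) to (3) can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of any of the cited systems: the function names `split_at_zero_crossings` and `ola_stretch`, the `fade` length of 8 samples, and the test tone are hypothetical. As a simplification, the cross-fade is applied only to the samples just after each repeat boundary, where the head of the repeated frame is blended with the head of the frame that originally followed it.

```python
import math


def split_at_zero_crossings(samples):
    """Step (1): cut the waveform into frames at upward zero crossings."""
    cuts = [0]
    for n in range(1, len(samples)):
        if samples[n - 1] < 0 <= samples[n]:
            cuts.append(n)
    cuts.append(len(samples))
    return [samples[a:b] for a, b in zip(cuts, cuts[1:]) if b > a]


def ola_stretch(samples, repeat_indices, fade=8):
    """Steps (2) and (3): repeat selected frames and cross-fade the joins.

    repeat_indices holds 0-based indices of the frames to repeat once.
    fade is the (assumed) number of samples blended after each boundary.
    """
    frames = split_at_zero_crossings(samples)
    out = []
    for k, frame in enumerate(frames):
        out.extend(frame)
        if k not in repeat_indices:
            continue
        repeat = list(frame)
        if k + 1 < len(frames):
            # The original signal continued with the next frame here, so
            # fade from that continuation into the repeated frame.
            nxt = frames[k + 1]
            n = min(fade, len(repeat), len(nxt))
            for i in range(n):
                w = (i + 1) / (n + 1)        # weight of the repeated frame
                repeat[i] = (1 - w) * nxt[i] + w * frame[i]
        out.extend(repeat)
    return out


def count_upward_crossings(samples):
    """Number of upward zero crossings, used as a rough period count."""
    return sum(1 for a, b in zip(samples, samples[1:]) if a < 0 <= b)


# Five periods of a 200 Hz tone at an assumed 8000 Hz sampling rate.
signal = [math.sin(2 * math.pi * 200 * n / 8000) for n in range(200)]
frames = split_at_zero_crossings(signal)

# Frames 1, 3, and 4 in the 1-based numbering of FIG. 1B are indices 0, 2, 3.
stretched = ola_stretch(signal, repeat_indices={0, 2, 3})
print(len(frames), len(signal), len(stretched))
```

Because whole periods are repeated rather than stretched, the output in this sketch is longer than the input while each period keeps its original length, so the pitch is maintained.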
The related art is disclosed in JP-A-2006-038956, JP-A-2007-003682, JP-A-2006-126372, and JP-A-2000-322061.
When frame boundary detection based on zero-cross points or a correlation function is used, however, a problem arises in that sound quality deteriorates in portions containing many high-frequency components, such as the beginnings of words.
When frame boundary detection based on pitch detection is used, a problem arises in that frame detection becomes unstable in portions where the pitch is unstable, and applying the OLA process to such portions results in a breakdown in sound quality.