1. Field of the Invention
The present invention relates to speech speed conversion. More particularly, the invention relates to a speech speed converting device and a speech speed converting method that change the speed of the voice in a signal containing voice without degrading the voice quality and without changing the voice characteristics.
2. Description of the Related Art
A speech speed converting device is used in a telephone system or a voice reproducing system. By changing the speed of the voice at the time of reproducing a received voice or a recorded voice, a user can listen to the received content or the recorded content at a speed convenient for the user. For example, when a person at the other end of the line speaks quickly and a person who receives the call cannot easily understand the voice, the speed of the speech is decreased in real time or at the reproduction time. With this arrangement, the listener can understand the speech content easily. On the other hand, by increasing the speed of the voice at the reproduction time, the recorded content can be heard in a time shorter than the actual recording time.
FIG. 1 shows one example of a speech speed converting device that is applied to a voice communication system such as a telephone.
In FIG. 1, a receiving unit 10 of the telephone receives a voice code via a digital line or the like. A decoding unit 11 decodes the voice code into a voice waveform signal. A speech speed converting unit 12 including a speech speed converting device converts the voice waveform signal into a voice waveform signal of a slower speed, for example. An output unit 13 such as a receiver outputs the received voice to the outside. In the present example, the decoding unit 11 restores the voice code into the voice waveform before the speed is converted. Alternatively, the speech speed converting unit 12 can directly convert the speed of the voice code received by the receiving unit 10, after which the speed-converted voice code is decoded and the decoded voice is input to the output unit 13.
As a method of converting the speech speed, time-domain harmonic scaling (TDHS) is widely known. According to TDHS, the waveform of the voice whose speed is to be changed is repeated or thinned in units of the fundamental period (i.e., the pitch cycle), thereby adjusting the speed. There are also improved methods of repeating or thinning the waveform to convert the speech speed. In one example, voice is classified into several kinds, and the speed converting method is switched according to the class.
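The repetition and thinning described above can be sketched as follows. This is a minimal Python illustration and not the TDHS algorithm itself, since practical TDHS blends adjacent pitch periods with weighting windows rather than copying or deleting them verbatim. The function name `change_speed_by_pitch`, the `every` parameter, and the assumption of a known, constant pitch period are hypothetical and introduced only for illustration.

```python
def change_speed_by_pitch(samples, pitch_period, mode, every=2):
    """Repeat or drop one pitch cycle out of every `every` cycles.

    mode="repeat" lengthens the signal (slower speech);
    mode="thin" shortens it (faster speech).
    Assumption: pitch_period (in samples) is known and constant.
    """
    if pitch_period <= 0 or every <= 0:
        raise ValueError("pitch_period and every must be positive")
    if mode not in ("repeat", "thin"):
        raise ValueError("mode must be 'repeat' or 'thin'")

    out = []
    cycle_count = 0
    for start in range(0, len(samples), pitch_period):
        cycle = samples[start:start + pitch_period]
        cycle_count += 1
        # Only complete cycles are repeated or dropped; a short tail is copied.
        is_target = cycle_count % every == 0 and len(cycle) == pitch_period
        if mode == "thin" and is_target:
            continue                  # delete this pitch cycle
        out.extend(cycle)
        if mode == "repeat" and is_target:
            out.extend(cycle)         # emit this pitch cycle a second time
    return out


# Four pitch cycles of two samples each.
signal = [0, 1, 2, 3, 4, 5, 6, 7]
slower = change_speed_by_pitch(signal, pitch_period=2, mode="repeat")
faster = change_speed_by_pitch(signal, pitch_period=2, mode="thin")
```

With `every=2`, repetition lengthens the signal to about 1.5 times its original duration, and thinning shortens it to about half.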
FIG. 2 shows one example of a configuration of a conventional speech speed converting device using a voice waveform.
In the present example, a voice classifying unit 20 classifies an input voice waveform into “voiced sound” and “unvoiced sound”. When the input voice waveform is “voiced sound”, a pitch cycle calculating unit 21 calculates a pitch cycle of the “voiced sound”. A voice speed converting unit 22 adjusts the speed of the voice by repeating or thinning the input “voiced sound” waveform based on the pitch cycle calculated by the pitch cycle calculating unit 21.
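The roles of the voice classifying unit 20 and the pitch cycle calculating unit 21 can be illustrated with a short Python sketch. The source does not state how the classification or the pitch calculation is performed; the autocorrelation approach, the function name `estimate_pitch`, and the threshold value 0.3 are assumptions chosen only for illustration.

```python
import math
import random


def estimate_pitch(frame, min_lag, max_lag, voiced_threshold=0.3):
    """Return (is_voiced, pitch_lag) for one frame.

    Assumption: a frame is treated as "voiced sound" when its strongest
    autocorrelation peak in [min_lag, max_lag], divided by the frame
    energy, reaches voiced_threshold.
    """
    energy = sum(s * s for s in frame)
    if energy == 0.0:
        return False, None

    best_lag, best_value = None, 0.0
    last_lag = min(max_lag, len(frame) - 1)
    for lag in range(min_lag, last_lag + 1):
        value = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
        if value > best_value:
            best_lag, best_value = lag, value

    if best_lag is None or best_value / energy < voiced_threshold:
        return False, None        # "unvoiced sound": no usable pitch cycle
    return True, best_lag         # "voiced sound": pitch cycle in samples


# A periodic tone stands in for "voiced sound", white noise for "unvoiced sound".
tone = [math.sin(2.0 * math.pi * n / 40.0) for n in range(400)]
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(400)]

tone_result = estimate_pitch(tone, min_lag=20, max_lag=100)
noise_result = estimate_pitch(noise, min_lag=20, max_lag=100)
```

The pitch lag returned for a voiced frame would then be passed to the repeating or thinning step carried out by the voice speed converting unit 22.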
According to the following patent literature 1, voice is classified into “vowel sound”, “voiced consonant”, “unvoiced consonant”, and “silence”. The speed of the “vowel sound” and the “voiced consonant” is converted by repeating or thinning the voice waveform in a pitch cycle. Depending on the characteristic of the consonant, the “unvoiced consonant” is either left without expansion or contraction, or its speed is converted by repeating or deleting the waveform to have a predetermined length. The speed of the “silence” is converted by repeating or deleting the waveform to have a predetermined length.
According to the following patent literature 2, voice is classified into “voiced sound”, “unvoiced sound”, and “silence”. The speed of the “voiced sound” is converted by repeating or thinning the voice waveform in a pitch cycle. The “unvoiced sound” is not processed, and the speed of the “silence” is converted by expanding or contracting the waveform at a predetermined magnification.
According to the following patent literature 3, voice is classified into “voiced sound”, “unvoiced sound”, and “silence”. The speed of the “voiced sound” is converted by repeating or thinning the voice waveform in a pitch cycle. The speed of the “unvoiced sound” is converted by repeating or thinning the voice waveform in a fixed cycle (i.e., a pseudo pitch). The speed of the “silence” is converted by repeating or thinning the waveform following a predetermined expansion and contraction rate.
FIG. 3 shows one example of a configuration of a conventional speech speed converting device using a voice code.
In the present example, a residual signal and a linear predictive coefficient of an input voice are obtained in advance based on a linear predictive analysis of the input voice. A pitch cycle calculating unit 30 calculates a pitch cycle of an input signal using the residual signal. A voice production speed converting unit 31 outputs a residual signal that is repeated or thinned based on the calculated pitch cycle, thereby converting the speed, and gives the speed conversion information to a linear predictive coefficient correcting unit 32.
The linear predictive coefficient correcting unit 32 corrects and outputs a linear predictive coefficient corresponding to the residual signal that is repeated or thinned based on the speed conversion information. A combining unit 33 filters the residual signal input from the voice production speed converting unit 31 using the linear predictive coefficient given from the linear predictive coefficient correcting unit 32, and outputs the speed-converted voice waveform.
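The flow of FIG. 3 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the conventional device itself: the frame-wise autocorrelation method with the Levinson-Durbin recursion is assumed for the linear predictive analysis, the "correction" of the coefficients is reduced to keeping each frame's coefficient set attached to its lengthened residual, and all function names are hypothetical.

```python
import math
import random


def levinson(r, order):
    """Levinson-Durbin recursion: predictor a[k] with x[n] ~ sum a[k]*x[n-1-k]."""
    a = [0.0] * (order + 1)               # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err <= 1e-12:                  # frame is silent or fully predicted
            break
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:]


def analyze(x, frame_len, order):
    """Return one (coefficients, residual) pair per frame."""
    frames = []
    for start in range(0, len(x), frame_len):
        frame = x[start:start + frame_len]
        r = [sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))
             for lag in range(order + 1)]
        a = levinson(r, order)
        residual = []
        for n in range(start, start + len(frame)):
            pred = sum(a[k] * x[n - 1 - k] for k in range(order) if n - 1 - k >= 0)
            residual.append(x[n] - pred)
        frames.append((a, residual))
    return frames


def stretch(frames, pitch_period):
    """Repeat the first pitch cycle of each frame's residual.

    The coefficient set stays paired with the lengthened residual, which
    stands in for the coefficient correction of unit 32.
    """
    return [(a, residual[:pitch_period] + residual) for a, residual in frames]


def synthesize(frames):
    """Filter each residual with its own coefficients (combining unit 33)."""
    y = []
    for a, residual in frames:
        for e in residual:
            n = len(y)
            pred = sum(a[k] * y[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            y.append(e + pred)
    return y


rng = random.Random(1)
voice = [math.sin(2.0 * math.pi * n / 40.0) + 0.1 * rng.uniform(-1.0, 1.0)
         for n in range(320)]
frames = analyze(voice, frame_len=160, order=8)
restored = synthesize(frames)                  # no speed change
slowed = synthesize(stretch(frames, 40))       # one extra pitch cycle per frame
```

Without any repetition, synthesis reproduces the input up to rounding error, because the synthesis filter inverts the analysis filter. Quality problems arise only once the residual and the coefficients no longer correspond.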
The following patent literature 4 describes a method that carries out a linear predictive analysis to separate the input voice into a linear predictive coefficient and a predictive residual signal, and that repeats or thins the predictive residual signal, which has a strong pitch, in a pitch cycle, thereby preventing the degradation that results from a pitch extraction error. When the linear predictive analysis is used, with a view to improving the precision of the pitch analysis, the pitch is extracted using the predictive residual, in which the pitch appears more strongly than in the voice waveform. The predictive residual is then repeated or thinned in the extracted pitch cycle.
The following patent literature 5 describes a method of converting the speed by extending a multi-path sound source by filling “0” using a voice code, or by shortening the sound source by cutting “0”.
(Patent literature 1) Japanese Patent Publication No. 2612868
(Patent literature 2) Japanese Patent Publication No. 3327936
(Patent literature 3) Japanese Patent Publication No. 3439307
(Patent literature 4) Japanese Patent Application Unexamined Publication No. 11-311997
(Patent literature 5) Japanese Patent Publication No. 3285472
However, the above conventional techniques have the following problems.
(1) Problems that arise when the speed is converted using the voice waveform
According to the patent literature 1, in the “unvoiced consonant”, waveforms of sections other than those discriminated as “liquid sound”, “plosive and affricate sound”, and “burst” are repeated or thinned. Therefore, there is a problem that a cyclicity that is not originally present appears due to the repetition or thinning of the waveform, and the voice quality is degraded.
According to the patent literature 2, the “unvoiced sound” is not processed. Therefore, there is a problem that when the other sections are expanded or contracted, the balance between their length and that of the “unvoiced sound” is destroyed, and the voice quality is degraded. In addition, because fewer sections can be expanded or contracted, a large expansion or contraction cannot be achieved. According to the patent literature 3, because the “unvoiced sound” is thinned or repeated in a fixed cycle (i.e., a pseudo pitch), there is a problem that a cyclicity that is not originally present appears, and the voice quality is degraded.
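The appearance of a cyclicity that is not originally present can be checked numerically. The following Python sketch, given only as an illustration, uses white noise as a stand-in for “unvoiced sound” and repeats it in a fixed cycle; the cycle length of 50 samples and the function names are assumptions and are not taken from the patent literature.

```python
import random


def normalized_autocorr(x, lag):
    """Autocorrelation of x at the given lag, divided by the signal energy."""
    energy = sum(s * s for s in x)
    if energy == 0.0:
        return 0.0
    return sum(x[n] * x[n + lag] for n in range(len(x) - lag)) / energy


def repeat_fixed_cycle(x, cycle):
    """Emit every block of `cycle` samples twice (a pseudo pitch repetition)."""
    out = []
    for start in range(0, len(x), cycle):
        block = x[start:start + cycle]
        out.extend(block)
        out.extend(block)
    return out


rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(800)]   # aperiodic input
stretched = repeat_fixed_cycle(noise, cycle=50)

before = normalized_autocorr(noise, 50)       # near zero for white noise
after = normalized_autocorr(stretched, 50)    # roughly one half after repetition
```

In the repeated signal, every sample in the first copy of a block is identical to the sample 50 positions later, so about half of the products in the autocorrelation sum are squares of the same value. The resulting peak at the repetition lag corresponds to the artificial cyclicity that degrades the voice quality.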
(2) Problems that arise when the speed is converted using a voice code such as one obtained by a linear predictive analysis
According to the patent literature 4, there is a problem that, in the unvoiced section, in which no definite pitch cycle is present, the extracted pitch is indefinite (i.e., the pitch value varies between extremely large and extremely small values), so that the repetition or thinning is carried out over an extremely long or short section. As a result, a mismatch occurs between a linear predictive coding (LPC) coefficient and the predictive residual in the section where the LPC coefficient changes, thereby degrading the voice quality.
According to the patent literature 5, a multi-path sound source is extended by filling “0” using a voice code, or is shortened by cutting “0”. However, there is a problem that the speed cannot be adjusted in the unvoiced section, where there is no pitch. Therefore, the balance between the length of the unvoiced section and that of the other sections that are expanded or contracted is destroyed, and the voice quality is degraded. In addition, when “0” is filled, the number of expandable or contractible sections decreases. Consequently, a large expansion or contraction cannot be achieved.