1. Field of the Invention
The present invention generally relates to apparatuses and methods for changing reproduction speeds of speech sounds. More particularly, the present invention relates to an apparatus and a method for changing reproduction speed of speech sound without changing the pitch of the sound.
2. Description of the Related Art
Techniques have conventionally been suggested in which the reproduction speed of speech sound is reduced without changing the pitch of the sound so that the contents of conversation can be heard easily. However, if the reproduction speed of speech sound is simply reduced, the output falls behind the input and a delay accumulates.
In order to solve this delay problem, a technique has been suggested in which a silent section (no-sound section) existing in the conversation is shortened or the reproduction speed of speech sound in the silent section is increased.
FIG. 1 is a block diagram of an example of a related art apparatus for changing reproduction speed of speech sound. Referring to FIG. 1, a digital sound signal is input to a terminal 10 in frame units, one frame being 20 ms, and is supplied to a sound activity determination part 11 and a part 12 for changing reproduction speed of speech sound.
The sound activity determination part 11 analyzes the noise level during an initial silent period, such as the time when conversation is started, and sets a level based on the analyzed noise level, for example the noise level +4 dB, as a sound threshold value. The sound activity determination part 11 compares the input sound signal with the sound threshold value and determines that a section where the sound signal is equal to or greater than the sound threshold value is a sound determining section. The sound activity determination part 11 also supplies the result of the determination to a part 13 for determining reproduction speed of speech sound.
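The threshold comparison described above can be sketched as follows. This is only an illustrative sketch: the function names, the use of mean frame power in dB, and the averaging of the initial silent frames are assumptions made for the example and are not taken from the related art apparatus.

```python
import math

MARGIN_DB = 4.0  # the "+4 dB" margin above the analyzed noise level


def frame_power_db(samples):
    """Mean power of one frame in dB (the floor avoids taking log of zero)."""
    mean_sq = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_sq, 1e-12))


def sound_threshold_db(initial_silent_frames, margin_db=MARGIN_DB):
    """Assumed rule: average level of the initial silent frames plus the margin."""
    levels = [frame_power_db(f) for f in initial_silent_frames]
    return sum(levels) / len(levels) + margin_db


def is_sound(frame, threshold_db):
    """A frame at or above the threshold counts as a sound determining section."""
    return frame_power_db(frame) >= threshold_db
```

With this rule, a frame whose level is only slightly above the noise level, such as a weak speech head, falls below the threshold and is treated as no-sound.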
An input storing amount computing part 14 supplies a storing amount (storing frame number) to the part 13 for determining reproduction speed of speech sound. A speech head protection section (fixed frame number) is set in the part 13 for determining reproduction speed of speech sound. The part 13 for determining reproduction speed of speech sound determines the reproduction speed of speech sound based on the result of the above-mentioned determination, the storing amount, and the speech head protection section. The part 13 for determining reproduction speed of speech sound supplies the reproduction speed of speech sound to the part 12 for changing reproduction speed of speech sound and the input storing amount computing part 14.
The part 12 for changing reproduction speed of speech sound writes the input sound signal in a buffer, reads the sound signal from the buffer based on the reproduction speed of speech sound supplied from the part 13 for determining reproduction speed of speech sound, and outputs the sound signal from a terminal 15. The input storing amount computing part 14 calculates the storing amount stored in the buffer of the part 12 for changing reproduction speed of speech sound, based on the reproduction speed of speech sound supplied from the part 13 for determining reproduction speed of speech sound, and supplies the storing amount to the part 13 for determining reproduction speed of speech sound.
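How the storing amount could follow from the reproduction speed can be illustrated with a simplified model. The model below is an assumption made for illustration, not the computation of the related art apparatus: the output is taken to run in real time, so that in each frame period the buffer gives up `speed` input frames' worth of signal, and a deleted input frame is never written. The function name `update_storing_amount` is hypothetical.

```python
def update_storing_amount(stored_frames, speed, deleted=False):
    """Return the storing amount (in input frames) after one frame period.

    Assumed model: each period reads `speed` input frames from the buffer
    (0.5 at 2-times extension, 1.0 at 1-time speed). A deleted input frame
    is not written to the buffer.
    """
    if not 0.0 < speed <= 1.0:
        raise ValueError("speed must be in (0, 1]")
    written = 0.0 if deleted else 1.0
    # The reader cannot take more than the buffer holds.
    read = min(speed, stored_frames + written)
    return stored_frames + written - read
```

Under this model the storing amount grows by half a frame per frame period during 2-times extension, stays constant at 1-time speed, and shrinks by one frame per period while no-sound frames are deleted.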
FIG. 2 is a table used by the part 13 for determining reproduction speed of speech sound to determine the reproduction speed of speech sound in the related art case.
In a sound section, the reproduction speed of speech sound is set to be 0.5 times (2-times extension). However, in a case where the process delay time is equal to or greater than 1 second (equal to 50 frames), the reproduction speed of speech sound is set to be 1-time.
In a speech head protection section, namely in a case where a sound determining section exists within the following 3 frames, the reproduction speed of speech sound is set to be 1-time. In a speech end protection section, namely in a case where a sound determining section exists within the past 10 frames, the reproduction speed of speech sound is set to be 1-time.
In a pause holding section, namely within 10 frames after the speech end protection section, the reproduction speed of speech sound is set to be 1-time. In a no-sound deletion section, namely a no-sound section other than the above-mentioned sections, the sound signal is deleted. However, if there is no process delay time, the reproduction speed of speech sound is set to be 1-time.
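The rules of the table can be summarized in a short sketch. The function name, the section labels, and the use of `None` to represent deletion are assumptions for illustration. The sketch also assumes that the sound activity results of the following 3 frames are already available, which in practice requires a corresponding look-ahead delay.

```python
HEAD_FRAMES = 3        # speech head protection: look-ahead length
END_FRAMES = 10        # speech end protection: look-back length
PAUSE_FRAMES = 10      # pause holding after the speech end protection
MAX_DELAY_FRAMES = 50  # 1 second at 20 ms per frame


def determine_speed(vad, i, delay_frames):
    """Return (section, speed) for frame i; speed None means the frame is deleted.

    vad is a list of booleans, True for a sound determining frame.
    """
    if vad[i]:
        if delay_frames >= MAX_DELAY_FRAMES:
            return ("sound (delay limit)", 1.0)
        return ("sound", 0.5)
    if any(vad[i + 1 : i + 1 + HEAD_FRAMES]):
        return ("speech head protection", 1.0)
    if any(vad[max(0, i - END_FRAMES) : i]):
        return ("speech end protection", 1.0)
    start = max(0, i - END_FRAMES - PAUSE_FRAMES)
    if any(vad[start : max(0, i - END_FRAMES)]):
        return ("pause holding", 1.0)
    if delay_frames <= 0:
        return ("no-sound (no delay)", 1.0)
    return ("no-sound deletion", None)
```

For a sound section of 5 frames surrounded by no-sound, this gives 3 frames of speech head protection before the section, then 10 frames of speech end protection and 10 frames of pause holding after it. Every other no-sound frame is deleted while a process delay remains.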
Japanese Laid-Open Patent Application Publication No. 2001-222300 describes that speech speed of a voice section held between non-voice sections of a fixed time length or above is converted so that the speed becomes lower at its top part than the prescribed reproducing speed, and is returned gradually to the prescribed reproducing speed toward the end.
However, in the process for shortening the no-sound section or the process for changing the reproduction speed of speech sound in the no-sound section, it is necessary to consider the precision of the sound activity determination. In an environment without noise, the sound activity determination is made relatively reliably even at the speech head or the speech end. Under a noisy environment, on the other hand, erroneous determination may occur in the sound activity determination.
Under the noisy environment, the noise level may be close to or exceed the power value at the speech head or the speech end. In this case, the speech head or the speech end may not be recognized due to the noise.
Because of this, under the noisy environment, it is difficult to perform the sound activity determination accurately. For example, under the noisy environment, a part where the voice power is small, such as the speech head or an unvoiced consonant, may be determined in error to be no-sound even though the part is in the sound section.
If the process for shortening the no-sound section or for increasing the reproduction speed is implemented based on such an erroneous determination, the sound may be cut or the length of a no-sound continuation may be shortened too much.
FIG. 3 is a graph showing input speech sound signal power and speech sound signal power after the reproduction speed of speech sound is changed, in the related art case.
In FIG. 3(A), the variation with time of the input voice signal power (sound volume) is indicated by solid lines. Noise having a steady power level is superimposed on the sound signal, and a level of the noise level +4 dB is set as the sound threshold value. Determination results of the sections are shown at a lower part of FIG. 3(A).
The speech head protection section extending from each speech head and the speech end protection section extending from each speech end are shown in FIG. 3. The 1st, 2nd, 5th, and 6th voices from the left side are determined to be sound sections. On the other hand, the 3rd and 4th voices are determined to be no-sound sections due to the noise.
While the 3rd voice is not deleted because of the speech end protection, the speech head of the 4th voice is cut because the fixed speech head protection section is short. FIG. 3(B) shows the sound signal power after the reproduction speed of speech sound is changed.
Section (1) of FIG. 3(B):
At the starting point, there is a process delay (input storing amount) of 10 frames due to the change of the reproduction speed.
Section (2) and Section (3) of FIG. 3(B):
The 1st and 2nd voices are determined to be sound, and therefore the wave length extension ratio is 2 times (2-times extension). Between the section (2) and the section (3), the sound signal is output at 1-time speed due to the speech head protection and the speech end protection.
Section (4) of FIG. 3(B):
The 3rd voice is determined to be no-sound but is in the speech end protection section and the pause holding section. Therefore, the reproduction speed is 1-time speed.
In the no-sound section after this, the reproduction speed is 1-time speed while within the pause holding section. After this, the sound signal is deleted.
Section (5) of FIG. 3(B):
The 4th voice is determined to be no-sound and the speech head protection is applied to only a part of it. Since there is a sufficient process delay (input storing amount) of the change of reproduction speed at this point, the sound signal is output at 1-time speed only in the protection section. Outside this section, the sound signal is deleted, so that the speech head is cut.
Section (6) of FIG. 3(B):
The 5th voice is determined to be sound, and therefore the wave length extension ratio is 2 times (2-times extension).
In the related art case, since a speech head protection section having a fixed length is set for the speech head protection, it is necessary to add a delay corresponding to the speech head protection section. For example, a sufficiently long speech head protection section can be set for stored sound such as that of a telephone answering service. However, in a case where the reproduction speed is changed during actual communication, it is necessary to make the delay as small as possible. Therefore, in this case, a speech head protection section having a sufficient length cannot be set, so that the speech head may be cut.