When the speech speed converting method is applied to an actual broadcast, the delay relative to the original speech can become a problem in some cases, for example in emergency news. In visual media in particular, this delay may have an adverse effect that runs counter to the benefit expected from speech speed conversion.
Therefore, approaches have been reported for achieving the speech speed converting effect (an impression of slowness) without delay from the original speech. One is a method that, instead of slowing the speech uniformly, suppresses the extension in time by changing the speech speed from slow to fast as a function of the time elapsed from the start point to the end point of one breath of speech, and then appropriately shortens the non-speech interval between sentences (R. Ikezawa et al., "An Approach for Absorbing Extension in Time Caused in Speech Speed Conversion", Spring Conference, Japanese Acoustic Society, 2-6-2, pp. 331-332, 1992). Another is a method of achieving this approach in real time (A. Imai et al., "Real Time Absorption Method for Extension in Time Caused in Speech Speed Conversion", in International Conference, IEICE, D-694, pp 300, 1995).
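The idea of a conversion factor that varies over one breath of speech, followed by a shortened pause, can be illustrated with a minimal sketch. It is not the implementation of the cited reports: the linear ramp, the function names (`conversion_factor`, `converted_duration`, `shortened_pause`) and all numeric defaults are assumptions chosen only for illustration.

```python
def conversion_factor(t, duration, start_factor=1.3, end_factor=0.9):
    """Time-stretch factor at elapsed time t (seconds) within one breath of speech.

    Assumed linear ramp: a factor above 1 plays slower than the original,
    a factor below 1 plays faster.
    """
    if duration <= 0:
        return start_factor
    ratio = min(max(t / duration, 0.0), 1.0)
    return start_factor + (end_factor - start_factor) * ratio


def converted_duration(duration, step=0.01, **kwargs):
    """Output length of the breath, summing factor * step over the input time."""
    n = round(duration / step)
    return sum(
        conversion_factor((i + 0.5) * step, duration, **kwargs) * step
        for i in range(n)
    )


def shortened_pause(pause, extension, min_pause=0.2):
    """Shorten the following pause to absorb the extension.

    The pause is never cut below min_pause (hypothetical value), and a pause
    already shorter than min_pause is left unchanged.
    """
    return max(pause - extension, min(pause, min_pause))


if __name__ == "__main__":
    speech, pause = 2.0, 0.6                 # seconds, hypothetical input
    out_speech = converted_duration(speech)  # longer than the input speech
    out_pause = shortened_pause(pause, out_speech - speech)
    print(out_speech + out_pause, speech + pause)
```

With these assumed values the speech is extended by about 0.2 seconds and the pause is shortened by the same amount, so the converted output ends at roughly the same time as the original.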
The former sets an appropriate function manually on the assumption that all speech styles are known in advance. The latter also sets the function defining the factor manually, and keeps this function fixed once it has been set.
In addition, to shorten the non-speech interval, only a constant remaining time is set, and it is set manually. If a large amount of "inconsistency" accumulates, the extended speech held in a buffer is cleared manually.
Consequently, the prior-art speech speed converting device has the following problem. Because broadcast speech contains various speaking styles (speech speed, "timing" in speech, etc.) depending on the speaker, and appropriate parameters must be set manually for each of them, the device has many points of operation, the setting itself is difficult, and the device is hard for an ordinary user to handle.
Moreover, the above speech speed converting device must distinguish the speech interval from the non-speech interval. Various speech interval detecting systems exist in the prior art.
In one known prior-art speech interval detecting system, a noise level and a speech level are calculated from the power of the speech signal or the like, a level threshold value is set based on the calculation result, and the input signal is compared with this level threshold value. The interval is judged to be a speech interval if the level of the input signal is higher than the level threshold value, and a non-speech interval if it is lower.
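The comparison of signal power with a level threshold can be sketched as follows. This is only an illustration under stated assumptions: the frame-based power in decibels, the function names (`frame_power_db`, `classify_frames`) and the sample values are hypothetical, and the threshold is simply given rather than derived from a noise level and a speech level.

```python
import math


def frame_power_db(samples, frame_len):
    """Mean power of each non-overlapping frame, in dB (assumed framing)."""
    powers = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        mean_square = sum(x * x for x in frame) / frame_len
        # A small floor avoids log10(0) on digital silence.
        powers.append(10.0 * math.log10(mean_square + 1e-12))
    return powers


def classify_frames(powers_db, threshold_db):
    """True marks a speech frame (power above threshold), False a non-speech frame."""
    return [p > threshold_db for p in powers_db]


if __name__ == "__main__":
    # Hypothetical signal: a quiet frame followed by a loud frame.
    signal = [0.001] * 4 + [0.5] * 4
    powers = frame_power_db(signal, frame_len=4)
    print(classify_frames(powers, threshold_db=-30.0))
```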
There are three representative systems for setting the level threshold value employed in this system. According to the first system, a value obtained by adding a preselected constant to the noise level value of the input speech is employed as the level threshold value. According to the second system, which is an improvement on the first system, the level threshold value is set to a relatively large value when the value obtained by subtracting the noise level value from the maximum level value of the input speech signal is large, and to a relatively small value when that difference is small (for example, Patent Application Publication (KOKAI) Sho 58-130395, Patent Application Publication (KOKAI) Sho 61-272796, etc.).
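The difference between the first and second threshold setting systems can be sketched as below. The text states only that the second system uses a larger threshold when the gap between the maximum level and the noise level is large. The proportional rule, the function names (`threshold_first`, `threshold_second`) and the constants used here are assumptions for illustration, not the rule of the cited publications.

```python
def threshold_first(noise_db, margin_db=10.0):
    """First system: noise level plus a preselected constant (assumed 10 dB)."""
    return noise_db + margin_db


def threshold_second(noise_db, max_db, ratio=0.3, min_margin_db=3.0):
    """Second system (assumed form): the margin grows with (max level - noise level).

    ratio and min_margin_db are hypothetical tuning values.
    """
    dynamic_range = max_db - noise_db
    return noise_db + max(ratio * dynamic_range, min_margin_db)


if __name__ == "__main__":
    noise = -60.0
    print(threshold_first(noise))                   # same for any speech level
    print(threshold_second(noise, max_db=-10.0))    # loud speech, larger margin
    print(threshold_second(noise, max_db=-40.0))    # quiet speech, smaller margin
```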
According to the third system, in addition to these level threshold value setting methods, the input signal is monitored continuously, and the level of the input signal is regarded as the noise level when it remains steady over a constant time period. The threshold value employed for the speech interval detection is then set while the noise level is updated sequentially (Proceeding in International Conference, IEICE, D-695, pp 301, 1995).
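The sequential noise level update of the third system can be sketched as follows. This is a minimal illustration, not the method of the cited proceedings: the class name `SteadyNoiseTracker`, the window length, the steadiness tolerance, the margin and the initial noise level are all assumed values.

```python
from collections import deque


class SteadyNoiseTracker:
    """Updates the noise level only while the input level stays steady."""

    def __init__(self, window=5, tolerance_db=2.0, margin_db=10.0,
                 initial_noise_db=-60.0):
        self.recent = deque(maxlen=window)  # most recent frame levels in dB
        self.tolerance_db = tolerance_db    # assumed definition of "steady"
        self.margin_db = margin_db          # assumed constant added to the noise level
        self.noise_db = initial_noise_db

    def update(self, level_db):
        """Feed one frame level and return the current detection threshold."""
        self.recent.append(level_db)
        window_full = len(self.recent) == self.recent.maxlen
        if window_full and max(self.recent) - min(self.recent) <= self.tolerance_db:
            # Level was steady over the whole window: treat it as noise.
            self.noise_db = sum(self.recent) / len(self.recent)
        return self.noise_db + self.margin_db


if __name__ == "__main__":
    tracker = SteadyNoiseTracker(window=3)
    for level in [-50.0, -50.0, -50.0, -20.0, -35.0, -10.0]:
        print(level, tracker.update(level))
```

In this sketch the noise level is updated after the three steady frames and then stays unchanged while the level fluctuates, because no window of frames satisfies the steadiness condition.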
However, the above prior-art speech interval detecting systems have the problems described below.
First, the first system has the advantage of being simple, and it works well when the average speech level is moderate. However, it tends to misdetect noise and the like as speech when the average speech level is too high, and it tends to miss part of the speech when the average speech level is too low.
Next, the second system can overcome the problem arising in the first system. However, because it presupposes that the levels of the noise and background sounds in the input signal remain substantially constant, the second system can follow variation in the speech level but cannot assure precise speech interval detection when the levels of the noise and background sounds change from moment to moment.
The third system takes such variation in the noise level into account, so erroneous detection does not occur even when the noise level changes over time.
However, a broadcast program or the like contains not only noise but also background sounds such as music and imitation sounds used as sound effects. The levels of these sounds commonly change from moment to moment while the speech continues without pause, so the input signal level seldom remains steady over a predetermined time period. In such a case, even the third system cannot set the noise level correctly, and it is therefore difficult to detect the speech interval precisely.
The present invention has been made in view of the above circumstances. An object of the present invention is to provide a speech speed converting method, and a device for embodying the same, that can adaptively control the speech speed conversion factor and the non-speech interval according to set conditions once the user has simply set, a single time, the conversion factors serving as targets at several stages, and that can stably achieve the effect expected of speech speed conversion within the time range actually broadcast.
Another object of the present invention is to provide a speech interval detecting method, and a device for embodying the same, that can discriminate the speech interval from the non-speech interval by processing the speech in real time so as to respond sequentially to changes in the respective levels of the input speech and the background sound, while shortening the calculation time and reducing the cost because only the power, which can be derived relatively simply, is employed as a feature parameter.