The present invention relates to a speech speed converting method, and a device embodying the same, which achieve the ease of listening expected from speech speed conversion without extending playback time in various video, audio, and medical devices such as a television set, a radio, a tape recorder, a video tape recorder, a video disk player, and a hearing aid.
The present invention also relates to a speech interval detecting method, and a device embodying the same, which discriminate between speech intervals and non-speech intervals of an input signal when speech delivered together with noise or background sounds in a broadcast program, on a recording tape, or in daily life is processed to change the pitch of the voice or the speech speed, is mechanically recognized for its meaning, is coded for transfer or recording, or the like.
[Outline of the Invention]
The present invention relates to a speech speed converting method, and a device embodying the same, which convert speech speed in real time by processing human speech. When the delivery speed (speech speed) of the speech being listened to is slowed, a series of processes is carried out without omission of information while constantly monitoring, in constant processing units, the data length of the input speech, the output data length calculated in advance according to a conversion function based on a previously given scaling factor, and the data length of the speech actually being output.
Furthermore, the speech speed converting method and device can, for example, appropriately shorten any non-speech interval whose length exceeds a variable threshold value set according to the degree of slowing (conversion factor) expected of the speech speed conversion, with the aim of minimizing the time difference between image and speech that extension of the speech causes when watching a television receiver. By adaptively changing the conversion factor according to the time difference between the input data length and the output data length, the maximum impression of slowness achievable within a given time range can be created automatically, while the speaking time of the converted speech is kept substantially within the speaking time of the original speech.
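As a purely illustrative sketch of the adaptive control described above, the following Python fragment tapers the conversion factor toward 1.0 as the output data length runs ahead of the input data length. The linear taper, the function name adaptive_factor, and the parameter max_lag are assumptions introduced here for illustration; this passage does not specify the conversion function itself.

```python
def adaptive_factor(target_factor, input_len, output_len, max_lag):
    """Return the stretch factor for the next processing unit.

    target_factor: user-chosen factor (> 1.0 means slower speech).
    input_len:     samples of input speech received so far.
    output_len:    samples of converted speech produced so far.
    max_lag:       hypothetical upper bound on the tolerated lag, in samples.
    """
    # Lag is how far the converted output has grown beyond the input.
    lag = max(0, output_len - input_len)
    # Fraction of the lag budget still unused (1.0 = none used, 0.0 = exhausted).
    headroom = max(0.0, 1.0 - lag / max_lag)
    # Assumed linear taper from target_factor down to 1.0 (no stretching).
    return 1.0 + (target_factor - 1.0) * headroom
```

With this assumed taper, the full target factor is applied while no lag has accumulated, and the speech is passed at its original speed once the lag budget is used up, which is one possible way to keep the converted speech within the original speaking time.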
Moreover, the present invention calculates the power of the input signal data at a predetermined time interval, in frames having a predetermined time width, and holds the maximum value and the minimum value of the power within a past predetermined time period. It then discriminates between the speech interval and the non-speech interval for every frame by using a power threshold value that is changed according to the maximum value and the difference between the maximum value and the minimum value, so as to respond sequentially to changes in the respective powers of the input speech and the background sound. As a result, by precisely detecting the speech interval of the input signal, improvement in the quality of processed sound, improvement in the speech recognition rate, increase in coding efficiency, and improvement in the quality of decoded speech can be achieved when speech delivered together with noise or background sounds in a broadcast program, on a recording tape, or in daily life is processed to change the pitch of the voice or the speech speed, to recognize the meaning of the speech mechanically, to code the speech for transfer or recording, and the like.
In addition, because only the power, which can be derived relatively simply, is employed as a feature parameter, the speech processing can be executed in real time while shortening the calculation time and reducing the cost.
When the speech speed converting method is applied to an actual broadcast, delay relative to the original speech can become an issue in some cases, such as emergency news. In visual media in particular, this delay may have an adverse effect that runs counter to the effect expected of the speech speed conversion.
Therefore, as approaches for achieving the speech speed converting effect (impression of slowness) without delay from the original speech, there have been reported a method of suppressing extension in time by changing the speech speed from slow to fast as a function of the time elapsed from the start point to the end point of one breath group of speech, instead of applying uniformly slow conversion, and then appropriately shortening the non-speech interval between sentences (R. Ikezawa et al., "An Approach for Absorbing Extension in Time Caused in Speech Speed Conversion", Spring Conference, Japanese Acoustic Society, 2-6-2, pp.331-332, 1992), a method of achieving this approach in real time (A. Imai et al., "Real Time Absorption Method for Extension in Time Caused in Speech Speed Conversion", in International Conference, IEICE, D-694, pp 300, 1995), etc.
The former sets an appropriate function manually under the assumption that all speech styles are known in advance. The latter also sets the function defining the factor manually, and fixes this function once it has been set.
In addition, only a constant remaining time is set manually for shortening the non-speech interval. If a great deal of "inconsistency" accumulates, the extended speech accumulated in a buffer is cleared manually.
Therefore, the speech speed converting device in the prior art has had the problem that, since the broadcast speech exhibits various speaking styles (speech speed, "timing" in speech, etc.) depending on the speaker, and appropriate parameters must be set manually for each, the device has many operation points, the setting itself is difficult, and the device is difficult for an ordinary user to handle.
Besides, in the above speech speed converting device, the speech interval and the non-speech interval must be distinguished from each other. Various speech interval detecting systems exist in the prior art.
One speech interval detecting system known in the prior art calculates a noise level and a speech level based on the power of the speech signal, etc., sets a level threshold value based on the calculation result, and compares this level threshold value with the input signal. The interval is decided to be a speech interval if the level of the input signal is higher than the level threshold value, and a non-speech interval if the level of the input signal is lower than the level threshold value.
There are three representative systems for setting the level threshold value employed in this system. According to the first system, a value obtained by adding a preselected constant to the noise level value of the input speech is employed as the level threshold value. According to the second system, which improves on the first system, the level threshold value is set to a relatively large value when the value obtained by subtracting the noise level value from the maximum level value of the input speech signal is large, and to a relatively small value when that value is small (for example, Patent Application Publication (KOKAI) Sho 58-130395, Patent Application Publication (KOKAI) Sho 61-272796, etc.).
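The first and second threshold-setting systems can be sketched roughly as follows. This is a minimal illustration, not the method of the cited publications: levels are assumed to be in dB, and the function names, the margin parameter, and the proportional rule with its ratio parameter are hypothetical choices made only to show the described behavior.

```python
def threshold_first_system(noise_level, margin):
    """First system: noise level plus a preselected constant (margin)."""
    return noise_level + margin


def threshold_second_system(noise_level, max_level, ratio=0.3):
    """Second system: the threshold grows with (max_level - noise_level).

    The proportional rule and the default ratio are assumptions; the text
    only states that a larger difference gives a relatively larger threshold.
    """
    return noise_level + ratio * (max_level - noise_level)


def is_speech(level, threshold):
    """An interval is taken as speech when its level exceeds the threshold."""
    return level > threshold
```

Under these assumptions, both rules depend on a single noise level estimate, so both inherit any error in that estimate.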
According to the third system, in addition to these level threshold value setting methods, the input signal is monitored continuously, then the input signal is regarded as the noise level when the level of the input signal is steady over a constant time period, and then a threshold value employed for the speech interval detection is set while updating the noise level sequentially (Proceeding in International Conference, IEICE, D-695, pp 301, 1995).
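The noise level updating of the third system might be sketched as below. The class name SteadyNoiseTracker, the window length, the tolerance used to judge steadiness, and the use of the window mean as the new noise level are all assumptions made for illustration; the cited proceedings are not reproduced here.

```python
from collections import deque


class SteadyNoiseTracker:
    """Updates a noise level estimate whenever the input level is steady."""

    def __init__(self, window, tolerance, initial_noise):
        self.levels = deque(maxlen=window)  # most recent input levels
        self.tolerance = tolerance          # assumed steadiness criterion
        self.noise_level = initial_noise

    def update(self, level):
        """Add one level measurement and return the current noise estimate."""
        self.levels.append(level)
        window_full = len(self.levels) == self.levels.maxlen
        steady = max(self.levels) - min(self.levels) <= self.tolerance
        if window_full and steady:
            # Steady over the whole window: regard the input as noise.
            self.noise_level = sum(self.levels) / len(self.levels)
        return self.noise_level
```

In this sketch the estimate is only refreshed when a full window of steady levels is observed; if the input never settles, the estimate is never updated.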
However, the above prior-art speech interval detecting systems have the following problems.
To begin with, the first system has the advantage that it is simple, and it operates well when the average level of the speech is in the middle range. However, the first system tends to detect noise, etc. erroneously as speech when the average level of the speech is too large, and tends to miss part of the speech when the average level of the speech is too small.
Next, the second system can overcome the problem arising in the first system. However, because it presumes that the levels of the noise and the background sounds in the input signal are kept substantially constant, the second system can follow variation in the level of the speech but cannot assure precise speech interval detection when the levels of the noise and the background sounds change from moment to moment.
The third system takes such variation in the noise level into account, so that erroneous detection is not caused even when the noise level changes sequentially.
However, a broadcast program, etc. contains not only noise but also background sounds such as music and imitation sounds used as sound effects. Commonly their levels change from moment to moment while the speech continues to be delivered without pause, so that the input signal level seldom becomes steady over a predetermined time period. In such a case, since the noise level cannot be set correctly even by the third system, it is difficult to detect the speech interval precisely.
The present invention has been made in view of the above circumstances. It is an object of the present invention to provide a speech speed converting method, and a device embodying the same, which are capable of adaptively controlling the speech speed conversion factor and the non-speech interval according to set conditions merely by having the user set, once, a target conversion factor selected from several stages, and of stably achieving the effect expected of the speech speed conversion within the time range in which the speech is actually delivered.
It is another object of the present invention to provide a speech interval detecting method, and a device embodying the same, which employ as a feature parameter only the power, which can be derived relatively simply, and are therefore capable of discriminating between the speech interval and the non-speech interval by executing the speech processing in real time so as to respond sequentially to changes in the respective levels of the input speech and the background sound, while shortening the calculation time and reducing the cost.
In order to achieve the above object, there is provided a speech interval detecting method set forth in claim 1 comprising the steps of: calculating a frame power of input signal data in units of a predetermined frame width at a predetermined time interval, and holding a maximum value and a minimum value of the frame power within a past predetermined time period; deciding a threshold value for power that is changed according to the maximum value being held and the difference between the maximum value and the minimum value; and comparing the threshold value with the power of a current frame to decide whether the current frame belongs to a speech interval or a non-speech interval.
According to the above configuration, in the speech interval detecting method aspect of the invention, a frame power of input signal data is calculated in units of a predetermined frame width at a predetermined time interval, a maximum value and a minimum value of the frame power within a past predetermined time period are held, a threshold value for power is decided according to the maximum value being held and the difference between the maximum value and the minimum value, and the threshold value and the power of a current frame are compared with each other to decide whether the current frame belongs to a speech interval or a non-speech interval. Therefore, the speech interval and the non-speech interval can be discriminated by executing the speech processing in real time while responding sequentially to changes in the respective levels of the input speech and the background sound.
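The first step of the method, calculating frame power and holding its maximum and minimum over a past period, could be sketched as follows. Expressing power in dB, the small floor that avoids taking the logarithm of zero, and the names frame_power_db, frame_powers, and PowerHistory are assumptions for illustration only.

```python
import math
from collections import deque


def frame_power_db(samples, floor=1e-12):
    """Mean-square power of one frame, in dB (the dB scale is an assumption)."""
    mean_square = sum(s * s for s in samples) / len(samples)
    return 10.0 * math.log10(max(mean_square, floor))


def frame_powers(signal, frame_width, hop):
    """Yield the power of each frame of frame_width samples, every hop samples."""
    for start in range(0, len(signal) - frame_width + 1, hop):
        yield frame_power_db(signal[start:start + frame_width])


class PowerHistory:
    """Holds the frame powers of the most recent `length` frames."""

    def __init__(self, length):
        self.powers = deque(maxlen=length)  # oldest entry drops out automatically

    def push(self, power):
        """Record one frame power and return the held (maximum, minimum)."""
        self.powers.append(power)
        return max(self.powers), min(self.powers)
```

Here frame_width corresponds to the predetermined frame width, hop to the predetermined time interval, and length to the past predetermined time period counted in frames.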
According to the speech interval detecting method set forth in the preceding paragraph, if the difference between the maximum value and the minimum value is less than a predetermined value, the threshold value is decided closer to the maximum value than in a case where the difference between the maximum value and the minimum value is more than the predetermined value.
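One possible reading of this threshold rule is sketched below. The two fixed offsets below the maximum and all numeric defaults are hypothetical; the text states only that the threshold lies closer to the maximum value when the difference is smaller than the predetermined value.

```python
def decide_threshold(p_max, p_min, spread_limit=20.0,
                     near_offset=3.0, far_offset=15.0):
    """Return a power threshold (dB) placed below the held maximum.

    spread_limit stands in for the "predetermined value"; near_offset and
    far_offset are assumed distances below the maximum for the small-spread
    and large-spread cases respectively.
    """
    spread = p_max - p_min
    offset = near_offset if spread < spread_limit else far_offset
    return p_max - offset


def is_speech_frame(power, threshold):
    """A frame is taken as speech when its power exceeds the threshold."""
    return power > threshold
```

With these assumed values, a held maximum of 60 dB gives a threshold 3 dB below it when the held minimum is 50 dB, and 15 dB below it when the held minimum is 20 dB.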
In order to achieve the above object, there is provided a speech interval detecting device as in the preceding paragraph including: a power calculator for calculating a frame power of input signal data in units of a predetermined frame width at a predetermined time interval; an instantaneous power maximum value latch for holding a maximum value of the frame power within a past predetermined time period; an instantaneous power minimum value latch for holding a minimum value of the frame power within the past predetermined time period; a power threshold value decision portion for deciding a threshold value for power that is changed according to the maximum value held in the instantaneous power maximum value latch and the difference between that maximum value and the minimum value held in the instantaneous power minimum value latch; and a discriminator for comparing the threshold value obtained by the power threshold value decision portion with the power of a current frame to decide whether the current frame belongs to a speech interval or a non-speech interval.
According to the above configuration, in the speech interval detecting device set forth in the preceding paragraph, the power calculator calculates a frame power of input signal data in units of a predetermined frame width at a predetermined time interval, the instantaneous power maximum value latch holds a maximum value of the frame power within a past predetermined time period, the instantaneous power minimum value latch holds a minimum value of the frame power within the past predetermined time period, the power threshold value decision portion decides a threshold value for power that is changed according to the maximum value held in the instantaneous power maximum value latch and the difference between that maximum value and the minimum value held in the instantaneous power minimum value latch, and the discriminator compares the threshold value obtained by the power threshold value decision portion with the power of a current frame to decide whether the current frame belongs to a speech interval or a non-speech interval. Therefore, while shortening the calculation time and reducing the cost by employing as a feature parameter only the power, which can be derived relatively simply, the speech interval and the non-speech interval can be discriminated by executing the speech processing in real time so as to respond sequentially to changes in the respective levels of the input speech and the background sound.
According to the speech interval detecting device set forth in the preceding paragraph, if the difference between the maximum value and the minimum value is less than a predetermined value, the power threshold value decision portion decides the threshold value closer to the maximum value than in a case where the difference between the maximum value and the minimum value is more than the predetermined value.
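The device configuration described above might be modeled in software along the following lines, with one method per named component. This is a sketch under stated assumptions rather than the disclosed circuit: power is taken in dB, a single buffer of recent frame powers backs both latches, the current frame is included in the held history before the threshold is decided, and the class name and all parameters are hypothetical.

```python
import math
from collections import deque


class SpeechIntervalDetector:
    """Illustrative composition of the five components named in the text."""

    def __init__(self, history_frames, spread_limit, near_offset, far_offset):
        # One buffer of recent frame powers backs both latches (assumption).
        self.history = deque(maxlen=history_frames)
        self.spread_limit = spread_limit  # stands in for the "predetermined value"
        self.near_offset = near_offset    # dB below max when the spread is small
        self.far_offset = far_offset      # dB below max when the spread is large

    @staticmethod
    def power(frame):
        """Power calculator: mean-square power of one frame, in dB."""
        mean_square = sum(s * s for s in frame) / len(frame)
        return 10.0 * math.log10(max(mean_square, 1e-12))

    def max_latch(self):
        """Instantaneous power maximum value latch."""
        return max(self.history)

    def min_latch(self):
        """Instantaneous power minimum value latch."""
        return min(self.history)

    def threshold(self):
        """Power threshold value decision portion."""
        p_max, p_min = self.max_latch(), self.min_latch()
        small_spread = (p_max - p_min) < self.spread_limit
        return p_max - (self.near_offset if small_spread else self.far_offset)

    def process(self, frame):
        """Discriminator: True if the frame is judged to be speech."""
        p = self.power(frame)
        self.history.append(p)  # current frame joins the held history
        return p > self.threshold()
```

A caller would feed successive frames to process() and receive one speech or non-speech decision per frame. In this sketch, while the held history contains only frames of similar power, the small-spread rule places the threshold just under the maximum, so decisions made before the history has seen both speech and background levels should be treated with caution.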