The invention relates to a method of converting the speech rate of a speech signal having a pitch period below a maximum expected pitch period. The method comprises the steps of dividing the speech signal into segments, estimating the pitch period of the speech signal in a segment, copying a fraction of the speech signal in the segment, said fraction having a duration equal to said estimated pitch period, providing from said fraction an intermediate signal having the same duration, and expanding the segment by inserting said intermediate signal pitch synchronously into the speech signal of the segment. The invention also relates to the use of the method in a mobile telephone. Further, the invention relates to a device adapted to convert the speech rate of a speech signal.
In many situations it is desirable to enhance the intelligibility of speech. Elderly people in particular are often troubled by some hearing impairment, which among other things lowers their comprehension of rapidly uttered speech. Children with language-learning difficulties could also benefit from improved intelligibility. Further, when mobile telephones are used in noisy environments it can be difficult to fully understand what is being said. This difficulty affects not only hearing-impaired people but anybody. Therefore, there is an increasing demand for enhanced intelligibility in mobile telephones.
One way of enhancing the intelligibility of the speech is to slow down the speech. The principal objective of this approach is to give the listener some extra time to recognize what is being said. This can be obtained by using time-scaling techniques, which means that the temporal evolution of the signal is changed. The speech rate is adjusted by adding extra time data to the signal according to a chosen algorithm.
Several speech enhancement algorithms exist that are based on the technique of slowing down input speech. The fundamental idea of these algorithms is to extend the speech in a way that preserves its natural quality while raising its intelligibility. Most extension algorithms therefore depend on the pitch periodicity of the speech. However, such algorithms have not been suitable for implementation in mobile telephones.
A device utilizing such an algorithm is known from the article Y. Nejime, T. Aritsuka, T. Imamura, T. Ifukube, and J. Matsushima, “A Portable Digital Speech-Rate Converter for Hearing Impairment”, IEEE Transactions on Rehabilitation Engineering, vol. 4, no. 2, pp. 73-83, June 1996. The device is a hand-sized portable device that converts the speech rate without changing the pitch. When the speech speed is slowed, a time delay occurs between the input and the output speech. The speech signals are recorded into a solid-state memory while previously recorded signals are being slowed and generated. The user activates the device by holding down a button on the device. The longer the user holds the button to slow the speech, the longer the delay. Although the delay may be reduced by cutting silent intervals in excess of one second, this is not sufficient to eliminate the delay. The user can return to non-delay by releasing the button.
The speech data in the memory are partitioned into frames. The time-scaling process expands the time scale of the speech data frame by frame. The time expansion is obtained by inserting a composite pitch pattern created from the signal of three consecutive pitch periods. The composite pattern is used in order to avoid reverberation of the expanded signal. Because the time-scaling process used needs four-pitch-length data elements, the length of each frame is 48 ms corresponding to four times the assumed maximum pitch interval which is set to 12 ms in this document. Other documents mention assumed maximum pitch periods of 16 ms or even close to 20 ms, which would necessitate even longer frame lengths and thus larger amounts of data to be processed for each frame.
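The frame-length arithmetic of this known scheme can be illustrated by a short calculation. The following is a sketch only: the function name `prior_art_frame` is hypothetical, and the sampling rate of 8000 Hz is an assumption that is not taken from the cited article.

```python
def prior_art_frame(max_pitch_ms, pitch_lengths=4, sample_rate_hz=8000):
    """Return (frame length in ms, samples per frame).

    pitch_lengths=4 reflects the four-pitch-length data elements described
    above; sample_rate_hz=8000 is an assumed sampling rate.
    """
    frame_ms = pitch_lengths * max_pitch_ms
    return frame_ms, frame_ms * sample_rate_hz // 1000


# Assumed maximum pitch periods mentioned in the text: 12, 16 and about 20 ms.
for max_pitch_ms in (12, 16, 20):
    frame_ms, samples = prior_art_frame(max_pitch_ms)
    print(f"max pitch {max_pitch_ms} ms: frame {frame_ms} ms, {samples} samples")
```

Under these assumptions a 12 ms maximum pitch period gives a 48 ms frame, and the frame grows in proportion when a longer maximum pitch period is assumed.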
It is in particular this amount of data to be processed that makes the above algorithm less attractive for use in mobile telephones, because the computational resources in a mobile telephone are severely limited. Another drawback of the algorithm is the time delay that can accumulate while the user holds the button of the device. A mobile telephone is almost always used for two-way communication between two persons, and it is therefore desirable to keep the expanded speech as close to real time as possible.
It is an object of the invention to provide a method of the above-mentioned type in which a considerably smaller amount of data has to be processed for a frame, so that the method can be implemented with the limited computational resources of e.g. a mobile telephone.
According to the invention, this object is achieved in that a segment size longer than said maximum expected pitch period but shorter than twice the maximum expected pitch period is used.
Tests have shown that the risk of reverberation is smaller for speech signals having relatively long pitch periods than for signals having short pitch periods, because signals with long pitch periods change more slowly. Therefore, a composite pitch pattern is not needed for these signals, and it is sufficient to have a frame or segment length that just allows a pattern of one full pitch length to be processed. Consequently, the segment size can be reduced to a value which is only slightly longer than the maximum expected pitch period, i.e. between the maximum expected pitch period and twice the maximum expected pitch period. The shorter segment or frame length reduces the amount of data to be processed for each segment, and the amount is further reduced because the calculation of the composite signal can be avoided at least for speech signals with long pitch periods. For speech signals having a shorter pitch period it may still be possible to form a composite pitch pattern from e.g. two consecutive pitch periods.
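The condition on the segment size can be expressed as a simple check. The following is a minimal sketch; the function name `segment_size_ok` is hypothetical, and the example values are chosen for illustration only.

```python
def segment_size_ok(segment_ms, max_pitch_ms):
    """True if the segment is longer than the maximum expected pitch period
    but shorter than twice the maximum expected pitch period."""
    return max_pitch_ms < segment_ms < 2 * max_pitch_ms


print(segment_size_ok(20, 16))  # 16 < 20 < 32
print(segment_size_ok(48, 12))  # 48 is not shorter than 2 * 12
print(segment_size_ok(20, 20))  # segment must be strictly longer
```

The first call satisfies the condition, whereas the 48 ms frame of the known device with its 12 ms maximum pitch interval does not.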
In an expedient embodiment the method further comprises the step of providing, if the actual estimated pitch period of the segment is greater than half the segment size, the intermediate signal by using the copied fraction directly as the intermediate signal. This avoids the extra calculation of a composite signal.
If the actual estimated pitch period of a segment is less than half the segment size, the method may further comprise the steps of copying two consecutive fractions, each having a duration equal to the estimated pitch period, and providing the intermediate signal as an average of the two consecutive fractions. In this way reverberation may be minimized for speech with shorter pitch periods, which have a higher risk of such reverberation.
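The two ways of providing the intermediate signal may be sketched as follows. This is an illustration under stated assumptions and not a definitive implementation: the function name `intermediate_signal` is hypothetical, the pitch period is given in samples, the fractions are assumed to be taken from the start of the segment, and a pitch period of exactly half the segment size (a case not specified above) is assumed to be averaged.

```python
def intermediate_signal(segment, pitch):
    """Return an intermediate signal of `pitch` samples for one segment."""
    if not 0 < pitch <= len(segment):
        raise ValueError("pitch period must be between 1 and the segment length")
    if 2 * pitch > len(segment):
        # Long pitch period: use the copied fraction directly.
        return list(segment[:pitch])
    # Short pitch period: average two consecutive fractions sample by sample.
    first = segment[:pitch]
    second = segment[pitch:2 * pitch]
    return [(a + b) / 2 for a, b in zip(first, second)]


segment = list(range(10))
print(intermediate_signal(segment, 6))  # copied directly
print(intermediate_signal(segment, 4))  # average of two fractions
```

In both branches the intermediate signal has the same duration as the estimated pitch period, so it can be inserted pitch synchronously.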
When the method further comprises the steps of classifying a segment of the speech signal as a silent segment if its content of speech information is below a preset threshold, and shortening a segment if that segment and a number of immediately preceding segments have been classified as silent segments, so as to compensate for expansion of previous segments, it is possible to maintain the delay between the input signal and the (expanded) output signal at a very low level, thus providing a substantially real-time conversion of the speech. This makes the algorithm more suited for use in mobile telephones in which it is desired to keep the expanded speech as close to real time as possible.
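The interplay between expansion and the shortening of silent segments may be sketched as a simple bookkeeping of the accumulated delay. The sketch rests on several assumptions that are not stated above: the content of speech information is measured as mean-square energy, every non-silent segment is expanded by one pitch period, and a silent segment is shortened by at most the accumulated delay. The names `is_silent`, `output_lengths` and `silent_run` are hypothetical.

```python
def is_silent(segment, threshold):
    """Classify a segment as silent if its mean-square energy (an assumed
    measure of speech content) is below the preset threshold."""
    energy = sum(x * x for x in segment) / len(segment)
    return energy < threshold


def output_lengths(segments, pitch_periods, threshold, silent_run=2):
    """Return (output length per segment, remaining delay), in samples."""
    delay = 0   # samples by which the output lags the input
    run = 0     # number of consecutive silent segments seen so far
    lengths = []
    for segment, pitch in zip(segments, pitch_periods):
        if is_silent(segment, threshold):
            run += 1
            cut = 0
            # Shorten only if this segment and `silent_run` preceding
            # segments have all been classified as silent.
            if run > silent_run:
                cut = min(delay, len(segment))
            delay -= cut
            lengths.append(len(segment) - cut)
        else:
            run = 0
            delay += pitch  # expansion by one pitch period adds delay
            lengths.append(len(segment) + pitch)
    return lengths, delay


voiced, silence = [100] * 160, [0] * 160
print(output_lengths([voiced, silence, silence, silence], [80, 0, 0, 0], 1.0))
```

In this example the 80 samples added to the voiced segment are recovered in the third silent segment, so that the delay returns to zero.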
An embodiment especially expedient for use in mobile telephones is obtained when a segment size of 20 ms is used, because this segment size is also used by the existing speech signal processing in many mobile telephones, and thus considerable computational resources can be saved by using the same segments for the speech expansion algorithm.
When a segment is expanded by inserting the intermediate signal pitch synchronously into the speech signal of the segment a plurality of times, higher expansion rates can be achieved without increasing the use of computational resources considerably.
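Repeated insertion may be sketched as follows. This is an illustration only: the function name `expand` is hypothetical, and inserting the intermediate signal after the first pitch period of the segment is an assumption about the insertion point.

```python
def expand(segment, intermediate, repeats=1):
    """Insert `intermediate` (one pitch period long) `repeats` times after
    the first pitch period of the segment."""
    pitch = len(intermediate)
    return list(segment[:pitch]) + list(intermediate) * repeats + list(segment[pitch:])


segment = [0, 1, 2, 3] * 5                # periodic signal, pitch period 4
expanded = expand(segment, segment[:4], repeats=3)
print(len(segment), len(expanded))        # 20 32
```

Each additional insertion lengthens the segment by one pitch period while reusing the same intermediate signal, so that no further pitch estimation or averaging is needed.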
A better result without the introduction of spikes or similar discontinuities in the insertion may be achieved when an overlapping window is used when copying said fraction and inserting said intermediate signal.
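One possible form of such an overlapping window is a cross-fade at the point of insertion. The sketch below assumes a linear ramp and an insertion after the first pitch period; neither the window shape nor the overlap length is specified above, and the function name `insert_with_crossfade` is hypothetical.

```python
def insert_with_crossfade(segment, pitch, overlap):
    """Expand a segment by one pitch period, cross-fading over `overlap`
    samples from the original continuation into the repeated copy."""
    if overlap < 0 or pitch <= 0 or pitch + overlap > len(segment):
        raise ValueError("pitch + overlap must fit inside the segment")
    out = list(segment[:pitch])
    for k in range(overlap):
        w = (k + 1) / (overlap + 1)   # assumed linear ramp between 0 and 1
        old = segment[pitch + k]      # original continuation
        new = segment[k]              # start of the repeated pitch period
        out.append(old + w * (new - old))
    out.extend(segment[overlap:])     # remainder of the repeated signal
    return out


step = [0] * 4 + [10] * 4
print(insert_with_crossfade(step, pitch=4, overlap=2))
```

For a perfectly periodic signal the cross-fade leaves the waveform unchanged, and for other signals it replaces an abrupt jump at the insertion point by a gradual transition.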
A typical use of the method is in portable communications devices, and in an expedient embodiment the method is used in a mobile telephone.
As mentioned, the invention also relates to a device adapted to convert the speech rate of a speech signal having a pitch period below a maximum expected pitch period. The device comprises means for dividing the speech signal into segments, means for estimating the pitch period of the speech signal in a segment, means for copying a fraction of the speech signal in the segment, said fraction having a duration equal to said estimated pitch period, means for providing from the fraction an intermediate signal having the same duration, and means for expanding the segment by inserting said intermediate signal pitch synchronously into the speech signal of the segment. When the device is adapted to use a segment size longer than said maximum expected pitch period but shorter than twice the maximum expected pitch period, a considerably smaller amount of data has to be processed for a frame, so that the method can be implemented with the limited computational resources of e.g. a mobile telephone.
In an expedient embodiment the device is further adapted to provide, if the actual estimated pitch period of the segment is greater than half the segment size, the intermediate signal by using the copied fraction directly as the intermediate signal. This avoids the extra calculation of a composite signal.
If the actual estimated pitch period of a segment is less than half the segment size, the device may further be adapted to copy two consecutive fractions, each having a duration equal to the estimated pitch period, and to provide the intermediate signal as an average of the two consecutive fractions. In this way reverberation may be minimized for speech with shorter pitch periods, which have a higher risk of such reverberation.
When the device is further adapted to classify a segment of the speech signal as a silent segment if its content of speech information is below a preset threshold, and to shorten a segment if that segment and a number of immediately preceding segments have been classified as silent segments, so as to compensate for expansion of previous segments, it is possible to maintain the delay between the input signal and the (expanded) output signal at a very low level, thus providing a substantially real-time conversion of the speech. This makes the algorithm more suited for use in mobile telephones in which it is desired to keep the expanded speech as close to real time as possible.
An embodiment especially expedient for use in mobile telephones is obtained when the device is adapted to use a segment size of 20 ms, because this segment size is also used by the existing speech signal processing in many mobile telephones, and thus considerable computational resources can be saved by using the same segments for the speech expansion algorithm.
When the device is adapted to expand a segment by inserting the intermediate signal pitch synchronously into the speech signal of the segment a plurality of times, higher expansion rates can be achieved without increasing the use of computational resources considerably.
A better result without the introduction of spikes or similar discontinuities in the insertion may be achieved when the device is adapted to use an overlapping window when copying said fraction and inserting said intermediate signal.
In an expedient embodiment of the invention, the device is a mobile telephone, although it may also be other types of portable communications devices.
In another embodiment the device is an integrated circuit which can be used in different types of equipment.