In conventional communication systems, telephonic devices are designed to receive speech signals (as acoustic signals) from a user and convert the speech signals to digital signals before encoding them in a speech encoder for transmission over the communication system. The telephonic devices are designed such that the frequency response of the transfer function representing all stages from the acoustic signal to the digital signal matches the characteristics of the sending Intermediate Reference System (IRS) specified in the ITU-T P.48 standard (“Specification for an Intermediate Reference System”, ITU-T Recommendation P.48, 1988). The frequency characteristics of the Intermediate Reference System according to ITU-T P.48 are shown in FIG. 1.
The frequency characteristics of the IRS provide an emphasis to the speech frequency band that is considered most important for speech intelligibility. That is, more weight is given to the second formant frequencies than to the first formant frequencies, which is known to increase the intelligibility of clipped speech, as discussed in I. B. Thomas, “The Influence of First and Second Formants on the Intelligibility of Clipped Speech,” Journal of the Audio Engineering Society, Vol. 16, No. 2, 1968. It can be seen in FIG. 1 that the speech signal is a narrowband signal with frequencies limited to the range 0 to 4 kHz. The frequency response of the IRS attenuates frequency components of the speech signal below 2 kHz and above 3.4 kHz, whilst applying a gain to frequency components between 2 kHz and 3.4 kHz. It can also be seen that frequency components below 300 Hz are strongly attenuated.
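By way of illustration only, the qualitative shape of such a frequency weighting can be sketched as a piecewise gain applied to individual frequency components. The band edges (300 Hz, 2 kHz, 3.4 kHz) are taken from the description above; the gain values in dB and the function names are hypothetical placeholders and are not the values tabulated in ITU-T P.48 or plotted in FIG. 1.

```python
def irs_like_gain_db(freq_hz: float) -> float:
    """Return an illustrative gain in dB for one frequency component.

    The dB values below are assumed placeholders that only mirror the
    qualitative shape described in the text, not the P.48 tables.
    """
    if freq_hz < 300.0:
        return -20.0  # strongly attenuated (assumed value)
    if freq_hz < 2000.0:
        return -3.0   # attenuated (assumed value)
    if freq_hz <= 3400.0:
        return 3.0    # emphasised band (assumed value)
    return -20.0      # attenuated above 3.4 kHz (assumed value)


def db_to_amplitude(gain_db: float) -> float:
    """Convert a gain in dB to a linear amplitude factor."""
    return 10.0 ** (gain_db / 20.0)


def apply_weighting(components):
    """Scale (freq_hz, amplitude) pairs by the illustrative weighting."""
    return [(f, a * db_to_amplitude(irs_like_gain_db(f))) for f, a in components]


if __name__ == "__main__":
    # Two equal-amplitude tones, one below and one inside the emphasised band.
    for freq, amp in apply_weighting([(1000.0, 1.0), (2500.0, 1.0)]):
        print(f"{freq:7.1f} Hz -> amplitude {amp:.3f}")
```

With these assumed values, two tones of equal input amplitude leave the weighting with unequal amplitudes, which is the non-flat behaviour the text describes.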
By concentrating the energy of a narrowband signal into the second formant frequencies the intelligibility of the narrowband signal is improved, allowing improved intelligibility of a speech signal at a receiver of a call without increasing the bandwidth requirements.
Thus, conventional communication systems, for example the Public Switched Telephone Network (PSTN) based on fixed line and/or mobile networks, are designed to have average frequency responses as defined in the IRS specification, that emphasize the second formant frequencies. However, the increase in intelligibility of the speech signals comes at the expense of speech naturalness. A speech signal is distorted by applying the frequency response shown in FIG. 1 such that applying the frequency response alters the perception of the speech signal, i.e. it does not sound completely natural (or unaltered). The naturalness of the speech is affected because different levels of attenuation (or amplification) are applied to different frequency components of the speech signal (i.e. the frequency response shown in FIG. 1 is not flat).
U.S. Pat. No. 5,195,132 by Bowker et al. describes a method of enhancing a speech signal transmitted between telephone stations, wherein the enhancement is performed at some point along the connection between the transmitting and receiving telephone stations, such that the speech signal is enhanced before arriving at the receiving telephone station. The frequency range 100 to 300 Hz is amplified relative to the remainder of the telephone passband before the speech signals are supplied to the receiving telephone station.
The paper by Y. Qian and P. Kabal, “Combining Equalization and Estimation for Bandwidth Extension of Narrowband Speech”, in Proc. IEEE Int. Conf. Acoust. Speech Sign. Process., 2004, pp. 713-716, discloses the use of a fixed equalizer to filter received speech signals prior to applying a system of Artificial Bandwidth Extension (ABE). The equalization is employed to expand the apparent bandwidth of the narrowband speech signal, and is applied at both low and high frequencies. The equalizer is designed specifically for the ITU-T G.712 specification (“Transmission performance characteristics of pulse code modulation channels”, ITU-T Recommendation G.712, November 1996) and provides a 10 dB gain in the frequency range 3.8 to 4 kHz and a 10 dB gain at 100 Hz. Between 100 and 3800 Hz the frequency response of the equalizer is essentially flat.
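A minimal sketch of a fixed equalizer gain profile of this kind is given below, for illustration only. The 10 dB figures and the band edges (100 Hz, 3.8 kHz, 4 kHz) come from the description above. The step-shaped transitions, the treatment of frequencies below 100 Hz, and all function names are assumptions made for the sketch and are not taken from the cited paper.

```python
def equalizer_gain_db(freq_hz: float) -> float:
    """Return the fixed equalizer gain in dB for one frequency component.

    Assumption: hard steps at the band edges, and the low-frequency boost
    applied at and below 100 Hz. The text only states the gain at 100 Hz.
    """
    if freq_hz <= 100.0:
        return 10.0  # low-frequency boost
    if 3800.0 <= freq_hz <= 4000.0:
        return 10.0  # high-frequency boost
    return 0.0       # essentially flat elsewhere


def equalize(components):
    """Scale (freq_hz, amplitude) pairs by the fixed equalizer gain."""
    return [
        (f, a * 10.0 ** (equalizer_gain_db(f) / 20.0))
        for f, a in components
    ]


if __name__ == "__main__":
    # Equal-amplitude tones at the low edge, mid band and high edge.
    for freq, amp in equalize([(100.0, 1.0), (1000.0, 1.0), (3900.0, 1.0)]):
        print(f"{freq:7.1f} Hz -> amplitude {amp:.3f}")
```

Because the gains are fixed, the same boost is applied regardless of the frequency response the speech signal has already been subjected to.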
The two prior art systems described above apply fixed gains to particular sections of the speech signal, such as the low frequency components (e.g. 100-300 Hz) and/or the high frequency components (e.g. 3.8-4 kHz). This results in different levels of attenuation (or amplification) for different frequency components of the speech signal. Although these systems can improve the naturalness of the speech signal to a certain degree, the speech signal will still not sound completely natural.
It is an aim of embodiments of the present invention to improve the naturalness of the speech signal.