The invention relates to a method as well as a device for the artificial extension of the bandwidth of speech signals.
Speech signals cover a wide frequency range that extends from the fundamental speech frequency, which depending on the speaker lies in the range of 80 to 160 Hz, up to frequencies beyond 10 kHz. However, during speech communication via particular transmission media, such as telephones, only a limited segment of this range is transmitted for reasons of bandwidth efficiency, whereby a sentence intelligibility of approximately 98% is still ensured.
Corresponding to the minimum bandwidth from 300 Hz to 3.4 kHz specified for the telephone system, a speech signal can essentially be divided into three frequency ranges. Each of these frequency ranges characterizes specific speech properties as well as subjective perceptions. Lower frequencies below approximately 300 Hz primarily arise during sonorous speech segments such as vowels. This frequency range contains tonal components, in particular the fundamental speech frequency and, depending on the pitch of the voice, possibly several harmonics.
These low frequencies are important for the subjective perception of the volume and dynamics of a speech signal. The fundamental speech frequency itself, however, can be perceived by a human listener even if the low frequencies are missing, as a result of the psycho-acoustic property of virtual pitch perception, which derives the pitch from the harmonic structure in higher frequency ranges. Medium frequencies in the range from approximately 300 Hz to approximately 3.4 kHz are essentially always present in the speech signal during speech activity. Their time-variant spectral coloration by multiple formants, together with the temporal and spectral fine structure, characterizes the spoken sound or phoneme in each instance. The medium frequencies thus carry the main part of the information relevant for the intelligibility of the speech.
High-frequency components above approximately 3.4 kHz, by contrast, arise during unvoiced sounds, and particularly strongly during sharp sounds such as “s” or “f”. In addition, so-called plosive sounds like “k” or “t” have a wide spectrum with strong high-frequency components. The signal therefore has more of a noisy character than a tonal character in this upper frequency range. The structure of the formants that are also present in this range is relatively time-invariant, but varies between speakers. The high-frequency components are of considerable importance for the clarity, presence and naturalness of a speech signal, because without them the speech sounds dull. Furthermore, these high-frequency components make possible a better differentiation between fricatives and consonants, and thereby also ensure increased intelligibility of the speech.
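The division into three frequency ranges can be made concrete with a short sketch. It assumes the 300 Hz and 3.4 kHz telephone band edges given above, a hypothetical fundamental of 120 Hz, and an upper limit of 7 kHz; the names `band_of` and `harmonics_by_band` are illustrative only. Treating the upper range as a set of harmonics is a simplification, since the signal there is described above as predominantly noise-like.

```python
TELEPHONE_BAND_HZ = (300.0, 3400.0)  # band edges taken from the text

def band_of(freq_hz, band=TELEPHONE_BAND_HZ):
    """Classify a frequency as 'low', 'medium' or 'high' relative to the band."""
    lower, upper = band
    if freq_hz < lower:
        return "low"
    if freq_hz <= upper:
        return "medium"
    return "high"

def harmonics_by_band(f0_hz, max_hz=7000.0):
    """Group the harmonic numbers of f0_hz (up to max_hz) by frequency range."""
    groups = {"low": [], "medium": [], "high": []}
    k = 1
    while k * f0_hz <= max_hz:
        groups[band_of(k * f0_hz)].append(k)
        k += 1
    return groups

groups = harmonics_by_band(120.0)
print(groups["low"])  # [1, 2]: the fundamental and second harmonic are cut off
print(groups["medium"][0], groups["medium"][-1])  # 3 28
```

For this assumed speaker, a telephone channel removes the fundamental and the second harmonic entirely, while harmonics 3 to 28 (360 Hz to 3360 Hz) pass through.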
When a speech signal is transmitted via a speech communications system comprising a transmission channel with a limited bandwidth, the goal is always to transmit the speech signal from a transmitter to a receiver with the best possible quality. The speech quality is, however, a subjective variable with a plurality of components, of which the intelligibility of the speech signal is the most important for a speech communications system of this type.
A relatively high level of speech intelligibility can already be achieved with modern digital transmission systems. At the same time, it is known that the subjective assessment of the speech signal can be improved by extending the telephone bandwidth at high frequencies (higher than 3.4 kHz) as well as at low frequencies (lower than 300 Hz). For the purpose of a subjective quality improvement, systems for speech communication should therefore target a bandwidth that is increased in comparison to the normal telephone bandwidth. One possible approach consists in modifying the transmission and achieving a wider transmitted bandwidth by means of an encoding method; an alternative consists in performing an artificial bandwidth extension. Through a bandwidth extension of this type, the frequency bandwidth on the receiver side is widened to the range from 50 Hz to 7 kHz. Suitable signal processing algorithms allow parameters of a wideband model to be determined from short segments of a narrowband speech signal using methods of pattern recognition, said parameters then being used to estimate the missing signal components of the speech. With this method, a wideband equivalent with frequency components in the range from 50 Hz to 7 kHz is created from the narrowband speech signal, and an improvement in the subjectively perceived speech quality is effected.
In current speech signal and audio signal encoding algorithms, techniques of artificial bandwidth extension are additionally used. For example, in the wideband range (acoustic bandwidth of 50 Hz to 7 kHz), speech encoding standards such as the AMR-WB (Adaptive Multirate Wideband) encoding-decoding algorithm are used. With this AMR-WB standard, upper frequency subbands (frequency range of approximately 6.4 to 7 kHz) are extrapolated from lower frequency components. In encoding-decoding methods of this type, the bandwidth extension is generally produced from a comparatively small amount of ancillary information. This ancillary information can be filter coefficients or amplification factors, for instance, whereby the filter coefficients can be produced by an LPC (Linear Predictive Coding) method, for example. This ancillary information is transmitted to a receiver in an encoded bitstream. Other standards based on the bandwidth extension technique are currently AMR-WB+ and the extended aacPlus speech/audio encoding-decoding method. Methods that are designed to encode and decode information are called codecs and include both an encoder and a decoder. Every digital telephone, regardless of whether it is designed for a fixed network or a mobile radio network, contains a codec of the type that converts analogue signals into digital signals, and digital signals into analogue signals. A codec of this type can be implemented in hardware or in software.
In current implementations of speech/audio signal encoding algorithms in which the technology for bandwidth extension is used, components of an extension band, for example in the frequency range from 6.4 to 7 kHz, are encoded and decoded by the LPC encoding technology already mentioned. In doing so, an LPC analysis of the extension band of the input signal is carried out in an encoder, and the LPC coefficients as well as amplification factors for subframes of a residual signal are encoded. The residual signal of the extension band is produced in a decoder, and the transmitted amplification factors and the LPC synthesis filter are used to generate an output signal. The approach described above can be applied either directly to the wideband input signal or to a critically downsampled subband signal of the extension band.
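The encoder and decoder steps described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the procedure of AMR-WB or of any other standard: the LPC order of 4, the subframe length of 40 samples, the use of the subframe RMS as amplification factor, and the white-noise excitation in the decoder are all chosen for the example, and quantization of the parameters is omitted. All function names are hypothetical.

```python
import math
import random

def autocorr(x, order):
    """Autocorrelation values r[0..order] of a frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(order + 1)]

def levinson_durbin(r, order):
    """Solve for A(z) = 1 + a[1] z^-1 + ...; returns (a, prediction_error)."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:  # silent or perfectly predictable frame
            break
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err  # reflection coefficient
        prev = a[:]
        for j in range(1, i):
            a[j] = prev[j] + k * prev[i - j]
        a[i] = k
        err *= 1.0 - k * k
    return a, err

def analysis_filter(x, a):
    """Residual e[n] = sum_k a[k] * x[n-k] (inverse filter A(z))."""
    return [sum(a[k] * x[n - k] for k in range(len(a)) if n - k >= 0)
            for n in range(len(x))]

def synthesis_filter(e, a):
    """Output y[n] = e[n] - sum_{k>=1} a[k] * y[n-k] (filter 1/A(z))."""
    y = []
    for n in range(len(e)):
        y.append(e[n] - sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0))
    return y

def subframe_gains(e, subframe_len):
    """One amplification factor (RMS) per subframe of the residual."""
    gains = []
    for start in range(0, len(e), subframe_len):
        seg = e[start:start + subframe_len]
        gains.append(math.sqrt(sum(v * v for v in seg) / len(seg)))
    return gains

def encode(frame, order=4, subframe_len=40):
    """Encoder side: LPC coefficients plus subframe gains of the residual."""
    a, _ = levinson_durbin(autocorr(frame, order), order)
    return a, subframe_gains(analysis_filter(frame, a), subframe_len)

def decode(a, gains, subframe_len=40, seed=0):
    """Decoder side: gain-scaled noise excitation through the synthesis filter."""
    rng = random.Random(seed)
    excitation = []
    for g in gains:
        noise = [rng.gauss(0.0, 1.0) for _ in range(subframe_len)]
        rms = math.sqrt(sum(v * v for v in noise) / subframe_len)
        excitation.extend(g * v / rms for v in noise)
    return synthesis_filter(excitation, a)
```

Only the coefficients `a` and the list of gains would need to be transmitted in such a scheme; the decoder regenerates a residual of matching subframe energy locally, which is why the amount of ancillary information remains small.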
In the extended aacPlus encoding standard, the SBR (Spectral Band Replication) technique is used. Here, the wideband audio signal is split into frequency subbands by a 64-channel QMF filter bank. For the high-frequency filter bank channels, a sophisticated parametric encoding is applied to the subbands of the signal components, which requires a large number of detectors and estimators to control the bitstream content. Even though an improvement, in particular in the speech quality of speech signals, can already be achieved using the known standards and encoding-decoding methods, a further improvement in this speech quality is nevertheless to be targeted. Furthermore, the standards and encoding-decoding methods described above involve very considerable effort and have a very complex structure.