This invention relates to a method and apparatus for postfiltering a digitally processed signal.
To enable transmission of speech at low bit rates, various types of speech encoders have been developed which compress a speech signal before it is transmitted. On receipt of the compressed signal, the receiver decompresses it before it is finally converted back into an audio signal.
Even though, over the same bandwidth, a compressed speech signal allows more information to be transmitted than an uncompressed signal, the quality of digitally compressed speech signals is often degraded by, for example, background noise, coding noise and by noise due to transmission over a channel.
In particular, as the encoding rate of the processed signal is reduced, the SNR also drops and the noise floor of the coding noise rises. At low encoding rates it can become impossible to keep the noise below the audible masking threshold and hence the noise can contribute to the overall roughness of the speech signal.
Two techniques have been developed to deal with this problem. The first technique uses noise spectral shaping at the speech encoder. The idea behind spectral shaping is to shape the spectrum of the coding noise so that it follows the speech spectrum, otherwise known as the speech spectral envelope. Spectrally shaped noise, when coded, is less audible to the human ear due to the noise masking effect of the human auditory system. However, at low encoding rates noise spectral shaping alone is not sufficient to make the coding noise inaudible. For example, even with noise spectral shaping, the quality of a Code Excited Linear Prediction (CELP) coder having an encoding rate of 4.8 kb/s is still perceived as rough or noisy.
The second technique uses an adaptive postfilter at the speech decoder output, which typically comprises a short term postfilter element and a long term postfilter element. The purpose of the long term postfilter is to attenuate frequency components between pitch harmonic peaks, whereas the purpose of the short term postfilter is to accurately track the time-varying nature of the speech signal and suppress the noise residing in the spectral valleys. The frequency response of the short term postfilter typically corresponds to a modified version of the speech spectrum, where the postfilter has local minimums in the regions corresponding to the spectral valleys and local maximums at the spectral peaks, otherwise known as formant frequencies. The dips in the regions corresponding to the spectral valleys (i.e. the local minimums) suppress the noise there, which reduces the noise in the perceived speech signal. The local maximums allow for more noise in the formant regions, where it is masked by the speech signal. However, some speech distortion is introduced because the relative signal levels in the formant regions are altered by the postfiltering.
Most speech codecs use a time-domain postfilter based on U.S. Pat. No. 4,969,192. In this technique the postfiltering is implemented temporally as a difference equation. As such, the postfilter is described by a single transfer function, so it is not possible to control the different portions of the frequency spectrum independently. As a result, suppressing the noise around the spectral valleys also sharpens the formant peaks and thereby distorts the speech signal.
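To illustrate what implementing a filter "temporally as a difference equation" means, the following is a minimal Python sketch of a generic recursive difference equation. It is not the specific postfilter of the cited patent; the function name `difference_equation` and the coefficient lists `b` and `a` are assumptions made for illustration only. Every coefficient contributes to the frequency response as a whole, which is why individual regions of the spectrum cannot be adjusted in isolation.

```python
def difference_equation(x, b, a):
    """Generic difference equation (illustrative, not the cited postfilter).

    Computes y[n] = sum_i b[i] * x[n - i] - sum_{j >= 1} a[j] * y[n - j],
    assuming a[0] == 1. Samples before n = 0 are treated as zero.
    """
    y = []
    for n in range(len(x)):
        # Feed-forward part: weighted sum of current and past inputs.
        acc = sum(b[i] * x[n - i] for i in range(len(b)) if n - i >= 0)
        # Feedback part: weighted sum of past outputs.
        acc -= sum(a[j] * y[n - j] for j in range(1, len(a)) if n - j >= 0)
        y.append(acc)
    return y


# Impulse response of a one-pole filter y[n] = x[n] + 0.5 * y[n - 1].
impulse_response = difference_equation([1.0, 0.0, 0.0, 0.0], [1.0], [1.0, -0.5])
```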
Consequently, most current short term postfilters shape the spectrum such that the formants become narrower and more sharply peaked. Whilst this reduces the noise in the valleys, it has the side effect of altering the spectral shape such that the speech becomes boomy and less natural. This effect is especially prevalent when large amounts of postfiltering are applied to the signal, as is the case for Pitch Synchronous Innovation-CELP (PSI-CELP).
In accordance with one aspect of the present invention there is provided a method for calculating a short term postfilter frequency response for filtering digitally processed speech, the method comprising identifying at least one formant of the speech spectrum; and normalising points of the speech spectrum with respect to the magnitude of an identified formant.
Using this method it is possible to independently control different portions of the frequency spectrum.
Preferably the points of the speech spectrum are normalised with respect to the magnitude of the nearest formant.
Most preferably the points of the speech spectrum are normalised according to a function of the form

$$R_{\text{post}}(k) = \left( \frac{R(k)}{R_{\text{form}}(k)} \right)^{\beta}$$

where R(k) is the amplitude of the spectrum at a frequency k, Rform(k) is the amplitude of the spectrum at a frequency k which corresponds to an identified formant frequency, and β controls the degree of postfiltering, where

$$\beta = \frac{k_{\min} - k}{k_{\min} - k_{\max}} \cdot \gamma \quad \text{for } k_{\max} < k \le k_{\min}$$

and

$$\beta = \frac{k_{\min} - k}{k_{\min} - k_{\max}} \cdot \gamma \quad \text{for } k_{\min} < k \le k_{\max}$$

where k is a point in frequency, kmin is the frequency of a spectral valley, kmax is the frequency of a formant and γ controls the degree of postfiltering, i.e. controls the depth of the postfilter valleys.
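The normalisation function and its exponent can be sketched in Python as below. This is a minimal sketch under stated assumptions: frequencies are treated as bin indices, magnitudes are positive, and the function names `beta` and `r_post` are chosen here for illustration and do not come from the text. Both stated ranges of k use the same expression, so a single function covers them.

```python
def beta(k, k_min, k_max, gamma):
    """Exponent ((k_min - k) / (k_min - k_max)) * gamma as given in the text.

    k_min is the valley position, k_max the formant position (assumed to be
    distinct bin indices); gamma sets the degree of postfiltering.
    """
    if k_min == k_max:
        raise ValueError("k_min and k_max must differ")
    return (k_min - k) / (k_min - k_max) * gamma


def r_post(r_k, r_form, b):
    """Postfilter magnitude (R(k) / R_form(k)) ** beta for positive magnitudes."""
    return (r_k / r_form) ** b


# Hypothetical values: formant at bin 10 (magnitude 8.0), valley at bin 20,
# gamma = 0.5, and a spectrum magnitude of 2.0 at bin 15.
b_mid = beta(15, k_min=20, k_max=10, gamma=0.5)   # halfway between the two
h_mid = r_post(2.0, 8.0, b_mid)
```

With the expression as written, the exponent equals γ at the formant (k = kmax) and falls linearly to zero at the valley (k = kmin).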
Preferably the at least one formant is identified by finding a first derivative of the speech spectrum.
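One way this identification might be sketched is to take the first difference of a sampled magnitude spectrum as a discrete stand-in for the first derivative and to look for sign changes. The sketch below assumes the spectrum is a list of magnitudes indexed by frequency bin; the function name `find_peaks_and_valleys` is hypothetical, and returning the valleys as well is an addition made here because the valley positions are needed when interpolating between a formant and a valley.

```python
def find_peaks_and_valleys(spectrum):
    """Locate local maxima and minima from sign changes of the first difference.

    Returns (peaks, valleys) as lists of bin indices. Peaks are candidate
    formants; the end points of the spectrum are never reported.
    """
    # Discrete approximation of the first derivative.
    diff = [nxt - cur for cur, nxt in zip(spectrum, spectrum[1:])]
    peaks, valleys = [], []
    for i in range(1, len(diff)):
        if diff[i - 1] > 0 and diff[i] <= 0:
            peaks.append(i)      # slope changes from rising to falling
        elif diff[i - 1] < 0 and diff[i] >= 0:
            valleys.append(i)    # slope changes from falling to rising
    return peaks, valleys


# Hypothetical magnitude spectrum with two peaks and one valley.
peaks, valleys = find_peaks_and_valleys([1, 3, 6, 4, 2, 1, 2, 5, 3])
```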
In accordance with a second aspect of the present invention there is provided a postfiltering method for enhancing a digitally processed speech signal, the method comprising obtaining a speech spectrum of the digitally processed signal; identifying at least one formant of the speech spectrum; normalising points of the speech spectrum with respect to the magnitude of an identified formant to produce a postfilter frequency response; and filtering the speech spectrum of the digitally processed signal with the postfilter frequency response.
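The four steps of this method can be sketched end to end as follows. This is a simplified illustration rather than a definitive implementation: it assumes the speech spectrum is already available as a list of positive magnitudes per frequency bin, normalises each point against the nearest identified formant, uses a constant exponent `beta` instead of a frequency-dependent one, and treats "filtering" as a bin-by-bin multiplication of the spectrum by the response. All function names are hypothetical.

```python
def find_formants(spectrum):
    """Return bin indices of local maxima (rising-to-falling first difference)."""
    diff = [nxt - cur for cur, nxt in zip(spectrum, spectrum[1:])]
    return [i for i in range(1, len(diff)) if diff[i - 1] > 0 and diff[i] <= 0]


def postfilter_response(spectrum, formants, beta):
    """Normalise each point by the magnitude of its nearest formant."""
    response = []
    for k, r_k in enumerate(spectrum):
        # Nearest formant by bin distance; ties go to the lower-frequency one.
        k_form = min(formants, key=lambda f: abs(f - k))
        response.append((r_k / spectrum[k_form]) ** beta)
    return response


def apply_postfilter(spectrum, beta=0.5):
    """Identify formants, build the response, and multiply it into the spectrum."""
    formants = find_formants(spectrum)
    if not formants:
        return list(spectrum)  # nothing to normalise against
    response = postfilter_response(spectrum, formants, beta)
    return [r_k * h_k for r_k, h_k in zip(spectrum, response)]


# Hypothetical spectrum with formants at bins 2 and 6.
filtered = apply_postfilter([1.0, 4.0, 16.0, 4.0, 1.0, 4.0, 9.0, 4.0], beta=0.5)
```

In this example the response is 1 at each formant bin, so the formant magnitudes 16.0 and 9.0 pass through unchanged while the surrounding points are attenuated.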
In accordance with a third aspect of the present invention there is provided a postfilter comprising identifying means for identifying at least one formant of a digitally processed speech spectrum; normalising means for normalising points of the speech spectrum with respect to the magnitude of an identified formant to produce a postfilter frequency response; and means for filtering the digitally processed speech spectrum with the postfilter frequency response.
In accordance with a fourth aspect of the present invention there is provided a radiotelephone comprising a postfilter, the postfilter having identifying means for identifying at least one formant of a digitally processed speech spectrum; normalising means for normalising points of the speech spectrum with respect to the magnitude of an identified formant to produce a postfilter frequency response; and means for filtering the digitally processed speech spectrum with the postfilter frequency response.