1. Field of the Invention
The invention concerns a method for the multireference correction of voice spectral deformations introduced by a communication network. It also concerns a system for implementing the method.
The aim of the present invention is to improve the quality of the speech transmitted over communication networks, by offering means for correcting the spectral deformations of the speech signal, deformations caused by various links in the network transmission chain.
The description given hereinafter refers explicitly to the transmission of speech over “conventional” (that is to say cabled) telephone lines, but also applies to any type of communication network (fixed, mobile or other) which introduces spectral deformations into the signal, provided that the parameters taken as a reference for specifying the network are modified according to the network.
2. Description of Prior Art
The various deformations encountered in the case of the switched telephone network (STN) will be stated below.
1.1. Degradations in the Timbre of the Voice on the STN Network:
FIG. 1 depicts a diagram of an STN connection. The speech emitted by a speaker is transmitted by a sending terminal 10, is transported by the subscriber line 20, undergoes an analogue-to-digital conversion 30 (A-law), is transmitted by the digital network 40, undergoes a digital (A-law) to analogue conversion 50, is transmitted by the subscriber link 60, and passes through the receiving terminal 70 in order finally to be received by the destination person.
Each speaker is connected by an analogue line (twisted pair) to the closest telephone exchange. This is a base band analogue transmission referenced 1 and 3 in FIG. 1. The connection between the exchanges follows an entirely digital network. The spectrum of the voice is affected by two types of distortion during the analogue transmission of the base band signal.
The first type of distortion is the band-pass filtering of the terminals and the points of access to the digital part of the network. The typical characteristics of this filtering are described by UIT-T under the name “intermediate reference system” (IRS) (UIT-T, Recommendation P.48, 1988). These frequency characteristics, which result from measurements made during the 1970s, are however becoming obsolete. This is why the UIT-T has recommended since 1996 using a “modified” IRS (UIT-T, Recommendation P.830, 1996), the nominal characteristic of which is depicted in FIG. 2 for the transmission part and in FIG. 3 for the receiving part. Between 200 and 3400 Hz, the tolerance is ±2.5 dB; below 200 Hz, the decrease in the characteristic of the global system must be at least 15 dB per octave. The transmission and reception parts of the IRS are called respectively, according to the UIT-T terminology, the “transmitting system” and the “receiving system”.
The second distortion affecting the voice spectrum is the attenuation of the subscriber lines. In a simple model of the local analogue line (given in a CNET Technical Note NT/LAA/ELR/289 by Cadoret, 1983), it is considered that this introduces an attenuation of the signal whose value in dB depends on its length and is proportional to the square root of the frequency. The attenuation is 3 dB at 800 Hz for an average line (approximately 2 km), 9.5 dB at 800 Hz for longer lines (up to 10 km). According to this model, the expression for the attenuation of a line, depicted in FIG. 4, is:
$$A_{dB}(f) = A_{dB}(800\ \text{Hz})\,\sqrt{\frac{f}{800}} \qquad (0.1)$$
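By way of illustration only, the line model of equation (0.1) can be evaluated numerically as sketched below. The function name `line_attenuation_db` and the choice of test frequencies are assumptions made for this example; the 3 dB and 9.5 dB values at 800 Hz are those quoted above for an average line and a long line.

```python
import math

def line_attenuation_db(f_hz, a800_db):
    """Attenuation (dB) at f_hz of a line whose attenuation at 800 Hz is a800_db.

    Implements the square-root-of-frequency model of equation (0.1).
    """
    return a800_db * math.sqrt(f_hz / 800.0)

AVERAGE_LINE_DB = 3.0  # attenuation at 800 Hz, average line (about 2 km)
LONG_LINE_DB = 9.5     # attenuation at 800 Hz, long line (up to 10 km)

# Illustrative frequencies chosen for this example only.
for f in (200, 800, 3200):
    avg = line_attenuation_db(f, AVERAGE_LINE_DB)
    long_ = line_attenuation_db(f, LONG_LINE_DB)
    print(f"{f} Hz: average line {avg:.2f} dB, long line {long_:.2f} dB")
```

Under this model the attenuation doubles each time the frequency is multiplied by four, so that a long line attenuates the top of the telephone band markedly more than the bottom.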
To these distortions there is added the anti-aliasing filtering of the PCM coder (ref 30). The latter is typically a 200-3400 Hz bandpass filter with a response which is almost flat over the passband and high attenuation outside the band, according to the template in FIG. 5 for example (National Semiconductor, August 1994: Technical Documentation TP3054, TP3057).
Finally, the voice suffers spectral distortion as depicted in FIG. 6 for the various combinations of three types of analogue line in transmission and reception (that is to say 6 distortions), assuming equipment complying with the nominal characteristic of the modified IRS. The voice thus appears to be stifled if one of the analogue lines is long and in all cases suffers from a lack of “presence” due to the attenuation of the low-frequency components.
1.2. Degradations in the Timbre of the Voice on the ISDN Network and the GSM Mobile Network
In ISDN and the GSM network, the signal is digitised as from the terminal. The only analogue parts are the transmission and reception transducers associated with their respective amplification and conditioning chains. The UIT-T has defined frequency efficacy templates for transmission depicted in FIG. 7, and for reception depicted in FIG. 8, valid both for cabled digital telephones (UIT-T, Recommendation P.310, May 2000) and mobile digital or wireless terminals (UIT-T, Recommendation P.313, September 1999).
Moreover, for GSM networks, it is recognised that coding and decoding slightly modify the spectral envelope of the signal. This alteration is shown in FIG. 9 for pink noise coded and then decoded in EFR (Enhanced Full Rate) mode.
The effect of these filterings on the timbre is mainly an attenuation of the low-frequency components, less marked however than in the case of STN.
The invention concerns the correction of these spectral distortions by means of a centralized processing, that is to say a device installed in the digital part of the network, as indicated in FIG. 10 for the STN.
The objective of a correction of the voice timbre is that the voice timbre in reception is as close as possible to that of the voice emitted by the speaker, which will be termed the original voice.
2. Prior Art
Compensation for the spectral distortions introduced into the speech signal by the various elements of the telephone connection is at the present time provided by equalization-based devices. The equalization can be fixed or can be adapted according to the transmission conditions.
2.1. Fixed Equalization
Centralised equalization devices were proposed in the patents U.S. Pat. Nos. 5,333,195 (Duane O. Bowker) and 5,471,527 (Helena S. Ho). These equalizers are fixed filters which restore the level of the low frequencies attenuated by the transmitter. Bowker proposes for example a gain of 10 to 15 dB on the 100-300 Hz band. These methods have two drawbacks:
    The equalizer compensates only for the filtering of the transmitter, so that on reception the low-frequency components remain greatly attenuated by the IRS reception filtering.
    This fixed equalization compensates for the average transmission conditions (transmission system and line). If the actual conditions are too different (for example if the analogue lines are long) the device does not sufficiently correct the timbre, or even impairs it more than the connection without equalization.
2.2. Adaptive Equalization
The invention described in the patent U.S. Pat. No. 5,915,235 (Andrew P De Jaco) aims to correct the non-ideal frequency response of a mobile telephone transducer. The equalizer is described as being placed between the analogue to digital converter and the CELP coder but can be equally well in the terminal or in the network. The principle of equalization is to bring the spectrum of the received signal close to an ideal spectrum. Two methods are proposed.
The first method (illustrated by FIG. 4 in the aforementioned patent of De Jaco) consists of calculating long-term autocorrelation coefficients RLT:
RLT(n,i) = α RLT(n−1,i) + (1−α) R(n,i),  (0.2)
with RLT(n,i) the ith long-term autocorrelation coefficient at the nth frame, R(n,i) the ith autocorrelation coefficient specific to the nth frame, and α a smoothing constant fixed for example at 0.995. From these coefficients there are derived the long-term LPC coefficients, which are the coefficients of a whitening filter. At the output of this filter, the signal is filtered by a fixed filter which imprints on it the ideal long-term spectral characteristics, i.e. those which it would have at the output of a transducer having the ideal frequency response. These two filters are supplemented by a multiplicative gain equal to the ratio between the long-term energies of the input of the whitener and the output of the second filter.
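By way of illustration only, the recursion (0.2) and the derivation of a whitening filter from the long-term autocorrelation coefficients may be sketched as follows. The use of the Levinson-Durbin recursion to obtain the LPC coefficients is an assumption of this example (the text above does not name the algorithm), as are all function names.

```python
def autocorrelation(frame, order):
    """Autocorrelation coefficients R(n, 0..order) of one frame."""
    n = len(frame)
    return [sum(frame[k] * frame[k + i] for k in range(n - i))
            for i in range(order + 1)]

def update_long_term(r_lt, r, alpha=0.995):
    """One step of equation (0.2), applied to every lag i."""
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(r_lt, r)]

def levinson_durbin(r):
    """Prediction-error (whitening) filter [1, a1, ..., ap] and residual energy.

    Assumption of this sketch: r[0] > 0, i.e. the long-term estimate has
    been initialised from a non-silent frame.
    """
    a = [1.0]
    err = r[0]
    for i in range(1, len(r)):
        acc = sum(a[j] * r[i - j] for j in range(i))
        k = -acc / err
        a = [1.0] + [a[j] + k * a[i - j] for j in range(1, i)] + [k]
        err *= (1.0 - k * k)
    return a, err

def whiten(signal, a):
    """Apply the FIR whitening filter a to signal (zero initial state)."""
    return [sum(a[j] * signal[n - j] for j in range(len(a)) if n - j >= 0)
            for n in range(len(signal))]

# Illustrative long-term autocorrelation of a first-order process.
r_lt = [1.0, 0.5, 0.25]
coeffs, residual = levinson_durbin(r_lt)
print(coeffs, residual)
```

The fixed shaping filter and the final multiplicative gain described above are not included in this sketch.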
The second method, illustrated by FIG. 5 of the aforementioned De Jaco patent, consists of dividing the signal into sub-bands and, for each sub-band, applying a multiplicative gain so as to reach a target energy, this gain being defined as the ratio between the target energy of the sub-band and the long-term energy (obtained by a smoothing of the instantaneous energy) of the signal in this sub-band.
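By way of illustration only, the sub-band gain computation of this second method may be sketched as follows. The smoothing constant `alpha`, the `floor` guard against division by zero and the numerical values are assumptions of this example. The gain is applied here to band energies, as in the definition above; whether the signal amplitudes would be scaled by the square root of this gain is not stated in the text.

```python
def smooth_energy(e_long_term, e_instant, alpha=0.99):
    """Long-term energy obtained by smoothing the instantaneous energy.

    alpha is a hypothetical smoothing constant chosen for this example.
    """
    return alpha * e_long_term + (1.0 - alpha) * e_instant

def subband_gains(target_energies, long_term_energies, floor=1e-12):
    """Gain per sub-band: target energy divided by long-term energy."""
    return [t / max(e, floor)
            for t, e in zip(target_energies, long_term_energies)]

# Illustrative values for three sub-bands.
target = [1.0, 0.5, 0.25]
long_term = [0.5, 0.5, 1.0]

gains = subband_gains(target, long_term)
equalized = [g * e for g, e in zip(gains, long_term)]
print(gains, equalized)
```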
These two methods have the drawback of correcting only the non-ideal response of the transmission system and not that of the reception system.
The object of the device of the patent U.S. Pat. No. 5,905,969 (Chafik Mokbel) is to compensate for the filtering of the transmission signal and of the subscriber line in order to improve the centralised recognition of the speech and/or the quality of the speech transmitted. As presented by FIG. 3a in Mokbel, the spectrum of the signal is divided into 24 sub-bands and each sub-band energy is multiplied by an adaptive gain. The gain is adapted according to the stochastic gradient algorithm, by minimisation of the squared error, the error being defined as the difference between the sub-band energy and a reference energy defined for each sub-band. The reference energy is modulated for each frame by the energy of the current frame, so as to respect the natural short-term variations in level of the speech signal. The convergence of the algorithm makes it possible to obtain as an output the 24 equalized sub-band signals.
If the application aimed at is the improvement in the voice quality, the equalized speech signal is obtained by inverse Fourier transform of the equalized sub-band energy.
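By way of illustration only, a stochastic-gradient adaptation of one gain per sub-band may be sketched as follows. This is a generic normalised least-mean-squares update and is not claimed to be the update rule of the Mokbel patent; the step size `mu`, the normalisation, the reference profile and the two-band example are all assumptions of this sketch.

```python
def adapt_gain(gain, band_energy, reference_energy, mu=0.5, eps=1e-12):
    """One normalised stochastic-gradient step on a sub-band gain.

    The error is the reference energy minus the equalized band energy;
    the step moves the gain so as to reduce the squared error.
    """
    error = reference_energy - gain * band_energy
    return gain + mu * error * band_energy / (band_energy * band_energy + eps)

# Hypothetical reference profile (share of frame energy expected per band).
ref_profile = [0.5, 0.5]
# Hypothetical stationary band energies of the received signal.
band_energies = [2.0, 0.5]

gains = [1.0, 1.0]
for _ in range(60):
    frame_energy = sum(band_energies)
    # Reference energy modulated by the energy of the current frame.
    reference = [p * frame_energy for p in ref_profile]
    gains = [adapt_gain(g, e, r)
             for g, e, r in zip(gains, band_energies, reference)]

print(gains)
```

With stationary inputs as in this example, each gain settles at the ratio between the reference energy and the band energy.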
The Mokbel patent does not mention any results in terms of improvement in the voice quality, and recognises that the method is sub-optimal, in that it uses a circular convolution. Moreover, it is doubtful that a speech signal can be reconstructed correctly by the inverse Fourier transform of band energies distributed according to the mel scale. Finally, the device described does not correct the filtering of the reception signal and of the analogue reception line.
In the “Mokbel” method of cepstral subtraction, the line effect is compensated for in order to improve the robustness of the speech recognition. It is shown that the cepstrum of the transmission channel can be estimated by means of the mean cepstrum of the signal received, the latter first being whitened by a pre-emphasis filter. This method affords a clear improvement in the performance of the recognition systems but is considered to be an “off-line” method, 2 to 4 seconds being necessary for estimating the mean cepstrum.
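By way of illustration only, the subtraction of the mean cepstrum may be sketched as follows. Since a fixed channel is a convolution in the time domain, it appears as a constant additive term in the cepstral domain, which the subtraction of the mean removes. The cepstral vectors below are hypothetical values chosen for this example, and the preliminary whitening step is omitted.

```python
def mean_cepstrum(frames):
    """Mean cepstral vector over all frames (the channel estimate)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]

def subtract_mean_cepstrum(frames):
    """Remove the mean cepstrum from every frame."""
    mean = mean_cepstrum(frames)
    return [[c - m for c, m in zip(frame, mean)] for frame in frames]

# Hypothetical cepstra of the speech alone and of a fixed channel.
speech = [[1.0, 0.2, -0.1], [0.5, -0.3, 0.4], [0.0, 0.1, -0.3]]
channel = [0.7, -0.2, 0.05]

# The channel adds the same vector to the cepstrum of every frame.
received = [[c + h for c, h in zip(frame, channel)] for frame in speech]

normalised = subtract_mean_cepstrum(received)
print(normalised)
```

The normalised cepstra no longer depend on the channel, but the mean can be estimated only once enough frames have been received, which is the delay mentioned above.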
2.3. Another prior-art approach combines a fixed pre-equalization with an adapted equalization and is the subject of patent application FR 2822999 filed by the applicant. The device described aims to correct the timbre of the voice by combining two filters.
A fixed filter, called the pre-equalizer, compensates for the distortions of an average telephone line, defined as consisting of two average subscriber lines and transmission and reception systems complying with the nominal frequency responses defined in UIT-T, Recommendation P.48, App.I, 1988. Its frequency response on the Fc-3150 Hz band is the inverse of the global response of the analogue part of this average connection, Fc being the lower limit frequency of the equalization.
This pre-equalization is supplemented by an adapted equalizer, which adapts the correction more precisely to the actual transmission conditions. The frequency response of the adapted equalizer is given by:
$$|EQ(f)| = \frac{1}{|S\_RX(f) \cdot L\_RX(f)|}\,\sqrt{\frac{\gamma_{ref}(f)}{\gamma_{x}(f)}}, \qquad (0.3)$$
with L_RX the frequency response of the reception line, S_RX the frequency response of the reception system and γx(f) the long-term spectrum of the output x of the pre-equalizer.
The long-term spectrum is defined by the temporal mean of the short-term spectra of the successive frames of the signal; γref(f), referred to as the reference spectrum, is the mean spectrum of the speech defined by the UIT (UIT-T/P.50/App. I, 1998), taken as an approximation of the original long-term spectrum of the speaker. Because of this approximation, the frequency response of the adapted equalizer is very irregular and only its general shape is pertinent. This is why it must be smoothed. The adapted equalizer being produced in the form of an FIR filter in the time domain, this smoothing in the frequency domain is obtained by a narrow (symmetrical) windowing of the impulse response.
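By way of illustration only, the computation of the modulus of the adapted equalizer according to equation (0.3), and its smoothing by windowing of the impulse response, may be sketched as follows. It is assumed here that the long-term spectra are power spectra, so that the magnitude correction is the square root of their ratio. The Hann-shaped window, the zero-phase impulse response, the number of frequency points and all function names are assumptions of this example.

```python
import math

def equalizer_magnitude(gamma_ref, gamma_x, s_rx_mag, l_rx_mag):
    """Modulus of the adapted equalizer at each frequency point."""
    return [math.sqrt(gr / gx) / (s * l)
            for gr, gx, s, l in zip(gamma_ref, gamma_x, s_rx_mag, l_rx_mag)]

def smooth_by_windowing(mag, half_width):
    """Smooth a frequency response by windowing its impulse response.

    mag holds N uniformly spaced points over the full frequency circle
    and must be symmetric (mag[k] == mag[N - k]), so that the zero-phase
    impulse response is real and even.
    """
    n_bins = len(mag)
    two_pi = 2.0 * math.pi
    # Inverse DFT of a real, even spectrum (cosine terms only).
    h = [sum(m * math.cos(two_pi * k * n / n_bins)
             for k, m in enumerate(mag)) / n_bins
         for n in range(n_bins)]

    def window(n):
        d = min(n, n_bins - n)  # circular distance from the centre tap
        if d > half_width:
            return 0.0
        return 0.5 * (1.0 + math.cos(math.pi * d / (half_width + 1)))

    hw = [h[n] * window(n) for n in range(n_bins)]
    # Forward DFT of the windowed (real, even) impulse response.
    return [sum(hw[n] * math.cos(two_pi * k * n / n_bins)
                for n in range(n_bins))
            for k in range(n_bins)]

# Hypothetical, very irregular response alternating between 1.5 and 0.5.
irregular = [1.0 + 0.5 * (-1) ** k for k in range(16)]
smoothed = smooth_by_windowing(irregular, half_width=3)
print([round(v, 6) for v in smoothed])
```

The narrower the window, the smoother the resulting frequency response, at the cost of the finer spectral detail being lost.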
This method makes it possible to restore a timbre close to that of the original signal on the equalization band (Fc-3150 Hz), but:
    for some speakers, the approximation of their original long-term spectrum by means of the reference spectrum is very rough, so that the equalizer introduces a perceptible distortion;
    the high smoothing of the frequency response of the equalizer, made necessary by the approximation error, prevents fine spectral distortions from being corrected.