The transmission of speech, e.g. by mobile phones and IP-phones, normally involves speech coding, which is the compression of speech into a code for transmission by means of speech codecs. CELP (Code-Excited Linear Prediction) coding is a commonly used speech coding method comprising two stages, i.e. a linear predictive stage that models the spectral envelope and a code-book stage that models the residual of the linear predictive stage.
In addition to the actual speech coding of the signal, channel coding may be used for the transmission of the signal in order to avoid losses due to transmission errors, and the most important bits in the speech data stream are often coded by the more robust channel coding, in order to get the best overall coding results.
It is important to reduce noise and disturbances in order to improve the speech quality in a mobile phone. Echoes, i.e. reflections of a voice signal back to the speaking party, are a major disturbance, and the main echo source in a telephone network is the electrical reflection in the so-called hybrid circuit, caused by impedance mismatch in the 4-wire to 2-wire conversion in the local exchanges of the PSTN (Public Switched Telephone Network). Normally, this electrical echo is removed by network echo cancellers installed close to the echo source in the telephone system, e.g. in the media gateways functioning as an interface between a packet switched network, using e.g. the IP (Internet Protocol), and a circuit switched network, e.g. the PSTN, or in the Mobile Services Switching Centres functioning as an interface between mobile networks and the PSTN. Network echo cancellers are also required in international exchanges, and may be needed in national telephone exchanges having a large end-to-end transmission delay. Further, if no echo canceller is present in a telephone exchange close to the echo source, an international operator in another country may want to reduce the echo by detecting and removing the echo generated in the distant telephone exchange.
Another echo source within a mobile communication network is the acoustic crosstalk occurring inside a mobile phone or an IP-phone, caused by acoustical coupling between the microphone and loudspeaker. In order to reduce the acoustical coupling in accordance with the standard requirements, a mobile phone normally provides echo attenuation. However, even though a mobile phone provides echo attenuation according to the requirements, echo originating from acoustic crosstalk may still occur, e.g. due to large variations in the position of the mobile phone or deviations of the line levels from the nominal levels.
While a conventional network echo canceller is capable of controlling the electrical echo, an echo originating from acoustic crosstalk requires a different echo canceller. Since the signals in a mobile communication network are coded in a speech coder and then transmitted over a radio channel that introduces bit-errors, the echo path will be nonlinear and non-stationary and will introduce an unknown delay. Consequently, a conventional network echo canceller is unable to handle acoustic echoes returned from mobile phones.
Conventionally, echo control includes determination of whether a received speech signal is dominated by a component originating in the vicinity of the receiver, i.e. from a so-called near end, or by reflections, i.e. an echo, of a known speech signal originating from a distance, i.e. from a so-called far end. A reflected known speech signal from a far end, i.e. an echo, will be delayed, transformed and mixed with the speech signal and noise originating from the near end. This is illustrated schematically in FIG. 1, showing a first mobile phone 1a and a second mobile phone 1b. A first speech signal 3 is transmitted from the first mobile phone 1a and delayed and transformed in the first network path 2a, before reaching the second mobile phone 1b. However, a portion 4 of this speech signal will be reflected and returned through the second network path 2b to be received by the first mobile phone 1a as an echo of the known first speech signal 3. Thus, this echo signal, i.e. the far-end signal, received by the first mobile phone originates from the first speech signal, passing both network paths 2a, 2b.
A second speech signal 5 transmitted from the second mobile phone 1b will be added to the echo signal 4 originating from the first speech signal 3. Thus, a received speech signal 6 reaching the first mobile phone 1a will comprise both an echo signal component 4, i.e. the far end-signal, and this second speech signal component 5, i.e. the near end-signal, which is unknown to the first mobile phone 1a. A received speech signal 6 that is dominated by a near end-signal 5, and not by an echo-signal 4, may be referred to as double talk, and the determination that a speech signal is dominated by a near end-signal is hereinafter referred to as double talk-detection. The far-end component of the received signal 6 that is a reflection of the first speech signal 3 may be suppressed by an echo control device in order to reduce the disturbances and noise.
An echo control device normally estimates the characteristics of an echo path, and this estimation will be disturbed by an unknown speech signal originating from a near end. Therefore, a conventional echo control device avoids estimating the characteristics of the echo path in the presence of speech originating from a near end. Instead, the echo control device will detect the presence of near end-speech by the above-described double talk detection, and the estimation of the echo path characteristics will be inactivated or disabled during the periods when the received signal is dominated by the near end talk.
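The disabling of the echo path estimation during double talk may be illustrated by the following minimal sketch. It assumes, purely for illustration, that the echo path is estimated by an NLMS (normalized least mean squares) adaptive filter operating on decoded samples; the function name `nlms_echo_canceller` and the parameters `taps`, `step` and `eps` are hypothetical and are not taken from any particular echo control device.

```python
def nlms_echo_canceller(far_end, received, double_talk,
                        taps=8, step=0.5, eps=1e-8):
    """Illustrative NLMS echo path estimator (an assumed technique).

    far_end     -- samples of the known far-end signal
    received    -- samples of the received signal (echo plus near-end speech)
    double_talk -- one boolean per sample; True freezes the adaptation
    Returns the echo-reduced residual and the final filter weights.
    """
    weights = [0.0] * taps   # current estimate of the echo path
    history = [0.0] * taps   # most recent far-end samples, newest first
    residual = []
    for x, y, frozen in zip(far_end, received, double_talk):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        error = y - echo_estimate
        residual.append(error)
        if not frozen:
            # Adapt only when the received signal is not dominated by
            # near-end speech, so that speech does not disturb the estimate.
            norm = sum(h * h for h in history) + eps
            weights = [w + step * error * h / norm
                       for w, h in zip(weights, history)]
    return residual, weights
```

With the double talk flag set for every sample the weights stay at their initial values, whereas with the flag cleared and no near-end speech present the weights approach the actual echo path.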
The double talk detection can be performed e.g. by comparing the signal levels of the near end-component and the far end-component, such as by a Geigel detector, as described e.g. by D. L. Duttweiler in “A twelve-channel digital echo canceller”, IEEE Transactions on Communications, Vol. COM-26, No. 5, May 1978. However, the accuracy of this double talk detection is comparatively low, since it assumes that the echo signal power is always lower than a constant times the far end signal power, and double talk is declared if the signal returned from the near end has a higher short term power than this constant times the far end signal power. Thereby, the detector will miss any weak double talk condition, caused by a difference in line levels, or by the near end speaker talking with a lower voice than the far end speaker. Additionally, this constant may be difficult to determine, in particular for acoustic echo, which may be stronger than the far end signal causing it, due to amplification in the echo path.
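A level comparison of this kind may be sketched as follows. The sketch uses the commonly cited magnitude form of the Geigel test, in which the magnitude of the returned sample is compared with a constant times the largest recent far end magnitude; the values of `threshold` and `window` are assumptions chosen for illustration only.

```python
from collections import deque


def geigel_double_talk(far_end, near_end, threshold=0.5, window=128):
    """Return one boolean per sample, True where double talk is declared.

    threshold -- the constant of the level comparison (assumed value)
    window    -- number of recent far-end samples considered (assumed value)
    """
    recent_far = deque(maxlen=window)  # magnitudes of recent far-end samples
    flags = []
    for x, y in zip(far_end, near_end):
        recent_far.append(abs(x))
        # Double talk is declared when the returned signal is stronger than
        # the assumed worst-case echo of the recent far-end signal.
        flags.append(abs(y) > threshold * max(recent_far))
    return flags


far = [1.0, -1.0] * 4
echo_only = [0.25 * x for x in far]
weak_talk = [e + 0.1 for e in echo_only]    # weak near-end speech added
strong_talk = [e + 2.0 for e in echo_only]  # strong near-end speech added
print(any(geigel_double_talk(far, echo_only)))    # False
print(any(geigel_double_talk(far, weak_talk)))    # False
print(all(geigel_double_talk(far, strong_talk)))  # True
```

The second result shows the weakness noted above: near end speech that stays below the constant times the far end level is not detected.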
Alternatively, the double talk detection includes computing of the cross correlation, covariance or coherence functions of the near end-component and the far end-component, as described e.g. in the U.S. Pat. No. 6,035,034 and U.S. Pat. No. 6,766,019. This results in an improved detection performance, but requires a higher computational complexity.
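The principle of a correlation-based detection may be illustrated by the following simplified sketch, which is not the method of the cited patents. It assumes decoded sample sequences of equal length, searches a range of candidate delays for the largest normalized cross correlation between the far end signal and the received signal, and declares double talk when this peak is low; the names `max_lag` and `threshold` and their values are hypothetical.

```python
import math


def normalized_cross_correlation(far_end, received, lag):
    """Correlation coefficient between received[n] and far_end[n - lag]."""
    pairs = [(far_end[n - lag], received[n])
             for n in range(lag, len(received))]
    cross = sum(x * y for x, y in pairs)
    far_energy = sum(x * x for x, _ in pairs)
    rec_energy = sum(y * y for _, y in pairs)
    if far_energy == 0.0 or rec_energy == 0.0:
        return 0.0  # no usable signal at this lag
    return cross / math.sqrt(far_energy * rec_energy)


def is_double_talk(far_end, received, max_lag=32, threshold=0.9):
    """Declare double talk when no candidate delay gives a high correlation."""
    peak = max(abs(normalized_cross_correlation(far_end, received, lag))
               for lag in range(max_lag + 1))
    # A received signal that is a pure echo correlates strongly with the
    # far-end signal at the echo delay; near-end speech lowers the peak.
    return peak < threshold
```

The search over all candidate delays, each requiring a full pass over the analysis window, indicates why this type of detection is computationally more demanding than a simple level comparison.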
As described above, the speech signals in a mobile telecommunication network are normally transported in a coded format, and AMR (Adaptive Multi-Rate) is an example of an audio data compression scheme optimized for speech coding. AMR is commonly used to code the speech signals in GSM (Global System for Mobile Communications) and UMTS (Universal Mobile Telecommunications System) networks, and it involves link adaptation to select one of eight different bit rates based on link conditions. AMR may use different techniques, such as the above-described CELP, DTX (Discontinuous Transmission), VAD (Voice Activity Detection) or CNG (Comfort Noise Generation), and the link adaptation may select the best codec mode to meet the local radio channel and capacity requirements. In case of poor radio transmission, the channel coding will increase, which will improve the quality and robustness of the network connection, but will lead to a deteriorated voice signal.
Similarly, IP-telephony speech signals are normally coded in the sending mobile phone and transported over the network to another mobile terminal/phone, without any decoding in the network.
Thus, the network echo control will have to be applied on the coded signals, preferably by modifying the parameters in the coded bit-stream directly, without decoding the signals, and without performing a second encoding after removal of the echo, since decoding followed by coding may destroy the positive speech quality-effects of the TFO (Tandem Free Operation) and the TrFO (Transcoder Free Operation) that are normally introduced in modern telecommunication networks in order to enhance the speech quality.
An additional drawback in conventional double talk detection is that signal waveforms are needed for the computation of the detection variable, requiring decoding of the speech signal before the detection. However, the ability to work directly on the coded bit-stream is becoming increasingly important due to the use of TrFO (Transcoder Free Operation) and TFO (Tandem Free Operation) in order to enhance the speech quality, since decoding followed by coding reduces the positive speech quality-effects of the TFO and the TrFO.
Further, since network echo control normally involves double talk detection, i.e. determination that a received speech signal is dominated by a near end-signal, an improved double talk detection will improve the network echo control.
Therefore, it remains a problem to achieve an improved and accurate double talk detection that is applicable to a coded speech signal.