In some applications it is necessary to detect the presence of a possibly modified version of a known speech signal in a received signal that may consist of several speech and noise components, and to estimate the relative delay of the component of interest. Examples of such applications are echo control, network statistics collection and multi-party conference bridges.
The underlying problem is illustrated in FIG. 1. A known speech signal is delayed in a delay block 10 and is affected by an unknown transformation 12 on its way to a summation point 14. It may or may not reach the summation point (switch 16 may be open or closed). In the summation point the signal is mixed with other speech signals and noise. On its way back to the point from which the original signal was transmitted, the signal from summation point 14 is again altered by an unknown transformation 18 and a delay block 20. The problem is to detect whether a possibly modified version of the original known speech signal is present in the received signal, and if yes, to estimate its relative delay with respect to the known speech signal. This is performed by a detection and delay estimation block 22.
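The signal path of FIG. 1 can be sketched in a few lines of Python. This is only an illustrative model: the transformations 12 and 18 are unknown in practice, and representing them as plain gains (`forward_gain`, `return_gain`) is an assumption made here for the sake of the example, as are all function and parameter names.

```python
def delay(signal, n):
    """Delay a signal by n samples, zero-padding the start and keeping its length."""
    return ([0.0] * n + list(signal))[:len(signal)]


def received_signal(known, other, forward_delay, return_delay,
                    forward_gain=0.5, return_gain=0.8, switch_closed=True):
    """Hypothetical model of the path in FIG. 1.

    known  -- the known speech signal
    other  -- other speech and noise added at summation point 14
    The two gains stand in for the unknown transformations 12 and 18.
    """
    # Delay block 10 followed by transformation 12 (assumed to be a gain).
    forward = [forward_gain * s for s in delay(known, forward_delay)]
    # Switch 16: when open, the known signal never reaches the summation point.
    if not switch_closed:
        forward = [0.0] * len(forward)
    # Summation point 14.
    mixed = [f + o for f, o in zip(forward, other)]
    # Transformation 18 (assumed to be a gain) followed by delay block 20.
    return delay([return_gain * m for m in mixed], return_delay)
```

In this model the relative delay to be estimated by block 22 is the sum `forward_delay + return_delay`.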
The phenomenon of hearing delayed reflections of one's own voice is referred to as echo. In a telephone network, the main source of the echo is electrical reflection in the so-called hybrid circuit connecting the four-wire part of the network with a two-wire subscriber line. This electrical echo is commonly handled by network echo cancellers installed in the telephone system. The network echo canceller should normally be installed close to the echo source. For example, network echo cancellers are required in the media gateways interfacing packet networks (IP or ATM) to PSTN networks or Mobile Services Switching Centres interfacing mobile networks to PSTN networks. Similarly, network echo cancellers should be installed in international exchanges, and in some situations in telephone exchanges inside one country if the end-to-end transmission delay exceeds 25 ms, see [1]. In some cases the network echo canceller may, however, be missing in its proper location, i.e. in a telephone exchange close to the echo source. If this is the case, long distance calls to and from such a location suffer from echo problems. An international operator in another country may want to solve the problem for its own customers by detecting the calls with echo generated in the distant location and taking proper measures for removing the echo. To do so, it is necessary to detect the echo and estimate its delay.
Another echo source is acoustical coupling between loudspeaker and microphone of a telephone (terminal). This type of echo may be returned from e.g. mobile terminals or IP phones. Ideally the terminals should handle their own echoes in such a way that no echo is transmitted back to the system. Even though many of the terminals currently in use are able to handle their own echoes properly, there are still models that do not.
The acoustical echo problem is not easy to solve in the network, see [2], since the echo path includes speech encoders and decoders. Furthermore, in the case of mobile networks, the signals are transmitted over a radio channel that introduces bit-errors in the signal. This makes the echo path nonlinear and non-stationary and introduces an unknown delay in the echo path, so that ordinary network echo cancellers are generally not able to cope with acoustical echoes returned from mobile terminals. Again, in order to cope with the echoes one first needs to detect whether the echo is present in the call, and if yes, to estimate its delay.
Another application where this type of detection is useful is network statistics collection. A telecom operator may wish to collect various statistical data related to the quality of phone calls in its network. Some of the statistics of interest are the presence of echoes returned from terminals (e.g. mobile phones or IP phones) and the delay associated with these echoes. To accomplish this task the statistics collection unit could include a detection and delay estimation block 22 as illustrated in FIG. 1. In this example the detection and estimation results would be stored in a database for later use, as opposed to the immediate use of the results for echo control in the previous examples. The statistics stored in the database can be used to present aggregated network statistics. They can also be used for troubleshooting if customer complaints regarding speech quality are received by the operator.
Yet another application is a multi-party conference bridge, see [3]. In a multi-party bridge for a telecommunication system the incoming microphone signals from the different parties are digitally mixed and transmitted to the loudspeakers of the different parties. As an example, in a basic embodiment the incoming signals from all parties may be mixed and transmitted to all parties. For various reasons, e.g. to reduce the background noise level of the transmitted signal, some implementations of multi-party bridges only mix the incoming signals from a fixed subset of the parties. This choice is typically performed on the basis of signal level and speaker activity of the different parties, where the most recent active talkers are retained if no speaker activity is present from any other party. A further modification to the basic operation is that the microphone signal coming from a party A may be excluded from the sum transmitted back to party A. Reasons for this are that the microphone signal from party A is already present in the loudspeaker of party A (due to the side-tone in the telephone set), and that if a significant transmission delay is present in the system, the microphone signal will be perceived as an undesirable echo.
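The mixing rule described above can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions: the subset is chosen per frame from signal power alone, whereas the text also mentions speaker activity and retention of the most recent talkers, which are not modelled. The names `bridge_outputs` and `max_mixed` are hypothetical.

```python
def frame_power(frame):
    """Mean power of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def bridge_outputs(frames, max_mixed=3):
    """Mix one frame per party and return one output frame per party.

    frames    -- dict mapping party name to a list of samples (equal lengths)
    max_mixed -- size of the subset of parties that is mixed (assumed value)
    """
    # Assumption: select the loudest parties by frame power only.
    selected = sorted(frames, key=lambda p: frame_power(frames[p]),
                      reverse=True)[:max_mixed]
    frame_len = len(next(iter(frames.values())))
    outputs = {}
    for listener in frames:
        mix = [0.0] * frame_len
        for talker in selected:
            # A party's own microphone signal is excluded from what it receives.
            if talker == listener:
                continue
            for i, sample in enumerate(frames[talker]):
                mix[i] += sample
        outputs[listener] = mix
    return outputs
```

With three parties and `max_mixed=2`, the two loudest parties each receive only the other's signal, while the third party receives the sum of both.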
With the increased use of various mobile terminals (e.g. cellular phones), situations where two or more users in a conference call are in the same location will become more common. In these situations the speech from user A will also be present as input to the microphone of user B. With a significant transmission delay, this signal coming from the microphone of user B will introduce an undesirable talker echo to user A. Furthermore, the microphone signal from user A will be transmitted to the loudspeaker of user B. Due to the direct path of the voice between talker A and user B this may, with a significant transmission delay in the system, cause user B to experience a listener echo of talker A. Similarly, if both the microphone signals from users A and B are transmitted to the other parties, this signal may contain an undesirable listener echo of talker A. Hence, there is a need for detecting cross-talk between two incoming lines to a multi-party conference bridge and controlling the transmission to the respective users based on this detected cross-talk.
In this specification the component of the received signal originating from the known signal will be referred to as echo.
There are several ways of detecting echo signals. For example, one can use a set of short adaptive filters spanning the delay range of interest and an associated histogram to determine whether an echo signal is present and to estimate its delay. This solution is described in [1]. A problem with this solution is its high computational cost.
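The details of the cited solution are not reproduced here, so the following Python sketch is only one possible reading of the idea: a bank of short NLMS filters, each covering a segment of the delay range, with a histogram that counts, block by block, which delay holds the largest coefficient. The choice of NLMS, the voting rule, and all names and parameter values are assumptions made for illustration.

```python
def filter_bank_histogram(known, received, n_filters, taps, block_len,
                          mu=0.2, eps=1e-6):
    """Return a histogram of per-block delay votes (delays in samples).

    Filter k models delays k*taps .. (k+1)*taps - 1, so that
    received[n] is predicted from known[n - k*taps - j], j = 0..taps-1.
    """
    max_delay = n_filters * taps
    weights = [[0.0] * taps for _ in range(n_filters)]
    histogram = [0] * max_delay
    for n in range(max_delay - 1, len(received)):
        for k in range(n_filters):
            w = weights[k]
            x = [known[n - k * taps - j] for j in range(taps)]
            # NLMS update: step scaled by the regressor energy.
            err = received[n] - sum(wi * xi for wi, xi in zip(w, x))
            step = mu * err / (eps + sum(xi * xi for xi in x))
            for j in range(taps):
                w[j] += step * x[j]
        if (n + 1) % block_len == 0:
            # Vote for the delay whose coefficient is largest in magnitude.
            best = max(range(max_delay),
                       key=lambda d: abs(weights[d // taps][d % taps]))
            histogram[best] += 1
    return histogram
```

A caller could declare an echo present when one histogram bin collects most of the votes and take that bin as the delay estimate. The inner loops run `n_filters * taps` multiply-accumulate operations twice per sample, which illustrates why the computational cost of this kind of approach grows quickly with the delay range.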
Another known method is to correlate uplink and downlink signal power for several delays of interest. The echo can be detected based on observations of power correlation between uplink and downlink over a period of time. The echo is detected if the power correlation for a certain delay has been present over a sufficiently long period of time. If the echo is detected for several delays, the delay where the power correlation is largest is selected as the delay estimate, see [5]. A problem with this solution is its slow convergence (the power correlation must be present over a sufficiently long time to detect echo and estimate its delay reliably).
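The power correlation approach can be sketched as follows in Python. This is a simplified illustration rather than the method of [5]: it computes a single correlation coefficient per candidate delay over the whole observation window instead of tracking the correlation over time, and the frame length, the threshold of 0.7 and all function names are assumptions.

```python
import math


def frame_powers(signal, frame_len):
    """Mean power of consecutive non-overlapping frames."""
    n_frames = len(signal) // frame_len
    return [sum(s * s for s in signal[i * frame_len:(i + 1) * frame_len]) / frame_len
            for i in range(n_frames)]


def correlation(x, y):
    """Pearson correlation coefficient; 0.0 if either sequence is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def detect_echo(known, received, frame_len, max_delay_frames, threshold=0.7):
    """Return (delay_in_frames, correlation); delay is None if no echo is detected."""
    p_known = frame_powers(known, frame_len)
    p_recv = frame_powers(received, frame_len)
    best_delay, best_corr = None, -1.0
    for d in range(max_delay_frames + 1):
        n = min(len(p_known), len(p_recv) - d)
        if n < 2:
            break
        c = correlation(p_known[:n], p_recv[d:d + n])
        # Keep the delay with the largest power correlation.
        if c > best_corr:
            best_delay, best_corr = d, c
    if best_delay is None or best_corr < threshold:
        return None, best_corr
    return best_delay, best_corr
```

Because the decision rests on frame powers, the delay resolution is one frame, and a reliable correlation estimate needs many frames of speech activity, which is consistent with the slow convergence noted above.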
A common drawback of both described methods is that they cannot be applied directly to coded speech; the speech signals must first be decoded. The ability to work directly on the coded bit-stream is becoming increasingly important as Transcoder Free Operation (TrFO) and Tandem Free Operation (TFO) are being introduced in the networks.