This invention relates to a nearend speech detector and a method for classifying speech at a communication system.
In telephony, audio signals (e.g. including voice signals) are transmitted between a near-end and a far-end. Far-end signals which are received at the near-end may be outputted from a loudspeaker. A microphone at the near-end may be used to capture a near-end signal to be transmitted to the far-end. An “echo” occurs when at least some of the far-end signal outputted at the near-end is included in the microphone signal which is transmitted back to the far-end. In this sense the echo may be considered to be a reflection of the far-end signal.
An example scenario is illustrated in FIG. 1, which shows a signal being captured by a far-end microphone and output by a near-end loudspeaker. The echo is a consequence of acoustic coupling between the loudspeaker and a microphone at the near-end; the near-end microphone captures the signal originating from its own loudspeaker in addition to the voice of the near-end speaker and any near-end background noise. The result is an echo at the far-end loudspeaker. In Voice over IP (VoIP) communication systems, echoes can be especially noticeable due to the inherent delays introduced by the audio interfaces of VoIP communication devices.
In order to remove the unwanted echo from a microphone signal and recover the nearend voice signal, an estimate of the echo may be formed and cancelled from the microphone signal. Such an estimate is typically synthesised at an adaptive echo estimation filter (AEEF) from the far-end voice signal. This arrangement is shown in FIG. 2, in which an AEEF 203 forms an estimate of the echo e from farend signal x, and the echo estimate is then subtracted 204 from the microphone signal m so as to form an estimate of the true nearend signal d, from which the echo of the farend signal has been cancelled. The performance of such an echo cancellation arrangement depends on the adaptation control of the AEEF.
Under certain conditions it is necessary to freeze the coefficients of the AEEF or apply a negligible step size, for example during the presence of a nearend signal in the microphone signal. Adapting the coefficients of the AEEF during the presence of a nearend signal is likely to lead to divergence of the AEEF. A nearend speech detector (NSD) may be employed to detect the presence of nearend speech, and its output used to decide when to freeze the coefficients of the AEEF and prevent their adaptation. This preserves echo path modelling and echo cancellation stability during the presence of nearend speech. A nearend speech detector may also detect the onset of double talk (and is sometimes referred to as a double talk detector, or DTD). This is because during double talk both nearend and farend speech are present, leading to the same divergence problem if the coefficients of the AEEF are permitted to adapt. A typical arrangement of a nearend speech detector 205 with respect to an AEEF is shown in FIG. 2.
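By way of illustration only, the arrangement of FIG. 2 and the freezing of coefficients described above may be sketched as follows. The sketch assumes a normalised least mean squares (NLMS) update, which the present description does not specify. The names nlms_echo_canceller, simulate_echo, num_taps, step and freeze, and the example echo path h, are hypothetical and chosen for the example.

```python
import random


def simulate_echo(x, h):
    """Filter farend samples x through an assumed echo path h (FIR)."""
    return [
        sum(h[k] * x[n - k] for k in range(len(h)) if n - k >= 0)
        for n in range(len(x))
    ]


def nlms_echo_canceller(x, m, num_taps=8, step=0.5, freeze=None, eps=1e-8):
    """Return (d, w): the nearend estimate d and the final coefficients w.

    x      -- farend samples
    m      -- microphone samples
    freeze -- optional list of booleans; where freeze[n] is True the
              coefficients are held and not adapted at sample n
    NLMS is an assumption of this sketch, not a method taken from the text.
    """
    w = [0.0] * num_taps      # AEEF coefficients
    buf = [0.0] * num_taps    # recent farend samples, newest first
    d = []
    for n, (xn, mn) in enumerate(zip(x, m)):
        buf = [xn] + buf[:-1]
        e = sum(wk * xk for wk, xk in zip(w, buf))   # echo estimate e
        dn = mn - e                                  # nearend estimate d
        d.append(dn)
        if freeze is None or not freeze[n]:
            norm = sum(xk * xk for xk in buf) + eps
            w = [wk + step * dn * xk / norm for wk, xk in zip(w, buf)]
    return d, w


random.seed(0)
x = [random.gauss(0.0, 1.0) for _ in range(2000)]   # farend signal
h = [0.5, -0.3, 0.1]                                # assumed echo path
m = simulate_echo(x, h)                             # microphone: echo only
d, w = nlms_echo_canceller(x, m)
print([round(c, 3) for c in w[:3]])                 # learned leading taps
```

In such a sketch, the output of a nearend speech detector could be supplied as the freeze list, so that the coefficients are held whenever nearend speech is declared.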
Conventional algorithms for nearend speech detectors (NSD) use parameters of the AEEF itself to produce a binary signal used either for deciding whether the filter coefficients of the AEEF should be frozen or can be allowed to adapt, or for determining a suitable step size for the filter (e.g. in accordance with an echo to nearend signal ratio). The performance of such algorithms thus depends on the performance of the AEEF. If the AEEF has not converged, the NSD may detect echo as nearend speech, leading to a slow rate of convergence. On some platforms, the AEEF may never converge to its optimum set of coefficients due to platform non-linearity, low echo to noise ratio (ENR), etc. In such cases, the NSD may not work properly during the entire session of a voice call.
Various improvements on the conventional algorithms for nearend speech detectors have been proposed which do not depend on the parameters of an adaptive echo canceller. The Geigel DTD algorithm published by D. L. Duttweiler as “A twelve channel digital echo canceler”, IEEE Transactions on Communications, 26(5):647-653, May 1978 has proven successful in line echo cancellers. However, it does not always provide reliable performance when used in echo cancellers under different ratios of echo signal to nearend signal. Methods based on cross-correlation have also been proposed, such as V. Das et al., “A new cross correlation based double talk detection algorithm for nonlinear acoustic echo cancellation”, TENCON 2014 IEEE Region 10 Conference, pages 1-6, October 2014, as have methods based on coherence, such as T. Gansler et al., “A double-talk detector based on coherence”, IEEE Transactions on Communications, 44(11):1421-1427, November 1996. However, these approaches suffer from poor performance under non-linearity and double talk.
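For illustration, the Geigel comparison is commonly described as declaring nearend speech when the magnitude of the current microphone sample exceeds a fixed fraction of the largest recent farend magnitude. The following is a minimal sketch under that description. The function name geigel_detector, the window length and the threshold of 0.5 are assumptions made for the example, and the hangover period often used in practice is omitted.

```python
def geigel_detector(x, m, window=4, threshold=0.5):
    """Return booleans; True where nearend speech (double talk) is declared.

    x -- farend samples, m -- microphone samples.
    window and threshold are illustrative values only.
    """
    flags = []
    for n in range(len(m)):
        start = max(0, n - window + 1)
        recent_peak = max(abs(v) for v in x[start:n + 1])
        flags.append(abs(m[n]) > threshold * recent_peak)
    return flags


x = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]          # farend signal

# Weak echo (gain 0.3, one sample delay) with nearend speech at n = 4.
weak_echo = [0.0, 0.3, -0.3, 0.3, -0.3, 0.3]
nearend = [0.0, 0.0, 0.0, 0.0, 0.9, 0.0]
m = [e + s for e, s in zip(weak_echo, nearend)]
print(geigel_detector(x, m))

# Strong echo (gain 0.8) and no nearend speech at all.
strong_echo = [0.0, 0.8, -0.8, 0.8, -0.8, 0.8]
print(geigel_detector(x, strong_echo))
```

With the weak echo, only the sample containing nearend speech exceeds the threshold. With the strong echo, the echo alone exceeds the same fixed threshold, which illustrates why a fixed comparison of this kind can be unreliable when the ratio of echo signal to nearend signal varies.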
Recently, blind source separation (BSS) techniques have been proposed to perform echo cancellation during double-talk, such as Y. Sakai and M. T. Akhtar, “The performance of the acoustic echo cancellation using blind source separation to reduce double-talk interference”, 2013 International Symposium on Intelligent Signal Processing and Communications Systems (ISPACS), pages 61-66, November 2013. Similarly, M. Kanadi et al., “A variable step-size-based ICA method for a fast and robust acoustic echo cancellation system without requiring double-talk detector”, 2013 IEEE China Summit International Conference on Signal and Information Processing (ChinaSIP), pages 118-121, July 2013, proposes independent component analysis (ICA) for BSS to separate echo and nearend from the microphone signal. The separated echo is then applied to adapt the AEEF. Since these BSS methods are based on long block processing, they suffer from considerable delay in nearend speech detection and slow convergence speed. In addition, the use of techniques such as singular value decomposition (SVD) on the farend signal in order to detect periods of double-talk is computationally expensive and depends on the estimation error present in the EEF.