1. Field of the Invention
The present invention relates generally to delay estimation and signal identification between audio signals. More particularly, the present invention relates to methods and systems for voice and audio quality measurement, double talk detection, signal path delay detection, signal path delay tracking, and echo cancellation, echo control or echo suppression based on delay estimation.
2. Background Art
Subscribers use speech quality as the benchmark for assessing the overall quality of a telephone network. A key technology for providing high quality speech is the elimination of echo using echo cancellation, echo control or echo suppression techniques. Echo canceller or echo suppressor performance in a telephone network, such as a TDM or packet telephony network, has a substantial impact on the overall voice quality. Effective removal of the hybrid and acoustic echo inherent in telephone networks is key to maintaining and improving perceived voice quality during a call.
Line echoes occur in telephone networks due to impedance mismatches of network elements. Hybrid echo is the primary source of line echo generated from the public-switched telephone network (PSTN). As shown in FIG. 1, hybrid echo 110 is created by a hybrid, which connects a four-wire physical interface to a two-wire physical interface. The hybrid reflects electrical energy or audio signal back to the speaker from the four-wire physical interface.
Acoustic echo, on the other hand, is generated by analog and digital telephones or terminal audio equipment, with the degree of echo related to the type and quality of such telephones or equipment. Acoustic echoes occur in telephony terminal equipment due to electrical leakage of the terminal equipment, due to poor acoustic isolation between the microphone and speaker in the handset, or due to the reflection of the acoustic signal in the environment where the terminal equipment is located. As shown in FIG. 1, acoustic echo 120 is created by a voice coupling between the earpiece and microphone in the telephones, where sound from the speaker is picked up by the microphone, for example, by bouncing off the walls, windows, and the like. The result of this reflection is the creation of multi-path echo, which would be heard by the talker unless eliminated.
As shown in FIG. 1, in modern telephone networks, echo canceller 140 is typically positioned between hybrid 130 and network 150. Generally speaking, the echo cancellation process involves two steps. First, as the call is set up, echo canceller 140 employs a digital adaptive filter to adapt to the far-end signal and create a model based on the far-end signal before it passes through hybrid 130. After the near-end signal, including the echo signal, passes through hybrid 130, echo canceller 140 subtracts the far-end model from the near-end signal to cancel hybrid echo and generate an error signal. Although this echo cancellation process removes a substantial amount of the echo, non-linear components of the echo may still remain. To cancel non-linear components of the echo, the second step of the echo cancellation process utilizes a non-linear processor (NLP) to eliminate the remaining or residual echo by attenuating the signal below the noise floor.
Viewed in more detail, the echo cancellation process involves three functions, namely, filter adaptation, non-linear processing, and double talk detection.
The echo canceller employs a digital adaptive filter to model the hybrid that generates the line echo. This modeling takes the form of gradual adaptation, the dynamics of which are controlled by the outcome of the double talk detection logic. The echo canceller adapts its filter so as to mimic the action of the hybrid on the far end signal, and thereby regenerates the echo. This regenerated echo is subtracted from the received mixture of near end signal and echo. The output of this operation is the echo-removed signal, which is also the error signal used to adapt the adaptive filter. The filter is adapted such that the error becomes as small as possible over time.
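By way of illustration only, the adaptation loop described above may be sketched as follows. The text does not name a particular adaptation algorithm; the normalized least-mean-squares (NLMS) update used here is an assumption, as are the function name `nlms_echo_cancel`, its parameters, and the toy hybrid (a pure three-sample delay with a gain of 0.5).

```python
import random

def nlms_echo_cancel(far_end, received, num_taps=8, step=0.5, eps=1e-8):
    """Run an NLMS adaptive filter; return (error_signal, final_taps).

    far_end  -- samples sent toward the hybrid
    received -- samples coming back (echo, plus any near end signal)
    """
    taps = [0.0] * num_taps
    history = [0.0] * num_taps  # recent far end samples, newest first
    errors = []
    for x, d in zip(far_end, received):
        history = [x] + history[:-1]
        # Regenerate the echo from the current filter model.
        echo_estimate = sum(w * h for w, h in zip(taps, history))
        # Echo-removed signal; it is also the error that drives adaptation.
        e = d - echo_estimate
        energy = sum(h * h for h in history) + eps
        taps = [w + step * e * h / energy for w, h in zip(taps, history)]
        errors.append(e)
    return errors, taps

# Hypothetical echo path: far end signal delayed by 3 samples, scaled by 0.5.
rng = random.Random(0)
far = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
rx = [0.5 * far[n - 3] if n >= 3 else 0.0 for n in range(len(far))]
err, taps = nlms_echo_cancel(far, rx)
```

In this sketch the dominant tap settles at index 3, matching the assumed echo path. A practical canceller would additionally freeze the tap update whenever double talk is declared.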
Although this echo cancellation process removes a substantial amount of the echo, some residual echo may still remain due to the non-linear component of the hybrid or due to error between the echo canceller filter modeling the line hybrid and the actual hybrid. To cancel this residual echo, the second step of the echo cancellation process utilizes a non-linear processor (NLP) to eliminate the remaining or residual echo by attenuating the signal below the noise floor. The NLP logic is applied to the echo-removed signal in the absence of double talk.
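As a minimal sketch of the non-linear processing step, the following example uses a simple center clipper, which is one common form of NLP and is assumed here rather than taken from the text. The function name `apply_nlp`, the per-sample double talk flags, and the clip level are hypothetical.

```python
def apply_nlp(residual, double_talk_flags, clip_level=0.02):
    """Suppress low-level residual echo, except during double talk.

    residual          -- echo-removed samples from the adaptive filter stage
    double_talk_flags -- one boolean per sample, True when double talk is declared
    clip_level        -- assumed amplitude below which a sample is treated as residual echo
    """
    out = []
    for sample, double_talk in zip(residual, double_talk_flags):
        if not double_talk and abs(sample) < clip_level:
            out.append(0.0)      # attenuate the residual echo
        else:
            out.append(sample)   # pass near end speech through unchanged
    return out

cleaned = apply_nlp([0.01, 0.5, 0.01], [False, False, True])
```

The third sample is left untouched because double talk is flagged for it, which reflects the requirement that the NLP be applied only in the absence of double talk.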
An echo suppressor operates in a fashion similar to an echo canceller, except that echo suppressors and echo controllers do not utilize an adaptive filter. Rather, they utilize only the NLP for echo removal. The terms echo suppression, echo control and echo reduction may be used interchangeably in the art.
Echo cancellers employ different techniques to cover a tail length. The term tail length refers to the time window within which the echo of a signal on the outgoing port may come back on the incoming port. The tail length determines the length and the nature of the adaptive filters that may be used in echo cancellers. One example technique to cover the tail length uses a SPARSE filter.
SPARSE echo cancellers employ adaptive filter algorithms with a dynamically positioned window to cover a desired echo tail length, such as a sliding window, e.g. a 24 ms window, covering an echo path delay, e.g. a 128 ms delay. To properly cancel the echo, the echo canceller must determine the delay, which indicates the location of the echo signal segment or window within the 128 ms echo path delay. If the delay is not determined accurately, not only is the echo signal not properly cancelled, but the echo canceller also further distorts the signal by performing the echo cancellation at the wrong place. Therefore, it is crucial that the delay be determined accurately.
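The dependence of the window position on the delay estimate may be illustrated as follows. This is a sketch under stated assumptions: the 8 kHz sampling rate, the 4 ms lead margin, and the function name `position_window` are hypothetical, while the 24 ms window and 128 ms tail are the example values given above.

```python
def position_window(delay_ms, window_ms=24, tail_ms=128, lead_ms=4, rate_hz=8000):
    """Return (start_sample, end_sample) of the filter window within the tail.

    The window starts slightly before the estimated delay (lead_ms, an assumed
    margin) and is clamped so that it stays inside the tail.
    """
    start_ms = min(max(delay_ms - lead_ms, 0), tail_ms - window_ms)
    samples_per_ms = rate_hz // 1000
    return start_ms * samples_per_ms, (start_ms + window_ms) * samples_per_ms

window = position_window(60)
```

An error in `delay_ms` larger than the window length places the filter over a segment that contains no echo, which is the failure mode described above.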
Another technique used by echo cancellers for covering the tail length is known as selective update. In this approach, different segments of the filter, which model the echo generation process, are adapted differently. In this scenario, knowledge of the delay can guide the echo canceller (full or sparse) to accurately select the taps (or filter coefficients) that require special attention or a selective update scheme.
As discussed above, the role of a double talk detector is of prime importance in the operation of an echo canceller. Because the line echo canceller is utilized to cancel an echo of Rin′ signal 141 from Sin signal 132, the presence of a speech signal from the near end would cause the adaptive filter to converge on a combination of the near end speech signal and Rin′ signal 141, which would lead to an inaccurate echo path model, i.e. incorrect adaptive filter coefficients. Therefore, in order to cancel the echo signal, the adaptive filter should not train or update in the presence of the near end speech signal. To this end, conventional echo cancellers analyze Sin signal 132 and determine whether it contains the speech of a near end talker. By convention, if two people are talking over a communication network or system, one person is referred to as the “near talker,” while the other person is referred to as the “far talker.” The combination of speech signals from the near end talker and the far end talker is referred to as “double talk.” To determine whether Sin signal 132 contains double talk, a double talk detector estimates and compares the characteristics of Rin′ signal 141 and Sin signal 132. An estimate of the delay is among the most important pieces of information that a double talk detector can use to function accurately. A purpose of the double talk detector is to prevent the adaptive filter from adapting when double talk is detected and to deactivate the operation of the NLP in the presence of the near end speaker.
If the double talk detector fails to detect an existing double talk condition, the adaptive filter improperly trains on a signal that includes a near end signal, and the adaptive filter will not accurately model the echo signal. Conversely, if the double talk detector declares double talk when none exists, the adaptive filter does not train on Rin′ signal 141, and the adaptive filter again will not accurately model the echo signal.
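One simple way of comparing the two signals, given purely as an example, is a Geigel-style level comparison. The text does not specify which characteristics are compared, so this choice is an assumption, as are the names `geigel_double_talk`, `rin`, `sin`, the window length, and the 0.5 threshold (which corresponds to assuming an echo return loss of at least about 6 dB).

```python
from collections import deque

def geigel_double_talk(rin, sin, window=64, threshold=0.5):
    """Return one boolean per sample: True when double talk is declared.

    Double talk is declared when the near end sample magnitude exceeds
    threshold times the largest far end magnitude seen within the last
    `window` samples. The window should span the maximum echo delay.
    """
    recent = deque([0.0] * window, maxlen=window)
    flags = []
    for x, y in zip(rin, sin):
        recent.append(abs(x))
        flags.append(abs(y) > threshold * max(recent))
    return flags

# Hypothetical signals: echo only for 5 samples, then near end speech on top.
flags = geigel_double_talk([1.0] * 10, [0.3] * 5 + [0.9] * 5)
```

The flags produced by such a detector would gate both the filter update and the NLP. If the window is shorter than the actual echo delay, the comparison is made against the wrong far end samples, which is why an accurate delay estimate matters to the detector.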
Furthermore, handsets or telephone equipment typically include an acoustic echo canceller to cancel acoustic echo 120. However, to further control, eliminate or suppress acoustic echo, acoustic echo controller 145 may be used at a central office or base station. For example, acoustic echo controller 145 is utilized to suppress the acoustic echo that is generated by the far end handset. To this end, acoustic echo controller 145 estimates the delay and suppresses the acoustic echo of Sout signal 142 from Rin signal 146 received from the far end (not shown). As stated above, it is crucial that the delay be determined accurately. In fact, conventional systems determine the delay for the acoustic echo even less accurately than for the line echo, due to the presence of a greater number of non-linear components in the acoustic echo path.
Today, there are a number of approaches to delay estimation and double talk detection. For example, one conventional approach utilizes an energy-based method, which is based on the assumption that the ERL (echo return loss) is bounded. The energies of the outgoing and incoming signals are computed and kept in a history buffer. When the incoming signal energy is below the outgoing signal energy by at least the ERL, the signal has the potential of being an echo. Some techniques immediately declare this signal an echo, while others perform further analysis, such as averaging, and the like. Once a signal is declared an echo, the echo canceller filter adaptation logic is activated. Although such energy-based techniques are simple, they are prone to errors at the onset and offset of speech bursts or when a large dynamic range exists between the two talkers.
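The energy-based test may be sketched as follows. The frame-based processing, the 6 dB ERL bound, the history length, and the names `frame_energy_db` and `potential_echo_flags` are all assumptions made for illustration.

```python
import math
from collections import deque

def frame_energy_db(frame, floor=1e-12):
    """Mean-square energy of a frame in dB; `floor` avoids log of zero."""
    return 10.0 * math.log10(sum(s * s for s in frame) / len(frame) + floor)

def potential_echo_flags(out_frames, in_frames, erl_db=6.0, history_len=16):
    """Flag incoming frames whose energy is at least erl_db below the
    strongest outgoing frame held in the history buffer."""
    history = deque(maxlen=history_len)  # recent outgoing frame energies
    flags = []
    for out_frame, in_frame in zip(out_frames, in_frames):
        history.append(frame_energy_db(out_frame))
        flags.append(frame_energy_db(in_frame) <= max(history) - erl_db)
    return flags

# Hypothetical frames: outgoing at full level, incoming at three different levels.
outgoing = [[1.0] * 4] * 3
incoming = [[0.25] * 4, [0.9] * 4, [0.1] * 4]
echo_flags = potential_echo_flags(outgoing, incoming)
```

Only the second incoming frame is too strong to be an echo under the assumed ERL bound. A quiet near end talker would be flagged as an echo by the same test, which illustrates the dynamic range weakness noted above.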
Another conventional method is correlation-based, where one measures the cross correlation of time samples (full band or sub-band) between the outgoing and incoming signals. Although such methods are more reliable than energy-based methods when the echo and distortions are linear, correlation-based methods suffer from being computationally expensive and from requiring a large memory to retain a sample history of the outgoing signal for the maximum possible delay length.
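A full band version of the correlation search may be sketched as follows. The function name `estimate_delay`, the unnormalized correlation score, and the synthetic signals (a 37-sample delay with a gain of 0.4) are assumptions for illustration.

```python
import random

def estimate_delay(outgoing, incoming, max_delay):
    """Return the lag (in samples) that maximizes the cross correlation
    between the incoming signal and the delayed outgoing signal.

    Both sequences are assumed to have the same length.
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        score = sum(incoming[n] * outgoing[n - lag]
                    for n in range(lag, len(incoming)))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

# Hypothetical linear echo: outgoing signal delayed by 37 samples, scaled by 0.4.
rng = random.Random(1)
out = [rng.uniform(-1.0, 1.0) for _ in range(500)]
inc = [0.4 * out[n - 37] if n >= 37 else 0.0 for n in range(len(out))]
lag = estimate_delay(out, inc, max_delay=100)
```

The nested loop costs on the order of the signal length multiplied by `max_delay`, and the whole outgoing history up to `max_delay` must be retained, which corresponds to the computational and memory burden described above.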
A further conventional approach is a statistical-based method, where one measures the cross correlation of time samples (full band or sub-band) between the outgoing and incoming signal. Like the previous approach, the statistical-based method offers reliability when echo and distortions are linear; however, it suffers from the same problems as the correlation-based method, described above.
Some conventional systems may utilize a closed-loop method, where the information provided by the adaptive filter is used to estimate the delay. In other words, if there exists a double talk detector (as described above) at the front end, it is possible for the adaptive filter to reach a level of convergence where the shape of the adaptive filter, i.e. the location of its dominant peaks, is an indication of the delay(s). Such estimated delay information is in turn used by the double talk detector logic to improve the ERL estimate. This approach has several drawbacks, including the requirements of using an adaptive filter and a large buffer for retaining the sample history of the outgoing signal for the maximum possible delay length.
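Reading the delay from a converged filter may be illustrated as follows. The function name `delay_from_taps`, the 8 kHz sampling rate, and the example coefficients are hypothetical, and only the single largest peak is considered.

```python
def delay_from_taps(taps, sample_rate_hz=8000):
    """Return (peak_index, delay_ms) for the largest-magnitude filter tap."""
    peak = max(range(len(taps)), key=lambda i: abs(taps[i]))
    return peak, 1000.0 * peak / sample_rate_hz

# Hypothetical converged coefficients with a dominant peak at index 3.
peak_index, delay_ms = delay_from_taps([0.0, 0.01, -0.02, 0.5, 0.1])
```

The result is meaningful only once the filter has converged, so the estimate is unavailable or unreliable exactly when the double talk detector most needs it, namely at call start and after an echo path change.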
Further, the aforementioned conventional techniques are based on an assumption of linearity of the echo generation process, that is, a linear relation between the original signal and its echo. These conventional techniques of delay estimation fail or produce inaccurate results when the level of non-linear components in the network increases (e.g. non-linear processing of signals via voice compression, non-linear gains of amplifiers in terminal audio equipment, and the like).
Accordingly, conventional methods for estimating the delay and detecting the double talk condition suffer from many disadvantages, and there is a need in the art for methods and systems to more accurately estimate the delay and/or detect the double talk condition.