Voice quality is critical to the success of voice-over-x (e.g., Voice-Over-IP) systems, which has led to complex, digital signal processor (DSP) intensive, voice processing solutions. For the so-called new public network to be successful in large-scale voice deployment, it must meet or exceed the voice quality standards set by today's time division multiplex (TDM) network. These systems require a combination of virtually all known single source voice processing algorithms, which include but are not limited to the following: echo cancellation, adaptive level control, noise reduction, voice encoders and decoders (or codecs), acoustic coupling elimination and non-linear processing, voice activity detectors, double talk detection, signaling detection-relay-and-regeneration, silence suppression, discontinuous transmission, comfort noise generation and noise substitution, lost packet substitution/reconstruction, and buffer and jitter control. The current generation of voice solutions for packet networks has addressed this complex need by obtaining and plugging together separate voice sub-systems.
Suppliers of these systems have concentrated their efforts on obtaining and creating each of the various blocks and making the blocks work together from an input-output perspective. During the integration process, each of the functions has effectively been treated as a black box. As a result, the sub-systems have been optimized only with regard to their individual functions and not with respect to the complete system. This has led to an overall sub-optimal design. The resulting systems have reduced voice quality and require more processing power than an integrated approach that has been optimized from a system perspective.
FIG. 1 shows a typical “black box” block diagram. The following abbreviations are used in FIG. 1: NR: noise reduction; ALC: automatic level control; ENC: speech encoder; FE: far end speaker; EC: echo canceller; SS: silence suppressor; NS: noise substitution; DEC: speech decoder; and NE: near end speaker. As shown, a transmitted voice signal 102 is processed by the echo canceller, and the pulse code modulated (PCM) output of the canceller is simply forwarded to the optional noise reduction unit, then on to the automatic level control unit, then on to the codec, and so on. A similar path is provided for received voice signals 104.
The problem with this method of simply plugging together DSP boxes is that it does not take into account the interactions of the elements within the boxes. FIG. 2 shows some of the individual elements within the subsystems in the voice-over-x DSP system of FIG. 1. The problem can be illustrated by examples; two interactions among subsystem elements that can lead to sub-optimal voice quality are examined here.
In typical fashion, a non-linear processor (NLP) is included within the echo cancellation block. The NLP is a post-processor that eliminates the small amount of residual echo that is always present after the linear subtraction of the echo estimate. One artifact of the NLP is that it can distort background noise signals. Also shown in FIG. 2 are some of the components inside the noise reduction (NR) block. The NR sub-system must generate a background noise estimate. If the NR block is not aware of the distortion introduced by the NLP, it will improperly identify the background noise, resulting in lower performance. As is also known in the art, there is a background noise estimate function within the speech coder subsystem. This estimate is sent to the far end voice-over-x system when the near end speaker is silent. Both the NLP and the NR block would also adversely affect this noise estimate if their actions were not taken into account.
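The effect of an NLP on a downstream background noise estimate can be illustrated with a minimal sketch. The sketch assumes, purely for illustration, that the NLP is a simple center clipper and that the noise estimate is a plain RMS measurement; the function names, the clipping threshold, and the noise level are hypothetical and are not taken from any particular system.

```python
import math
import random


def center_clip(samples, threshold):
    """Hypothetical NLP: zero every sample whose magnitude is below threshold."""
    return [s if abs(s) >= threshold else 0.0 for s in samples]


def rms(samples):
    """Plain RMS level, standing in for a background noise estimate."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def noise_estimates(seed=0, n=8000, sigma=0.01, threshold=0.015):
    """Return the noise RMS measured before and after the assumed NLP."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    noise = [rng.gauss(0.0, sigma) for _ in range(n)]
    before = rms(noise)
    after = rms(center_clip(noise, threshold))
    return before, after


if __name__ == "__main__":
    before, after = noise_estimates()
    print(f"noise RMS before NLP: {before:.5f}")
    print(f"noise RMS after NLP:  {after:.5f}")
```

Under these assumptions, a noise estimator placed after the clipper reports a lower level than one placed before it, because the clipper removes the low-amplitude samples that make up most of the noise. An NR block that is unaware of the clipper would therefore work from a biased estimate.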
Another interaction problem can occur with the voice activity detectors (VAD) shown in FIG. 2. The goal of the VAD is to accurately detect the presence of either NE or FE speech. If speech is present, then the associated processing of the ALC, NR, or speech coder is performed. The echo canceller's double talk detector (DTD) is another form of VAD. It must detect both NE and FE speech and control the canceller so that it only adapts when NE speech is absent. Upstream elements such as the NLP or NR, or gain changes in the ALC, can negatively affect the accuracy of a downstream VAD. For example, losses in the NLP or NR subsystems may cause the speech encoder to misinterpret voice as silence. This would cause the codec to clip the NE speech, which would degrade voice quality. Similar issues exist with regard to the VAD in the ALC block.
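The way an upstream loss can mislead a downstream VAD can be shown with a short sketch. It assumes a fixed-threshold, energy-based VAD, which is only one of many possible detector designs; the threshold of -30 dB, the 12 dB loss, the test tone, and all function names are hypothetical values chosen for illustration.

```python
import math


def frame_level_db(frame):
    """Return the frame's mean-square level in dB relative to full scale (1.0)."""
    mean_sq = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(mean_sq) if mean_sq > 0 else float("-inf")


def is_speech(frame, threshold_db=-30.0):
    """Hypothetical energy VAD: declare speech when the level exceeds a fixed threshold."""
    return frame_level_db(frame) > threshold_db


def apply_loss(frame, loss_db):
    """Model an upstream loss (for example in an NLP or NR stage) as a flat attenuation."""
    gain = 10.0 ** (-loss_db / 20.0)
    return [s * gain for s in frame]


def make_tone(amplitude=0.1, freq_hz=200.0, rate_hz=8000, n=160):
    """A 20 ms tone at 8 kHz sampling, standing in for a low-level voiced frame."""
    return [amplitude * math.sin(2.0 * math.pi * freq_hz * i / rate_hz)
            for i in range(n)]


if __name__ == "__main__":
    frame = make_tone()
    attenuated = apply_loss(frame, 12.0)
    print("VAD on original frame:  ", is_speech(frame))
    print("VAD on attenuated frame:", is_speech(attenuated))
```

In this sketch the same frame is classified as speech before the loss and as silence after it, since the detector's threshold was set without knowledge of the upstream attenuation. A codec relying on such a decision would suppress the frame, which corresponds to the clipping of NE speech described above.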
Thus, a need exists for an improved voice processing system that does not suffer from the interactive shortcomings of prior art solutions.