The present invention is principally related to voice processing systems and, in particular, to a next generation voice processing system (NGVPS) designed specifically for voice-over-x systems and a wider class of voice processing applications.
Voice quality is critical to the success of voice-over-x (e.g., Voice-Over-IP) systems, which has led to complex, digital signal processor (DSP) intensive, voice processing solutions. For the so-called new public network to be successful in large-scale voice deployment, it must meet or exceed the voice quality standards set by today's time division multiplex (TDM) network. These systems require a combination of virtually all known single source voice processing algorithms, which include but are not limited to the following: echo cancellation, adaptive level control, noise reduction, voice encoders and decoders (or codecs), acoustic coupling elimination and non-linear processing, voice activity detectors, double talk detection, signaling detection-relay-and-regeneration, silence suppression, discontinuous transmission, comfort noise generation and noise substitution, lost packet substitution/reconstruction, and buffer and jitter control. The current generation of voice solutions for packet networks has addressed this complex need by obtaining and plugging together separate voice sub-systems. Suppliers of these systems have concentrated their efforts on obtaining and creating each of the various blocks and making the blocks work together from an input-output perspective. During the integration process, each of the functions has effectively been treated as a black box. As a result, the sub-systems have been optimized only with regard to their own function and not with respect to the complete system. This has led to an overall sub-optimal design. The resulting systems have reduced voice quality and require more processing power than an integrated approach that has been optimized from a system perspective.
FIG. 1 shows a typical "black box" block diagram. The following abbreviations are used in FIG. 1: NR: noise reduction; ALC: automatic level control; ENC: speech encoder; FE: far end speaker; EC: echo canceller; SS: silence suppressor; NS: noise substitution; DEC: speech decoder; and NE: near end speaker. As shown, a transmitted voice signal 102 is processed by the echo canceller, and the pulse code modulated (PCM) output of the canceller is simply forwarded to the optional noise reduction unit, and then on to the auto level control unit, and then on to the codec, etc. A similar path is provided for received voice signals 104.
The problem with this method of simply plugging together DSP boxes is that it does not take into account the interactions of the elements within the boxes. FIG. 2 shows some of the individual elements within the subsystems in the voice-over-x DSP system of FIG. 1. A feel for the problem can be obtained from some examples; a couple of the subsystem elements that can lead to sub-optimal voice quality are examined here.
In typical fashion, a non-linear processor (NLP) is included within the echo cancellation block. The NLP is a post-processor that eliminates the small amount of residual echo that is always present after the linear subtraction of the echo estimate. One artifact of the NLP is that it can distort background noise signals. Also shown in FIG. 2 are some of the components inside the noise reduction (NR) block. The NR sub-system must generate a background noise estimate. If the NR block is not aware of the distortion introduced by the NLP, it will misidentify the background noise, resulting in lower performance. As also known in the art, there is a background noise estimate function within the speech coder subsystem. This estimate is sent to the far end voice-over-x system when the near end speaker is silent. Both the NLP and the NR block would also adversely affect this noise estimate if their actions were not taken into account.
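The interaction described above can be illustrated with a minimal sketch in which the noise estimator is told when the NLP is active and simply holds its estimate during those frames. This is an illustration under stated assumptions, not the method of the invention: the class name `NoiseFloorEstimator`, the `nlp_active` flag, the exponential smoothing rule, and the default `alpha` value are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NoiseFloorEstimator:
    """Exponentially smoothed background-noise power estimate.

    Hypothetical illustration: all names and the smoothing rule are assumptions.
    """
    alpha: float = 0.9          # smoothing factor (assumed value)
    estimate: float = 0.0       # current noise power estimate
    initialized: bool = False

    def update(self, frame, nlp_active, speech_present=False):
        """Update from one frame of PCM samples unless the frame is unreliable.

        Frames processed while the NLP is engaged may carry distorted
        background noise, so the estimate is held instead of updated.
        """
        if nlp_active or speech_present or not frame:
            return self.estimate
        power = sum(s * s for s in frame) / len(frame)
        if not self.initialized:
            self.estimate = power
            self.initialized = True
        else:
            self.estimate = self.alpha * self.estimate + (1.0 - self.alpha) * power
        return self.estimate
```

A black-box NR block would have no `nlp_active` input and would fold the NLP-distorted frames into its estimate; exposing the flag across the block boundary is the kind of inter-block signal the text has in mind.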
Another interaction problem can occur with the voice activity detectors (VAD) shown in FIG. 2. The goal of the VAD is to accurately detect the presence of either NE or FE speech. If speech is present, then the associated processing of the ALC, NR, or speech coder is performed. The echo canceller's double talk detector (DTD) is another form of VAD. It must detect both NE and FE speech and control the canceller so that it only adapts when NE speech is absent. Interaction between the elements such as the NLP, NR, or changes in the ALC can negatively affect the accuracy of the downstream VAD. For example, losses in the NLP or NR subsystems may falsely trigger the speech encoder to misinterpret voice as silence. This would cause the codec to clip the NE speech, which would degrade voice quality. Similar issues exist with regard to the VAD in the ALC block.
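The role of the double talk detector can be sketched with the classical Geigel detector, which declares near end speech when the near end sample magnitude exceeds a fraction of the recent far end peak. The source does not name a particular DTD algorithm; the Geigel rule is substituted here purely for illustration, and the class name, the `window`, `threshold`, and `hangover` parameters, and the helper `may_adapt` are hypothetical.

```python
from collections import deque


class GeigelDoubleTalkDetector:
    """Geigel-style double talk detector (illustrative stand-in, not from the source)."""

    def __init__(self, window=128, threshold=0.5, hangover=0):
        self.far_history = deque(maxlen=window)  # recent far end magnitudes
        self.threshold = threshold               # assumed echo-path loss factor
        self.hangover = hangover                 # samples to hold a detection
        self._hold = 0

    def step(self, far_sample, near_sample):
        """Return True if near end speech (double talk) is declared."""
        self.far_history.append(abs(far_sample))
        far_peak = max(self.far_history)
        if abs(near_sample) > self.threshold * far_peak:
            self._hold = self.hangover
            return True
        if self._hold > 0:
            self._hold -= 1
            return True
        return False


def may_adapt(dtd, far_sample, near_sample):
    """The canceller adapts only when NE speech is not detected."""
    return not dtd.step(far_sample, near_sample)
```

Because the decision depends on signal levels, any upstream gain change or loss (for example from the ALC, NLP, or NR) that is not communicated to the detector shifts its effective threshold, which is the accuracy problem described above.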
Thus, a need exists for an improved voice processing system that does not suffer from the interactive shortcomings of prior art solutions.
The present invention provides a next-generation voice processing system (NGVPS) designed with the overall system in mind. Each voice-processing block has been opened up revealing common functions and inter-block dependencies. By opening up these blocks, the NGVPS also enhances the functionality of some functions by using processing and signals that were previously only available to a single block. By taking into account the interaction of these various sub-systems and elements, the NGVPS provides the best overall voice performance. This holistic approach provides new means for optimizing voice processing from an end-to-end systems approach. This will be an important factor in the success of the new network.
A more system-wide optimization approach is described herein. This approach takes into account the interaction of the various sub-systems and elements to provide the best overall voice performance. For the so-called new public network to be successful in large-scale voice deployment, it must meet and should exceed the voice quality standards set by today's TDM network. Therefore, optimizing voice processing from an end-to-end systems approach is a critical success factor in new network design.
The system-wide, integrated voice processing approach of the present invention also creates opportunities for further enhancements by reordering the sub-blocks that make up the various blocks. For example, work has been conducted in the past on sub-band NLPs for echo cancellers. However, the significant processing required to create the sub-bands has typically outweighed the performance improvements. An NR system, on the other hand, typically divides the signal into sub-bands in order to perform its operations. Opening up these blocks facilitates a system in which the EC's NLP can be moved to the sub-band part of the NR system. Thus, the performance improvement may be gained with very little additional processing.
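The sharing of the sub-band analysis can be sketched as follows: the frame is split into bands once, and both an NR gain and an NLP gain are applied per band before a single synthesis step. This is a minimal sketch under stated assumptions. A naive DFT stands in for whatever sub-band decomposition an actual NR system uses, and the gain rules, the function names, and the `floor`, `suppress`, and `margin` parameters are hypothetical.

```python
import cmath


def dft(frame):
    """Naive DFT, used here only as a stand-in sub-band analysis."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(bins):
    """Naive inverse DFT (synthesis); returns the real part of each sample."""
    n = len(bins)
    return [(sum(bins[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def nr_gain(band_power, noise_power, floor=0.1):
    """Spectral-subtraction style gain with a floor (illustrative rule)."""
    if band_power <= 0.0:
        return floor
    return max(floor, 1.0 - noise_power / band_power)


def nlp_gain(band_power, residual_echo_power, suppress=0.0, margin=2.0):
    """Suppress a band whose power is within `margin` of the expected residual echo."""
    return suppress if band_power <= margin * residual_echo_power else 1.0


def process_frame(frame, noise_power, residual_echo_power):
    """Apply NR and sub-band NLP using one shared analysis and one synthesis."""
    bins = dft(frame)  # the only analysis step; both gains reuse it
    shaped = []
    for k, b in enumerate(bins):
        p = abs(b) ** 2
        g = nr_gain(p, noise_power[k]) * nlp_gain(p, residual_echo_power[k])
        shaped.append(g * b)
    return idft(shaped)
```

In this arrangement the NLP adds only one comparison and one multiplication per band, since the costly analysis and synthesis steps are already paid for by the NR system.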
The new public network concept, which is based on packet voice, requires this type of processing at each point of entry and departure from the network. Establishing a more integrated system, having the best performing processing elements at these points, is one of the objectives of the present invention. The present invention may be applicable to voice band enhancement products or voice-over-x products. Additional applications that could benefit from the present invention include any other products carrying out voice processing.