1. Field
This disclosure relates generally to a communication system and, more specifically, to techniques for comfort noise generation in a communication system.
2. Related Art
The process of distinguishing conversational speech from silence, music, noise, or other non-speech signals is generally known as voice activity detection (VAD). VAD may be implemented in a communication system using various speech processing algorithms that facilitate detection of speech. VAD may also indicate whether speech is voiced, unvoiced, or sustained. In general, known VAD algorithms trade off delay, sensitivity, accuracy, and computational cost. To detect voice, a VAD algorithm usually extracts measured features from an input signal and compares values associated with the features with predetermined thresholds. When VAD is employed with non-stationary noise, a time-varying threshold (calculated during voice-inactive segments) is usually employed. VAD algorithms usually formulate decision rules on a frame-by-frame basis using instantaneous measures of divergence distance between speech and noise. The measures used in VAD algorithms may include spectral slope, correlation coefficients, log-likelihood ratio, cepstral, weighted cepstral, and modified distance measures.
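As an illustration only, the frame-by-frame decision rule with a time-varying threshold can be sketched as follows. The sketch uses frame energy as the single measured feature, which is a simplification (the measures listed above are not implemented). The function names and the values of `margin_db` and `alpha` are assumptions made for this example, as is the assumption that the first frame contains no speech.

```python
import math


def frame_energy_db(frame):
    """Mean-square energy of one frame in dB, floored to avoid log(0)."""
    power = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))


def energy_vad(frames, margin_db=6.0, alpha=0.9):
    """Return one boolean per frame: True where the frame is judged to be speech.

    margin_db and alpha are illustrative values, not taken from the text.
    The first frame is assumed (hypothetically) to be voice-inactive and
    seeds the noise estimate.
    """
    if not frames:
        return []
    noise_db = frame_energy_db(frames[0])
    decisions = []
    for frame in frames:
        energy_db = frame_energy_db(frame)
        # Time-varying threshold: noise estimate plus a fixed margin.
        is_speech = energy_db > noise_db + margin_db
        if not is_speech:
            # The noise estimate is updated only during voice-inactive frames.
            noise_db = alpha * noise_db + (1.0 - alpha) * energy_db
        decisions.append(is_speech)
    return decisions
```

Updating the noise estimate only when a frame is judged inactive keeps speech energy from raising the threshold, at the cost of adapting slowly if the noise level jumps by more than the margin.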
Most modern telephone systems (such as wireless and voice over Internet protocol (VoIP) systems) use VAD as a form of squelching, such that low-level signals are ignored. In digital transmissions, ignoring low-level signals conserves bandwidth of a communication channel by discontinuing transmission when a signal level is below a threshold. When a telephony customer detects silence, especially for a prolonged time period, the customer may believe that a transmission has been dropped and hang up prematurely. In order to prevent premature hang-up, comfort noise has been added (e.g., at a receiver-end in wireless and VoIP systems) between voice transmissions. The generated comfort noise has usually been at a relatively low audible level, and has typically varied based on an average of a received signal.
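A minimal sketch of receiver-end comfort noise of this kind is given below, assuming that frames for which transmission was discontinued arrive as `None`. The function name, the `gain` and `alpha` values, and the use of white Gaussian noise are assumptions made for illustration and are not taken from any particular system.

```python
import math
import random


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def fill_gaps_with_comfort_noise(frames, frame_len, gain=0.1, alpha=0.9, seed=0):
    """Replace untransmitted frames (None) with low-level noise.

    The noise level follows a running average of the received signal level,
    scaled by `gain` so that it stays relatively quiet. All parameter values
    are illustrative assumptions.
    """
    rng = random.Random(seed)
    avg_level = 0.0
    out = []
    for frame in frames:
        if frame is not None:
            # Received frame: pass it through and update the running average.
            avg_level = alpha * avg_level + (1.0 - alpha) * rms(frame)
            out.append(list(frame))
        else:
            # Gap between voice transmissions: synthesize comfort noise.
            sigma = gain * avg_level
            out.append([rng.gauss(0.0, sigma) for _ in range(frame_len)])
    return out
```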
Echo cancellation is used in telephony to remove echo from a voice communication in order to improve voice quality. Echo cancellation involves first recognizing an originally transmitted signal that re-appears, with some delay, in a transmitted or received signal. Upon recognition, an echo can be removed by subtracting the echo from a transmitted or received signal. Echo cancellation is generally implemented using a digital signal processor (DSP).
Two primary sources of echo in telephony are acoustic echo and hybrid echo. Acoustic echo arises when sound from a speaker of a telephone handset is picked up by a microphone of the telephone handset. For example, acoustic echo may occur in conjunction with hands-free car phone systems, a standard telephone in speakerphone or hands-free mode, conference telephones, installed room systems that use ceiling speakers and table-top microphones, video conferencing systems, etc. Direct acoustic path echo is attributable to sound from a speaker of a handset that enters a microphone of the handset substantially unaltered. When indirect acoustic path echo (reverberation) occurs, the echo can be difficult to effectively cancel (unlike echo associated with a direct acoustic path) as the original sound is altered by ambient space. The altered echo may be attributed to certain frequencies being absorbed by soft furnishings and reflection of different frequencies at varying strength.
Acoustic echo cancellers are usually designed to deal with changes and additions to an original signal caused by imperfections of a speaker, imperfections of a microphone, reverberant space, and physical coupling. In general, acoustic echo cancellation (AEC) algorithms approximate a next sample by comparing a current sample with one or more previous samples. This information is then used to predict how sound is altered by an acoustic space. In this case, the model of the acoustic space is continually updated. The changing nature of a sampled signal is mainly due to changes in the acoustic environment, not changes in the characteristics of a loudspeaker, a microphone, or physical coupling. That is, changes in a sampled signal are usually attributable to objects moving in an acoustic environment and movement of a microphone within the environment. For example, when a door is closed or opened, a chair is pulled in closer to a table, or drapes are opened or closed, a change in reverberation of sound in an acoustic space occurs. To address changes in acoustic space, an echo cancellation algorithm may employ non-linear processing (NLP), which allows an algorithm to make changes to an acoustic space model that are suggested (but not yet confirmed) by signal comparison.
Hybrid (electric) echo is generated in public switched telephone networks (PSTNs) as a result of the reflection of electrical energy by a hybrid circuit. Hybrid echo may also be generated in voice-over-packet network systems, if the systems contain network elements (such as access gateways) that are equipped with access loop interfaces. As is known, most telephone local loops are two-wire circuits, while transmission facilities are usually four-wire circuits. A hybrid circuit or hybrid (typically, a part of an electronic device called a subscriber line interface circuit (SLIC)) converts a signal between the two-wire and four-wire circuits. Unfortunately, when an impedance mismatch occurs, a hybrid produces a hybrid echo signal. An adaptive filter (included in a line echo canceller or a network echo canceller) learns about characteristics of the hybrid during an adaptation process. The output signal from the adaptive filter is inverted and combined with the hybrid echo signal. When the adaptation process is performed correctly, the combination of the hybrid echo signal and the inverted output signal of the adaptive filter produces a very small signal (called an error signal). Ideally, the error signal is small enough that it is not perceived audibly.
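The adaptation process can be illustrated with a short sketch. The text does not name an adaptation algorithm; the example below assumes a normalized least-mean-squares (NLMS) update, which is one common choice. The function name and the values of `taps`, `mu`, and `eps` are likewise assumptions made for illustration.

```python
def nlms_echo_canceller(far_end, near_in, taps=8, mu=0.5, eps=1e-8):
    """Adaptive line echo canceller sketch using an NLMS update (assumed).

    far_end: samples sent toward the hybrid (the reference signal).
    near_in: samples returning from the hybrid (here, hybrid echo only).
    Returns (error_signal, final_coefficients).
    """
    w = [0.0] * taps   # adaptive filter coefficients (model of the hybrid)
    x = [0.0] * taps   # most recent far-end samples, newest first
    errors = []
    for ref, d in zip(far_end, near_in):
        x = [ref] + x[:-1]
        # Adaptive filter output: the current estimate of the hybrid echo.
        y = sum(wi * xi for wi, xi in zip(w, x))
        # Combining the inverted filter output with the echo gives the error.
        e = d - y
        # Normalized update: step size scaled by the reference signal energy.
        norm = sum(xi * xi for xi in x) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x)]
        errors.append(e)
    return errors, w
```

In this idealized setting (no near-end speech or noise, and an echo path shorter than the filter) the error signal decays toward zero. Real hybrids and real signals do not behave this well, so some residual error generally remains.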
In practice, the adaptation process rarely produces an ideal characteristic of the hybrid, and the error signal is often so large that other approaches for reducing the error signal are needed. A typical method of reducing the energy of the error signal is based on NLP. NLP also usually reduces natural/environmental background noise injected at a near-end of a network connection. As a result, a far-end talker is not exposed to the natural/environmental background noise injected into the telephone connection at the near-end. To compensate and to produce more natural conditions under which the far-end talker participates in the telephone call, an injection of comfort noise by the echo canceller has been employed. Ideally, comfort noise should be indistinguishable from the natural/environmental background noise present at the near-end.
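The interaction between NLP and comfort noise injection can be sketched as follows. This is a simplified illustration under stated assumptions, not a description of any particular echo canceller: the NLP is modeled as a frame-level suppressor, the comfort noise is white Gaussian noise matched only in level (not in spectrum) to the near-end background, and the function name, the `clip_rms` and `alpha` values, and the `far_end_active` flags are hypothetical.

```python
import math
import random


def rms(frame):
    """Root-mean-square level of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def nlp_with_comfort_noise(error_frames, far_end_active,
                           clip_rms=0.05, alpha=0.9, seed=0):
    """Suppress low-level residual echo frames and inject comfort noise.

    error_frames: error signal from the adaptive filter, split into frames.
    far_end_active: one boolean per frame, True while the far end is talking.
    clip_rms and alpha are illustrative values.
    """
    rng = random.Random(seed)
    noise_level = 0.0  # running estimate of the near-end background noise RMS
    out = []
    for frame, far_active in zip(error_frames, far_end_active):
        level = rms(frame)
        if far_active and level < clip_rms:
            # Residual echo suspected: drop the frame (which also removes the
            # background noise) and inject comfort noise at the estimated level.
            out.append([rng.gauss(0.0, noise_level) for _ in frame])
        else:
            if not far_active and level < clip_rms:
                # Far end silent and level low: treat as background noise only.
                noise_level = alpha * noise_level + (1.0 - alpha) * level
            out.append(list(frame))
    return out
```

Matching only the level is the crudest form of comfort noise. Noise that is indistinguishable from the near-end background would also need to track its spectral shape, which this sketch does not attempt.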