In voice communication, speech is present for only approximately 40% of the time; the remaining time contains silence or background noise (collectively referred to as background noise below). Discontinuous transmission (DTX) systems and comfort noise generation (CNG) technology were developed to reduce the bandwidth used to transmit background noise.
DTX means that, in a background noise period, an encoder encodes and sends the audio signal intermittently according to a policy, instead of continuously encoding and sending the audio signal of every frame. A frame that is intermittently encoded and sent in this way is generally referred to as a silence insertion descriptor (SID) frame. The SID frame generally includes some characteristic parameters of the background noise, such as an energy parameter and a spectrum parameter. On the decoder side, a decoder may generate a continuous recreated background noise signal according to the background noise parameters obtained by decoding the SID frame. A method for generating continuous background noise in a DTX period on the decoder side is referred to as CNG. The objective of CNG is not to accurately recreate the background noise signal of the encoder side, because a large amount of time-domain background noise information is lost when the background noise signal is encoded and transmitted discontinuously. Instead, the objective of CNG is to generate, on the decoder side, background noise that meets the subjective auditory perception requirement of a user, thereby reducing user discomfort.
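The intermittent sending described above can be illustrated with a minimal sketch. The text only says that SID frames are sent "according to a policy", so the fixed-interval policy below is an assumption made for illustration, as are the per-frame speech/noise flags (`vad_flags`), the `Packet` type, and the `sid_interval` parameter; none of these names come from the original text or from any standard.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Packet:
    kind: str          # "SPEECH" for a continuously encoded frame, "SID" for a noise descriptor
    frame_index: int   # index of the frame the packet was built from


def dtx_schedule(vad_flags: Sequence[bool], sid_interval: int = 8) -> List[Optional[Packet]]:
    """Decide, frame by frame, what a hypothetical DTX encoder transmits.

    vad_flags[i] is True when frame i is classified as speech (assumed input).
    Returns one entry per frame: a Packet if something is sent, None otherwise.
    """
    sent: List[Optional[Packet]] = []
    frames_since_sid: Optional[int] = None  # None: no SID sent yet in this noise period
    for i, is_speech in enumerate(vad_flags):
        if is_speech:
            # Speech frames are always encoded and sent.
            sent.append(Packet("SPEECH", i))
            frames_since_sid = None
        elif frames_since_sid is None or frames_since_sid + 1 >= sid_interval:
            # First noise frame, or the interval has elapsed: send a SID frame.
            sent.append(Packet("SID", i))
            frames_since_sid = 0
        else:
            # Nothing is transmitted; the decoder keeps generating comfort noise.
            sent.append(None)
            frames_since_sid += 1
    return sent
```

With `sid_interval=4`, a run of ten noise frames produces three SID frames and seven frames in which nothing is sent, which is where the bandwidth saving comes from.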
In existing CNG technology, comfort noise is generally obtained using a linear prediction-based method, that is, a method in which a random noise excitation is used on the decoder side to excite a synthesis filter. Although background noise can be obtained using such a method, the generated comfort noise differs from the original background noise in terms of the subjective auditory perception of a user. When a continuously encoded frame transitions to a comfort noise frame, this perceptual difference may cause discomfort to the user.
A method for using CNG is stipulated in the adaptive multi-rate wideband (AMR-WB) standard of the 3rd Generation Partnership Project (3GPP), and the CNG technology of AMR-WB is also based on linear prediction. In the AMR-WB standard, a SID frame includes a quantized background noise signal energy coefficient and a quantized linear prediction coefficient, where the background noise energy coefficient is a logarithmic energy coefficient of the background noise, and the quantized linear prediction coefficient is expressed by a quantized immittance spectral frequency (ISF) coefficient. On the decoder side, the energy and the linear prediction coefficient of the current background noise are estimated according to the energy coefficient information and the linear prediction coefficient information included in the SID frame. A random noise sequence is generated using a random number generator and is used as an excitation signal for generating comfort noise. A gain of the random noise sequence is adjusted according to the estimated energy of the current background noise such that the energy of the random noise sequence is consistent with the estimated energy of the current background noise. The random sequence excitation obtained after the gain adjustment is used to excite a synthesis filter whose coefficient is the estimated linear prediction coefficient of the current background noise. The output of the synthesis filter is the generated comfort noise.
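The decoder-side steps described above (random excitation, gain adjustment, synthesis filtering) can be sketched as follows. This is a simplified illustration, not the AMR-WB reference procedure: it assumes the linear prediction coefficients have already been recovered from the ISF representation, that the energy coefficient is a base-2 logarithm of the frame energy, and that the filter has the all-pole form 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p. The function name, its parameters, and the Gaussian noise source are all hypothetical.

```python
import math
import random
from typing import List, Optional, Sequence, Tuple


def generate_comfort_noise(
    log_energy: float,
    lpc: Sequence[float],
    frame_len: int = 256,
    seed: int = 0,
    state: Optional[Sequence[float]] = None,
) -> Tuple[List[float], List[float], List[float]]:
    """Return (comfort_noise, scaled_excitation, filter_memory) for one frame.

    log_energy: assumed to be log2 of the target frame energy.
    lpc: coefficients a_1..a_p of A(z) = 1 + sum_k a_k z^-k (assumed convention).
    state: past filter outputs, most recent first, carried between frames.
    """
    rng = random.Random(seed)

    # Step 1: random noise sequence used as the excitation signal.
    excitation = [rng.gauss(0.0, 1.0) for _ in range(frame_len)]

    # Step 2: adjust the gain so the excitation energy matches the estimate.
    target_energy = 2.0 ** log_energy
    excitation_energy = sum(x * x for x in excitation)
    gain = math.sqrt(target_energy / excitation_energy)
    excitation = [gain * x for x in excitation]

    # Step 3: excite the all-pole synthesis filter 1/A(z).
    order = len(lpc)
    memory = list(state) if state is not None else [0.0] * order
    output = []
    for x in excitation:
        y = x - sum(a * m for a, m in zip(lpc, memory))
        output.append(y)
        memory = ([y] + memory)[:order]  # shift in the newest output sample
    return output, excitation, memory
```

Passing the returned `filter_memory` as `state` for the next frame keeps the filter continuous across frame boundaries. Because the excitation is spectrally flat noise, only the envelope described by `lpc` shapes the output.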
In a method for generating comfort noise using a random noise sequence as an excitation signal, although relatively comfortable noise can be obtained and the spectral envelope of the original background noise can be roughly recovered, the spectral details of the original background noise may be lost. As a result, the generated comfort noise still differs from the original background noise in terms of subjective auditory perception. This difference may cause auditory discomfort to a user when a continuously encoded speech segment transitions to a comfort noise segment.