In a communication system, input audio signals are generally encoded and then transmitted to the peer. Channel bandwidth is scarce, especially in wireless and mobile communication systems. In a bidirectional conversation, each party speaks for about half of the total conversation time and is silent for the other half. When channel bandwidth is constrained, a communication system that transmits signals only while a person is speaking, and stops transmitting while the person is silent, saves a large amount of bandwidth for other users. For that purpose, the communication system needs to know when the person starts speaking and when the person stops speaking; that is, it needs to know when speech is active, which involves Voice Activity Detection (VAD). Generally, when speech is active, the voice coder performs coding at a high rate; when handling background signals without voice, the coder performs coding at a low rate. Through VAD, the communication system determines whether an input audio signal is a voice signal or background noise, and applies a different coding technology to each.
The foregoing mechanism works in general background environments. However, when the background signals are music signals, coding at a low rate severely degrades the listener's subjective perception. A new requirement therefore arises: the VAD system must identify the background music scenario effectively and improve the coding quality of the background music accordingly.
A technology for detecting complex signals is put forward in the Adaptive Multi-Rate (AMR) VAD1. “Complex signals” here refer to music signals. For each frame in the AMR VAD, the maximum correlation vector of this frame is obtained from the AMR coder and normalized into the range [0, 1]. A long-term moving average correlation vector “corr_hp” of the normalized best_corr_hpm is calculated through the following formula:
corr_hp = α·corr_hp + (1 − α)·best_corr_hpm,
where α is a forgetting factor that falls within [0.8, 0.98].
The corr_hp of each frame is compared with an upper threshold and a lower threshold. If the corr_hp is higher than the upper threshold for 8 consecutive frames, or higher than the lower threshold for 15 consecutive frames, the complex signal flag “complex_warning” is set to 1, indicating that a complex signal is detected.
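The detection rule described above can be illustrated with a minimal Python sketch. It is not the AMR reference implementation. The class name ComplexSignalDetector, the threshold values 0.6 and 0.5, and the default α of 0.9 are assumptions chosen for illustration; only the recursion, the α range, and the frame counts 8 and 15 come from the description. The sketch also assumes that best_corr_hpm is a single value per frame, already normalized into [0, 1], and that the flag is re-evaluated on every frame.

```python
from dataclasses import dataclass


@dataclass
class ComplexSignalDetector:
    """Illustrative tracker for the long-term correlation average corr_hp."""

    alpha: float = 0.9            # forgetting factor; the description gives [0.8, 0.98]
    upper_threshold: float = 0.6  # hypothetical value, not taken from the description
    lower_threshold: float = 0.5  # hypothetical value, not taken from the description
    upper_frames: int = 8         # consecutive frames required above the upper threshold
    lower_frames: int = 15        # consecutive frames required above the lower threshold
    corr_hp: float = 0.0          # long-term moving average, starts at zero (assumption)
    upper_run: int = 0            # current run length above the upper threshold
    lower_run: int = 0            # current run length above the lower threshold

    def update(self, best_corr_hpm: float) -> int:
        """Process one frame and return complex_warning (0 or 1)."""
        # corr_hp = alpha * corr_hp + (1 - alpha) * best_corr_hpm
        self.corr_hp = self.alpha * self.corr_hp + (1.0 - self.alpha) * best_corr_hpm

        # A run continues only while corr_hp stays above the threshold.
        self.upper_run = self.upper_run + 1 if self.corr_hp > self.upper_threshold else 0
        self.lower_run = self.lower_run + 1 if self.corr_hp > self.lower_threshold else 0

        if self.upper_run >= self.upper_frames or self.lower_run >= self.lower_frames:
            return 1
        return 0
```

Calling update once per frame with the normalized correlation value yields the flag for that frame. Because corr_hp is a slowly moving average, a larger α makes the detector react later to a sustained rise in correlation and also release later once the correlation drops.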
In the process of implementing the present invention, the inventor finds at least the following defects in the prior art:
The prior art can detect music signals, but cannot tell whether the music signals are foreground music or background music, and cannot apply an appropriate coding technology to the background music signals according to the bandwidth conditions. Moreover, the prior art may treat conventional background noise, such as babble noise, as a complex signal, which works against saving bandwidth.