Code excited linear prediction, or CELP, is a method of encoding speech for transmission over a network. CELP is a data compression algorithm that is well suited for digital audio signals containing human speech. CELP and other speech encoders reduce the amount of data that must be transmitted across the network by transmitting compressed (encoded) data to the receiving end, which must decompress (decode) the data to reproduce the audio signal.
As used herein, the term “speech energy” refers to the energy of the audio signal that can be attributed to speech content. An audio stream containing speech is said to have high speech energy, while an audio stream containing silence or background noise but no speech is said to have low speech energy.
There are some situations in which the receiver may want to determine the speech energy of the bit stream and to make some decision based on the determined speech energy. For example, one form of noise reduction is to measure the speech energy of a signal and mute the signal if the speech energy is below some threshold. Automatic gain control circuits may boost the amplification for soft signals (i.e., having low speech energy) and reduce the amplification for loud signals (i.e., having high speech energy).
As another example, in a voice conference having multiple participants, the signal from each participant's microphone is typically sent to a conference bridge, mixed with the signals from the other participants' microphones, and broadcast to the loudspeakers of all the participants. When the number of participants becomes large, simply mixing the input from all microphones becomes impractical, because the signal-to-noise ratio is reduced and because the competing voices overlap, distort, and/or cancel each other. In other words, the output signal is noisy and the voices of the participants become indistinct. In this situation it is desirable to determine which participants are actively talking and add only those inputs into the mix. The input signals from participants who are not talking are not fed into the mix, improving the fidelity of the output.
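By way of illustration only, the selection of active talkers described above might be sketched as follows. The function names, the energy threshold, and the cap on the number of mixed inputs are hypothetical and are not taken from any particular conference bridge; the sketch assumes a per-participant speech energy value is already available.

```python
def select_active_speakers(energies, threshold, max_speakers):
    """Return ids of the loudest participants whose energy meets the threshold.

    energies: dict mapping a participant id to a speech energy value
    (hypothetical input; how the energy is obtained is outside this sketch).
    """
    active = [(energy, pid) for pid, energy in energies.items()
              if energy >= threshold]
    # Loudest first; only the top max_speakers are kept for the mix.
    active.sort(key=lambda pair: pair[0], reverse=True)
    return [pid for _, pid in active[:max_speakers]]


def mix(frames, active_ids):
    """Sum, sample by sample, the frames of the selected participants only."""
    selected = [frames[pid] for pid in active_ids]
    if not selected:
        return []
    return [sum(samples) for samples in zip(*selected)]
```

In this sketch, inputs from participants below the threshold never reach the mixer, which is the behavior the preceding paragraph describes.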
The standard way to compute the speech energy of a signal is to first obtain the speech signal as PCM samples, then compute the sum of the squares of those PCM samples, frame-by-frame or sub-frame-by-sub-frame. However, this approach has the disadvantage that the packet payloads must be fully decoded before the signal strength of the voice payload can be measured.
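The sum-of-squares computation described above can be sketched as follows. This is a minimal example that assumes the PCM samples have already been decoded into a list of numbers; the function name and the choice to drop a trailing partial frame are assumptions made for illustration.

```python
def frame_energies(pcm, frame_len):
    """Return the sum of squared samples for each full frame of pcm.

    pcm: sequence of decoded PCM sample values.
    frame_len: number of samples per frame (or sub-frame).
    A trailing partial frame, if any, is ignored (an assumption of this sketch).
    """
    energies = []
    for start in range(0, len(pcm) - frame_len + 1, frame_len):
        frame = pcm[start:start + frame_len]
        energies.append(sum(sample * sample for sample in frame))
    return energies
```

Note that the input to this function exists only after full decoding, which is the disadvantage identified above.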
A method to estimate speech signal energy based on EVRC codec parameters is presented in Doh-Suk Kim et al., “Frame energy estimation based on speech codec parameters,” ICASSP, 1641-1644 (2008), and in U.S. Patent Application Publication No. 2009/0094026 to Cao et al., “Method of determining an estimated frame energy of a communication.” In this method, decoded parameters are used to estimate the excitation energy λe(m) of the signal that is used as an input to an LPC synthesis filter. The impulse response of that filter yields the estimated LPC synthesis filter energy λh(K,m), where K is the number of samples used to compute the impulse response. An estimated speech energy λ(m) is calculated using the estimated excitation energy λe(m) and the estimated LPC synthesis filter energy λh(K,m). While this method provides an estimated frame energy that correlates well with the actual frame energy, it requires a transform to generate the impulse response of the synthesis filter. Since each frame has its own set of LPC synthesis filter parameters, the impulse response must be recalculated every frame, which is computationally expensive.
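To illustrate why the per-frame impulse response is costly, the following sketch computes the energy of the first K samples of the impulse response of an all-pole filter 1/A(z). It is not the implementation of the cited references. It assumes the sign convention A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, uses a direct time-domain recursion rather than any particular transform, and assumes for illustration only that the two energies are combined as a product; the function names are hypothetical.

```python
def synthesis_filter_energy(a, K):
    """Energy of the first K impulse-response samples of 1/A(z).

    a: coefficients a_1..a_p, assuming A(z) = 1 + sum_k a_k z^-k.
    K: number of impulse-response samples to accumulate.
    """
    h = []
    for n in range(K):
        # Unit impulse at n = 0, then feedback from earlier outputs.
        acc = 1.0 if n == 0 else 0.0
        for k, a_k in enumerate(a, start=1):
            if n - k >= 0:
                acc -= a_k * h[n - k]
        h.append(acc)
    return sum(x * x for x in h)


def estimated_frame_energy(excitation_energy, a, K):
    """Combine the two estimates as a product (an assumption of this sketch)."""
    return excitation_energy * synthesis_filter_energy(a, K)
```

The recursion costs on the order of K times p multiply-accumulate operations, and it must be repeated for every frame because the coefficients change from frame to frame.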
Accordingly, there exists a need for systems and methods of determining the speech energy of an encoded audio signal without requiring complete decoding of an encoded data stream and without computationally expensive transform operations. Specifically, there exists a need for systems and methods to estimate speech energy based on CELP parameters extracted from a partially-decoded CELP-encoded bit stream. Such an estimation could be used, for example, to select active speakers in a teleconferencing bridge without having to fully decode all of the CELP-encoded bit streams, and then to fully decode only the bit streams of some or all of the active speakers.