Despite the proliferation of alternative modes of communication, verbal communication is often the preferred method for exchanging information. In particular, telephonic communication has enabled speaking and listening between two parties to span the globe. The intersection of current digital and Internet technology and voice communication, however, is not without challenges.
One such challenge is efficiently utilizing available bandwidth. Digital communication systems necessarily require converting analog voice or audio signals to digital signals. The digital signals in turn occupy bandwidth as they navigate to their destination. Maximizing bandwidth, and the efficient utilization thereof, are omnipresent concerns for Internet and multimedia communications.
Another challenge is creating a communication environment with which the users are familiar and comfortable. The benchmark for voice and noise communication is the telephone. Telephonic communication is rich with sounds, inflections, nuances, and other characteristics of verbal communication. These extra features add context to the communication and should be preserved in Internet or multimedia communication applications. Further, the connection is always open in the sense that, for the duration of the telephone call, each call participant can generally hear what is happening on the other end. Unfortunately, transmitting silence, or background noise without any accompanying voice, is an inefficient use of bandwidth for most communication applications.
The International Telecommunication Union Recommendation G.729 (“G.729”) describes fixed rate speech coders for Internet and multimedia communications. In particular, the coders compress speech and audio signals sampled at 8 kHz to a bit rate of 8 kbps. The coding algorithm utilizes Conjugate-Structure Algebraic-Code-Excited-Linear-Prediction (“CS-ACELP”) and is based on a Code-Excited Linear-Prediction (“CELP”) coding model. The coder operates on 10 millisecond speech frames corresponding to 80 samples at 8000 samples per second. Each transmitted frame is first analyzed to extract CELP model parameters such as linear-prediction filter coefficients and adaptive- and fixed-codebook indices and gains. The parameters are encoded and transmitted. At the decoder side, the speech is reconstructed by utilizing a short-term synthesis filter based on a 10th order linear prediction. The decoder further utilizes a long-term synthesis filter based on an adaptive codebook approach. The reconstructed speech is post-filtered to enhance speech quality.
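The frame figures quoted above follow from simple arithmetic, which the short sketch below works through. The constants come from the text (8 kHz sampling, 10 millisecond frames, 8 kbps); the function names are illustrative only and are not part of G.729.

```python
# Frame arithmetic for the figures quoted in the text.
# Function names are hypothetical; only the constants come from the description.

SAMPLE_RATE_HZ = 8000   # 8 kHz sampling
FRAME_MS = 10           # 10 millisecond speech frames
BIT_RATE_BPS = 8000     # 8 kbps coded bit rate


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ,
                      frame_ms: int = FRAME_MS) -> int:
    """Number of input samples covered by one frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(bit_rate_bps: int = BIT_RATE_BPS,
                   frame_ms: int = FRAME_MS) -> int:
    """Number of coded bits available to one frame at the fixed rate."""
    return bit_rate_bps * frame_ms // 1000


def frames_per_second(frame_ms: int = FRAME_MS) -> int:
    """Number of frames produced each second."""
    return 1000 // frame_ms


if __name__ == "__main__":
    print(samples_per_frame())   # 80 samples, matching the text
    print(bits_per_frame())      # 80 bits per frame
    print(frames_per_second())   # 100 frames per second
```

At these rates each 10 millisecond frame carries 80 coded bits for 80 input samples, that is, one coded bit per sample on average.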
G.729 Annex B (“Annex B”) defines voice activity detection (“VAD”), discontinuous transmission (“DTX”), and comfort noise generation (“CNG”) algorithms. In conjunction with G.729, Annex B attempts to improve the listening environment and bandwidth utilization over that created by G.729 alone. In short, and with reference to FIG. 1, the algorithms and systems employed by Annex B detect the presence or absence of voice activity with a VAD 104. When the VAD 104 detects voice activity, it triggers an Active Voice Encoder 103, which transmits the encoded voice communication over a Communication Channel 105, and an Active Voice Decoder 108 is utilized to recover Reconstructed Speech 109. When the VAD 104 does not detect voice activity, it triggers a Non Active Voice Encoder 102, which, in conjunction with the Communication Channel 105 and a Non Active Voice Decoder 107, transmits and recovers Reconstructed Speech 109.
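The per-frame switching described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the Annex B algorithm: the energy-threshold detector, its threshold value, and every function name are hypothetical stand-ins, and the encoders themselves are represented only by the label of the path that would be triggered.

```python
# Sketch of VAD-driven routing between the active and non-active paths.
# ASSUMPTION: energy_vad is a placeholder detector, NOT the Annex B VAD.
from typing import Callable, List, Sequence

Frame = Sequence[int]  # one frame of PCM samples (80 samples per the text)


def energy_vad(frame: Frame, threshold: float = 1000.0) -> bool:
    """Placeholder detector: True when mean squared amplitude exceeds threshold.

    The threshold is an arbitrary illustrative value.
    """
    if not frame:
        return False
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold


def route_frame(frame: Frame, vad: Callable[[Frame], bool] = energy_vad) -> str:
    """Return which path the detector would trigger for this frame.

    "active"   corresponds to the Active Voice Encoder 103 path.
    "inactive" corresponds to the Non Active Voice Encoder 102 path.
    """
    return "active" if vad(frame) else "inactive"


def route_stream(frames: Sequence[Frame]) -> List[str]:
    """Apply the routing decision to each frame in order."""
    return [route_frame(f) for f in frames]


if __name__ == "__main__":
    silence = [0] * 80
    loud = [1000] * 80
    print(route_stream([silence, loud, silence]))
```

The decision is made independently for every frame, so a real detector would typically add smoothing to avoid rapid toggling between the two paths; that refinement is omitted here.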
The nature of Reconstructed Speech 109 depends on whether or not the VAD 104 has detected voice activity. When the VAD 104 detects voice activity, the Reconstructed Speech 109 is the encoded and decoded voice that has been transmitted over the Communication Channel 105. When the VAD 104 does not detect voice activity, the Reconstructed Speech 109 is comfort noise generated per the Annex B CNG algorithm. Given that, in general, more than 50% of a speech communication consists of intervals between spoken words, methods that reduce the bandwidth requirements of the non-speech intervals without interfering with the communication environment are desired.
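The motivation can be made concrete with a rough average bit rate estimate. The sketch below assumes a simple two-state model in which a fraction of the call is coded at the active rate and the remainder at some lower rate; the function name and the inactive rate parameter are hypothetical, and only the 8 kbps active rate and the 50% figure come from the text.

```python
# Rough average bit rate under a two-state (active / inactive) model.
# ASSUMPTION: inactive_bps is a free parameter, not a figure from Annex B.


def average_bit_rate(active_fraction: float,
                     active_bps: float = 8000.0,
                     inactive_bps: float = 0.0) -> float:
    """Time-weighted mean of the active and inactive bit rates."""
    if not 0.0 <= active_fraction <= 1.0:
        raise ValueError("active_fraction must be between 0 and 1")
    return (active_fraction * active_bps
            + (1.0 - active_fraction) * inactive_bps)


if __name__ == "__main__":
    # Voice present half the time, nothing sent otherwise (idealized bound).
    print(average_bit_rate(0.5))
    # Same activity, with a hypothetical 1000 bps spent on inactive intervals.
    print(average_bit_rate(0.5, inactive_bps=1000.0))
```

Under this model, if voice is present no more than half the time, the average rate approaches half of the fixed 8 kbps rate as the inactive rate approaches zero.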