In recent years, the telecommunications industry has witnessed an increase in the bandwidth requirements of communication channels. This can mainly be attributed to increasingly affordable telecommunication services and to the growing popularity of the Internet. In a typical interaction where two users are communicating via a telephone connection, user A speaks into a microphone or telephone set connected to the public switched telephone network (PSTN). The speech signal is digitised and sent over the telephone lines to a switch. At the switch, the speech is encoded and then divided into blocks for transmission. IP packets and ATM cells are examples of such blocks. The protocols that define them are well known in the art of data transmission. The blocks are transmitted over the communication channel to a receiver switch that takes the blocks and rebuilds the speech signal according to the appropriate protocol. The rebuilt speech is then synthesised at the headset of user B, who is communicating with user A.
In a full-duplex conversation, where information is simultaneously transmitted in both directions over a two-way channel, a large proportion of the conversation in any one direction is idle or silent. This results in a significant waste of bandwidth, since a large portion of the bandwidth carries silence signals rather than useful information.
Commonly, in order to improve bandwidth usage, transmission of blocks is interrupted during silent or inactive periods. With a high aggregate data rate, the use of statistical multiplexing in combination with the interruption of transmission of the silence blocks can lead to a higher number of users and/or an increase in data throughput for a given communication link. At the receiver end, data representative of silence blocks can be used to “fill in” the gaps that the untransmitted silence blocks would otherwise have occupied.
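The interruption of transmission and the receiver-side fill-in described above can be illustrated with a minimal Python sketch. It is an illustration only and does not correspond to any particular protocol. The function names, the 80-sample frame length, and the use of `None` to mark an untransmitted block are all assumptions introduced for the example.

```python
from typing import List, Optional, Sequence

FRAME_SAMPLES = 80  # assumed frame length (e.g. 10 ms at 8 kHz); not from the text


def transmit(frames: Sequence[List[int]],
             active: Sequence[bool]) -> List[Optional[List[int]]]:
    """Send only active frames; None marks a slot where nothing was sent."""
    return [frame if is_active else None
            for frame, is_active in zip(frames, active)]


def receive(slots: Sequence[Optional[List[int]]],
            fill_frame: List[int]) -> List[List[int]]:
    """Rebuild the stream, substituting fill_frame wherever nothing arrived."""
    return [slot if slot is not None else list(fill_frame) for slot in slots]


def fraction_suppressed(active: Sequence[bool]) -> float:
    """Fraction of frames that were not transmitted."""
    return 1.0 - sum(active) / len(active)


# Hypothetical four-frame stream in which the middle two frames are silent.
frames = [[1] * FRAME_SAMPLES, [0] * FRAME_SAMPLES,
          [0] * FRAME_SAMPLES, [2] * FRAME_SAMPLES]
active = [True, False, False, True]

slots = transmit(frames, active)
rebuilt = receive(slots, fill_frame=[0] * FRAME_SAMPLES)
saved = fraction_suppressed(active)
```

In this hypothetical stream, half of the frames are never sent, and the receiver still reconstructs a stream of the original length by inserting the fill frame in each gap.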
In addition to the primary talker on either end of the communication channel, there could be a significant amount of background noise, such as car noise, street noise, multiple background talkers, background music, background office noise and many others. Unfortunately, the silence blocks, typically designed to represent white noise, do not closely mimic the background noise present when the primary speakers are talking. As a result, the background noise heard at the receiver end during silence periods differs from the background noise heard while the speaker is speaking. This is often aggravating for the users of the communication service, since the sounds they hear are disjointed.
One way to improve the performance of such a system is to transmit some blocks of silence information to allow the receiver to better mimic the background noise. In this regard, the reader may wish to consult ITU-T Recommendations G.729 Annex B and G.723.1 Annex A for more information. The content of the above documents is hereby incorporated by reference.
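The idea of using transmitted silence information to mimic the background noise can be sketched as follows. This is a simplified illustration that matches only the noise level. It is not the comfort noise scheme of either cited Recommendation, and the function names and the Gaussian noise model are assumptions made for the example.

```python
import math
import random
from typing import List, Sequence


def estimate_rms(noise_frames: Sequence[Sequence[float]]) -> float:
    """Root-mean-square level of the received background-noise frames."""
    samples = [s for frame in noise_frames for s in frame]
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def comfort_noise(rms: float, n_samples: int,
                  rng: random.Random) -> List[float]:
    """Gaussian noise whose expected RMS level equals the estimate."""
    return [rng.gauss(0.0, rms) for _ in range(n_samples)]


# Hypothetical noise frames received during the start of a silence period.
received = [[3.0, -3.0], [3.0, -3.0]]
level = estimate_rms(received)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
generated = comfort_noise(level, 4000, rng)
generated_level = estimate_rms([generated])
```

The generated noise follows the level of the received frames rather than a fixed white-noise level, which is the effect the transmitted silence information is meant to achieve. A practical generator would normally also match the spectral shape of the noise.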
A deficiency of the above-described systems is that they are typically designed for the worst-case background noise level, and therefore transmit silence blocks for a time duration long enough to allow the receiver to mimic the worst-case background noise situation. However, the background noise is most often quiet. Bandwidth is thus wasted on the transmission of silence blocks that do not carry valuable information.
Another solution is proposed in the co-pending patent application Ser. No. 09/218,009 of W. P. LeBlanc and S. A. Mahmoud, filed on Dec. 22, 1998 and assigned to Nortel Networks Corporation. LeBlanc et al. teach a voice activity detector (VAD) that implements a novel variable hangover algorithm based on input signal characteristics. More specifically, the voice activity detector observes whether a signal conveys active audio information, such as speech, or passive audio information, such as silence or regular background noise, and implements a hangover period of variable duration that dynamically determines how much signal information needs to be sent over the communication channel when the signal contains passive audio information. In general, when the signal contains only silence, the hangover period is short, since no information is required at the other end of the communication channel. On the other hand, when background noise is present, some signal information is sent over the channel to provide enough data to properly train a comfort noise generator, which can then synthesize the background noise.
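The general notion of a hangover period whose duration depends on the passive signal can be sketched as follows. This is not the algorithm of LeBlanc et al., whose details are not given here. It is a hypothetical energy-threshold detector in which the thresholds, the two hangover lengths, and all names are assumptions chosen for illustration.

```python
from typing import List, Sequence


def vad_with_hangover(energies: Sequence[float],
                      speech_threshold: float,
                      silence_floor: float,
                      short_hangover: int = 1,
                      long_hangover: int = 3) -> List[bool]:
    """Return a transmit/suppress decision for each frame.

    Frames at or above speech_threshold are treated as active and sent.
    When speech ends, a hangover is chosen from the first passive frame:
    short if it is at or below silence_floor, long if noise is present.
    """
    decisions: List[bool] = []
    remaining = 0
    previous_was_speech = False
    for energy in energies:
        if energy >= speech_threshold:
            decisions.append(True)
            previous_was_speech = True
            continue
        if previous_was_speech:
            # Speech has just ended: pick the hangover duration.
            remaining = (short_hangover if energy <= silence_floor
                         else long_hangover)
        previous_was_speech = False
        if remaining > 0:
            decisions.append(True)   # still within the hangover period
            remaining -= 1
        else:
            decisions.append(False)  # suppress transmission
    return decisions


# Hypothetical frame energies: speech followed by true silence ...
quiet = vad_with_hangover([10, 10, 0, 0, 0, 0],
                          speech_threshold=5, silence_floor=0.1)
# ... and speech followed by steady background noise.
noisy = vad_with_hangover([10, 10, 1, 1, 1, 1, 1],
                          speech_threshold=5, silence_floor=0.1)
```

With these assumed parameters, one passive frame is sent after speech that is followed by silence, and three passive frames are sent after speech that is followed by background noise, giving the receiver more data from which to model that noise.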
Compared to the traditional fixed hangover algorithm, the variable hangover algorithm proposed by LeBlanc et al. balances the risk of clipping the low-energy end of speech against the risk of excessive hangover due to classification of noise as speech. Accordingly, the variable-duration hangover algorithm provides a better trade-off between speech quality and bandwidth efficiency than the fixed-duration hangover algorithm. Unfortunately, the invention of LeBlanc et al. exhibits certain weaknesses. Implementation of the variable hangover period taught by LeBlanc et al. has been found to result in signal clipping in certain instances, which is generally aggravating to the users of the communication service. In particular, clipping was detected on low-energy speech endings with slightly longer unvoiced sounds, where such unvoiced sounds include speech segments containing fricatives or sibilants. In a specific example, repeated clipping of the ending of the word “six” was perceived. The word “six” ends with the two unvoiced sounds [ks], [k] being a plosive and [s] being a sibilant fricative.
Accordingly, there exists a need in the industry for an improved method and apparatus for detecting voice signals in a packet voice network, in order to improve speech quality and maximize bandwidth usage.