Low bit-rate speech codecs usually have a discontinuous transmission (DTX) functionality to decrease the bit rate or channel activity on the transmission channel. The DTX functionality uses a voice activity detector (VAD) ahead of the speech encoder to distinguish speech pauses from active speech bursts. During speech pauses, typically only level and spectral information is encoded and sent as silence description (SID) frames or comfort noise frames, which contribute a much lower bit rate to the channel. In a radio channel, the reduced bit rate enables more capacity or better transmission quality due to less radio interference. In packet-based transmission systems (e.g. ATM or IP radio access or core network transmission), DTX enables more transmission capacity through statistical multiplexing: the speech pauses of multiple calls even out the gross bit rate over a high bit-rate packet-based transmission channel.
For these reasons, DTX is a key feature of speech communication systems that use radio or packet transmission media. DTX is therefore widely used in cellular and VoIP networks.
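The DTX principle described above can be illustrated with a minimal sketch. It is not the algorithm of any particular codec: the energy-threshold VAD, the frame length, the threshold value, the SID update interval, and all function names are assumptions chosen for illustration only. Samples are assumed to be floating-point values with a full scale of 1.0, and only the level part of a SID frame is modeled (spectral information is omitted).

```python
import math

FRAME_LEN = 160  # assumed frame size: 20 ms at 8 kHz sampling


def frame_energy_db(frame):
    """Mean power of one frame in dB relative to full scale (1.0)."""
    power = sum(s * s for s in frame) / len(frame)
    return 10.0 * math.log10(max(power, 1e-12))  # floor avoids log10(0)


def toy_vad(frame, threshold_db=-40.0):
    """Hypothetical VAD: a frame counts as active if its level exceeds a threshold."""
    return frame_energy_db(frame) > threshold_db


def dtx_encode(frames, sid_interval=8):
    """Map each input frame to what would be sent over the channel.

    Active frames are sent as full speech frames. During a pause, a SID
    frame carrying only the level is sent every `sid_interval` frames
    (an assumed update rate); the remaining pause frames send nothing.
    """
    sent = []
    pause_count = 0
    for frame in frames:
        if toy_vad(frame):
            pause_count = 0
            sent.append(("SPEECH", None))
        else:
            if pause_count % sid_interval == 0:
                sent.append(("SID", frame_energy_db(frame)))
            else:
                sent.append(("NO_DATA", None))
            pause_count += 1
    return sent


# Usage: one active frame, a ten-frame pause, one active frame.
loud = [0.5] * FRAME_LEN
quiet = [0.0001] * FRAME_LEN
result = dtx_encode([loud] + [quiet] * 10 + [loud])
kinds = [kind for kind, _ in result]
```

In this sketch only two of the ten pause frames produce any transmission, which is the source of the bit-rate saving. A practical VAD is considerably more elaborate than a fixed energy threshold.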
The voice activity detector has a key role in determining whether speech or a pause is present at the input of the speech encoder. Misclassifications by the VAD lead either to a loss of actual speech, if speech is classified as a pause, or to excessive channel activity, if a pause is classified as active speech.
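The two kinds of misclassification can be quantified separately when a frame-by-frame reference labeling is available. The following sketch is illustrative only; the function name, the boolean frame representation, and the two rate definitions are assumptions and are not taken from any standard.

```python
def vad_error_rates(truth, decisions):
    """Return (clipping_rate, excess_activity_rate) for per-frame labels.

    `truth` and `decisions` are equal-length sequences of booleans,
    where True means active speech. The clipping rate is the fraction
    of true speech frames classified as a pause; the excess activity
    rate is the fraction of true pause frames classified as speech.
    """
    if len(truth) != len(decisions):
        raise ValueError("truth and decisions must have the same length")
    speech_frames = sum(1 for t in truth if t)
    pause_frames = len(truth) - speech_frames
    clipped = sum(1 for t, d in zip(truth, decisions) if t and not d)
    false_active = sum(1 for t, d in zip(truth, decisions) if d and not t)
    clipping_rate = clipped / speech_frames if speech_frames else 0.0
    excess_rate = false_active / pause_frames if pause_frames else 0.0
    return clipping_rate, excess_rate


# Usage: four speech frames followed by four pause frames.
truth = [True, True, True, True, False, False, False, False]
decisions = [True, True, True, False, True, False, False, False]
clipping_rate, excess_rate = vad_error_rates(truth, decisions)
```

Keeping the two rates separate reflects that their costs differ: the first degrades what the listener hears, while the second only reduces the capacity gain from DTX.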
In addition to the basic speech vs. pause classification, the VAD should properly detect certain special audio signals. It is desirable that information tones, e.g. ringing and busy tones, are not detected as a speech pause or as background noise. This requirement differs from the basic VAD functionality because both background noise during speech pauses and information tones are typically very stationary. A basic VAD would therefore easily classify both signals as background noise. Typically the VAD has an additional tone detection functionality to ensure that information tones are transmitted continuously over the channel.
Another special audio signal that the VAD has to detect properly is music. It is necessary to detect music signals correctly and not to allow SID frames to be sent over the channel at any time during the music. It is undesirable that a part of the music is detected as a pause or background noise. Such a misdetection results either in temporal clipping of the music or in a part of the music sequence being replaced with high-level comfort noise. The latter phenomenon may generate annoying noise bursts in the middle of the music sequence. State-of-the-art VADs have some kind of music detector to circumvent this problem.
However, as the world is full of different music styles and pieces, it is impossible to design an in-band detector that would always distinguish music from background noise. There is therefore an increased risk that the VAD makes misclassifications and that end users hear annoying sound effects while listening to music from the terminal.
The music detection problem may be especially crucial for a new value-added feature called “caller tunes” or “personalized ringback tone”. In this feature, the conventional ringing tone (ringback tone) sent back to the caller terminal is replaced with real music. This feature has been offered as an extra service by cellular operators. It is clear that an absolutely robust method for music detection is required for this application. Conventional in-band music detectors used together with DTX are typically not robust enough. Because this suboptimal performance makes the system insufficiently reliable, either the DTX functionality must be disabled in the network or the caller tunes feature cannot be provided. The first option would of course negatively affect network capacity. The latter would forgo a new business opportunity. There is thus a clear need to improve the music detection capability of the DTX functionality so as to overcome the above-mentioned problems, for example when music is applied instead of a ringback tone.
Some speech codecs, e.g. AMR (Adaptive Multi-Rate), have built-in music detectors. However, it has been found that a built-in music detector may be unable to detect music correctly in all circumstances. There is therefore a need to find further ways to address the above-described problem. For this purpose, it may be advantageous to find approaches other than developing music detection systems.