The present invention relates to digital signal processing and particularly to the detection of audio data in a received signal frame.
In message transmission systems, for example in a GSM system as considered by way of example below, no radio signal is sent from the transmitter to the receiver in the case of a voice link during a break in speech. This method is referred to as discontinuous transmission (DTX) and is used both in the uplink direction (from the mobile station to the base station) and in the downlink direction (from the base station to the mobile station). The advantages of the DTX method are the reduced power consumption at the transmitter end and the reduced interference level in the entire radio network.
With activated DTX functionality, no signal is sent from the transmitter to the receiver during a break in speech, which means that only noise is received at the reception end. The receiver nevertheless continually attempts to receive a valid signal, for example a valid GSM signal. If the receiver receives a valid GSM signal, it forwards it to a voice decoder. If the receiver does not receive a valid GSM signal, however, it is assumed that the transmitted signal has been disconnected on account of a break in speech at the transmitter end. In that case, the receiver forwards a comfort noise block to the voice decoder in order to generate artificial background noise at the output of the voice decoder.
During a break in speech, the receiver should therefore receive only noise and replace it with comfort noise (CN) in the voice decoder. Problems arise here if the receiver mistakenly detects the received signal containing no voice data as a valid GSM signal containing voice data. In this case, the supposed GSM signal is not replaced by comfort noise but rather is forwarded to the voice decoder. The information content of the supposed GSM signal is arbitrary, however, which means that a cracking sound (“Bong”) of greater or lesser volume is obtained at the output of the voice decoder. These cracking sounds are generally irritating because they occur during a break in speech, that is to say during a relatively silent break in the voice signal.
ETSI specifications 3GPP 46.011, 3GPP 46.012 and 3GPP 46.031 specify the following standard solution for DTX handling in the full-rate voice decoder:
In a first process, the type of the currently received voice frame is determined. A voice frame corresponds to a voice signal of 20 ms in length. To this end, the bits (flags) determined in the channel decoder, namely BFI (Bad Frame Indication), SID (Silence Descriptor) and TAF (Time Alignment Flag), are evaluated. Accordingly, the type of the current voice frame (subsequently also called “Frame Type”) may assume one of the following values:
    GOOD_SPEECH: Valid voice frame
    UNUSABLE: Invalid voice frame
    VALID_SID: Valid SID frame
        Using an SID frame, a.) the comfort noise (background noise) is parameterized at periodic intervals and b.) a DTX period is initiated after a period of speech.
    INVALID_SID: Invalid SID frame
In addition, the current state of the DTX handling is considered. This state (subsequently called “DTX State”) may assume one of the following two values:
    SPEECH_STATE: The DTX handling is in this state if a period of speech is currently in progress. That is to say that no comfort noise has been generated by the voice decoder in the past voice frames.
    CNI_STATE: The DTX handling is in this state if a break in speech is currently in progress, i.e. if comfort noise has been generated by the voice decoder in the past voice frames.
On the basis of the frame type and the DTX state, the following data are forwarded to the actual voice decoder:
    If the frame type has the value GOOD_SPEECH, this frame is forwarded directly to the voice decoder and the DTX state is set to the value SPEECH_STATE. It is assumed that a period of speech is in progress or that one is just starting.
    If the frame type has the value VALID_SID or INVALID_SID, this frame is forwarded to the voice decoder for the purpose of comfort noise generation and the DTX state is set to the value CNI_STATE. It is assumed that a break in speech is in progress or that one is just starting.
    If the frame type has the value UNUSABLE, the operation of the voice decoder is dependent on the DTX state.
        Such a frame type in the DTX state SPEECH_STATE (that is to say during a period of speech) indicates to the voice decoder that this voice frame has been lost and therefore the “Muting Mechanism” needs to be activated.
        Such a frame type in the DTX state CNI_STATE (that is to say during a break in speech) indicates to the voice decoder that the transmitter has been switched off and therefore a comfort noise frame needs to be inserted.
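The decision rules above can be sketched as a small state machine. This is an illustrative sketch only, not code taken from the cited specifications: the function name `dtx_step` and the `Action` names are hypothetical, and it is assumed that an UNUSABLE frame leaves the DTX state unchanged, which the description implies but does not state outright.

```python
from enum import Enum, auto


class FrameType(Enum):
    GOOD_SPEECH = auto()
    UNUSABLE = auto()
    VALID_SID = auto()
    INVALID_SID = auto()


class DtxState(Enum):
    SPEECH_STATE = auto()
    CNI_STATE = auto()


class Action(Enum):
    # Hypothetical labels for what the voice decoder is asked to do.
    DECODE_SPEECH = auto()            # frame forwarded directly
    GENERATE_COMFORT_NOISE = auto()   # SID frame forwarded for comfort noise
    MUTE = auto()                     # "Muting Mechanism" for a lost frame
    INSERT_COMFORT_NOISE = auto()     # transmitter off, insert comfort noise


def dtx_step(frame_type, dtx_state):
    """Return (action, next_dtx_state) for one 20 ms voice frame."""
    if frame_type is FrameType.GOOD_SPEECH:
        return Action.DECODE_SPEECH, DtxState.SPEECH_STATE
    if frame_type in (FrameType.VALID_SID, FrameType.INVALID_SID):
        return Action.GENERATE_COMFORT_NOISE, DtxState.CNI_STATE
    # UNUSABLE: behaviour depends on the current DTX state.
    # Assumption: the DTX state itself is left unchanged in this case.
    if dtx_state is DtxState.SPEECH_STATE:
        return Action.MUTE, dtx_state
    return Action.INSERT_COMFORT_NOISE, dtx_state
```

Note that in this sketch a GOOD_SPEECH frame always switches the state to SPEECH_STATE, regardless of the previous state; the frame type alone decides the transition.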
A very irritating effect is obtained if a voice frame is mistakenly detected as GOOD_SPEECH in a break in speech (DTX state has the value CNI_STATE). In that case, this supposedly good voice frame is forwarded directly to the voice decoder and produces a cracking sound of greater or lesser volume (depending on its random content) at the output thereof. In addition, the supposedly good voice frame causes the DTX state to change to SPEECH_STATE (supposed start of a new period of speech). Since, in reality, the break in speech has not yet ended, however, the transmitter continues to be switched off, which is why the receiver will detect the frame type UNUSABLE again for the further voice frames. However, these voice frames with the frame type UNUSABLE result in the aforementioned “Muting Mechanism” in the DTX state SPEECH_STATE, i.e. the previously received supposedly valid voice frame is now also repeated and attenuated, which means that the aforementioned cracking sound (as a result of the repetition) is now also given a metallic character (“Bong”).
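The failure sequence just described can be traced with a compact, self-contained simulation of the standard handling. The names `step` and `trace` and the action strings are hypothetical, and the sketch assumes that an UNUSABLE frame does not change the DTX state.

```python
def step(frame_type, dtx_state):
    """One frame of the standard DTX handling: returns (action, next_state)."""
    if frame_type == "GOOD_SPEECH":
        return "decode", "SPEECH_STATE"
    if frame_type in ("VALID_SID", "INVALID_SID"):
        return "comfort_noise", "CNI_STATE"
    # UNUSABLE: assumed to leave the DTX state unchanged.
    if dtx_state == "SPEECH_STATE":
        return "mute", dtx_state
    return "comfort_noise", dtx_state


def trace(frames, state="CNI_STATE"):
    """Run a frame sequence, starting in a break in speech by default."""
    rows = []
    for frame_type in frames:
        action, state = step(frame_type, state)
        rows.append((frame_type, action, state))
    return rows


# A break in speech in which the third frame is falsely detected as GOOD_SPEECH.
frames = ["UNUSABLE", "UNUSABLE", "GOOD_SPEECH", "UNUSABLE", "UNUSABLE"]
for row in trace(frames):
    print(row)
```

In this trace the first two frames yield comfort noise, the falsely detected frame is decoded and moves the state to SPEECH_STATE, and the two following UNUSABLE frames are then muted rather than replaced by comfort noise, which corresponds to the repeated and attenuated output described above.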
To compensate for this weakness in the standard solution of the DTX handling, great effort has been made in the past in attempting to improve the basis for frame type determination (BFI, SID and TAF) outside the voice decoder. This has been done by evaluating additional parameters, such as equalizer or channel decoder results. However, this solution has the drawback that it needs to be simulated, implemented and verified afresh for each baseband chip. The actual problem, however, is the lack of robust error concealment in the full-rate voice decoder, which is not covered by the GSM standard.
For these and other reasons, there is a need for the present invention.