For convenience, various abbreviations used in this specification are presented here:
TFO   Tandem Free Operation
CNI   Comfort Noise Insertion
CN    Comfort Noise
BFH   Bad Frame Handling
UMS   Uplink Mobile Station
DMS   Downlink Mobile Station
UBS   Uplink Base Station
UTR   Uplink Transcoder
DTR   Downlink Transcoder
DBS   Downlink Base Station
AI    Air Interface
PCM   Pulse Code Modulation
PSTN  Public Switched Telephone Network
UAI   Uplink Air Interface
DAI   Downlink Air Interface
DTX   Discontinuous Transmission
VAD   Voice Activity Detection
Speech frames received by the mobile network from a mobile communication means can be roughly classified into three classes: a) uncorrupted, i.e. good speech frames; b) corrupted speech frames; and c) frames generated during discontinuous transmission (DTX) mode, which frames generally include silence descriptor (SID) frames and unusable frames received during the transmission pause.
In normal mode of operation, a mobile unit encodes the speech to be transmitted, and the encoded speech is decoded after transmission through the air interface. When a mobile unit receives a call, the speech is encoded at the network side of the air interface, and decoded in the receiving mobile unit. Therefore, in normal mode of operation without special arrangements taking place, speech is encoded and decoded twice in a mobile-to-mobile call, resulting in a decrease of perceived speech quality. Tandem free operation (TFO) is a mode of operation between two mobile units, in which the speech is encoded only once, and the speech is transmitted in the encoded form over the network to the receiving mobile unit.
Since it is not feasible to send the error indication information contained in erroneous frames and the side information contained in DTX frames through the mobile network to the receiving end, it has been found feasible in GSM to transmit during TFO operation all frames over the A-interface as good frames. The A-interface is the interface between the transmitting and receiving mobile networks. In conventional non-TFO operation, the speech is transmitted over the A-interface as a digital real-time waveform in the form of PCM-coded samples.
A so-called bad frame handling procedure is used in converting erroneous frames received from the mobile communication means to good frames for transmission over the A-interface. In order to send comfort noise information contained in DTX frames over the A-interface, the comfort noise information has to be converted into good speech frames for transmission over the A-interface.
Comfort noise insertion is discussed first in more detail in the following paragraphs, then bad frame handling.
Comfort Noise Insertion
In Discontinuous Transmission (DTX), a Voice Activity Detector (VAD) detects on the transmit side whether or not the user is speaking. When the user is speaking, speech parameters descriptive of the input speech are produced in the speech encoder for each frame and transmitted to the receiving end. However, when the user stops speaking, parameters descriptive of the prevailing background noise are produced and transmitted to the receive side instead of the speech parameters. After this, the transmission is switched off. The transmission is resumed at the normal transmission rate when the user starts speaking again, or at a low rate to update the parameters describing the background noise while the user does not speak in order to adapt to changes occurring in the prevailing background noise during the transmission pause. Throughout this text, these parameters describing the prevailing background noise are referred to as comfort noise parameters or CN parameters.
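By way of illustration only, the transmit-side behaviour described above can be sketched as follows. The function name, the frame labels and the update interval of four frames are assumptions made for this example. They are not taken from any GSM specification, and details of a real implementation are omitted.

```python
from typing import List


def dtx_schedule(vad_flags: List[bool], update_interval: int = 4) -> List[str]:
    """Decide what is transmitted for each frame (hypothetical sketch).

    vad_flags[i] is True when the VAD judges frame i to contain speech.
    update_interval is an assumed number of frames between CN parameter
    updates during a transmission pause.
    """
    transmitted = []
    silent_frames = 0  # silent frames seen since the last speech frame
    for active in vad_flags:
        if active:
            # User is speaking: normal speech parameters are sent.
            silent_frames = 0
            transmitted.append("SPEECH")
        else:
            # First silent frame, and every update_interval-th frame after
            # it, carries CN parameters; otherwise the transmission is off.
            if silent_frames % update_interval == 0:
                transmitted.append("SID")
            else:
                transmitted.append("OFF")
            silent_frames += 1
    return transmitted


if __name__ == "__main__":
    vad = [True, True, False, False, False, False, False, True]
    print(dtx_schedule(vad))
```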
At the receiving end, speech is synthesised whenever good speech parameter frames are received. However, when comfort noise parameters have been received, after which the transmission has been switched off, the speech decoder uses the received comfort noise parameters to locally synthesise noise with characteristics similar to the background noise on the transmit side. This synthetic noise is commonly referred to as Comfort Noise (CN), and the procedure of generating CN locally on the receive side is commonly referred to as Comfort Noise Insertion (CNI).
The updated comfort noise parameters are applied to the CNI procedure either immediately when received, or by gradually interpolating frame-by-frame from the previously received comfort noise parameter values to the updated parameter values. The former method guarantees that the comfort noise parameters are always as fresh as possible. However, the former method may result in stepwise effects in the perceived CN characteristics, and thus the latter method of interpolation is often used to alleviate this inconvenience. The latter method has the drawback that the interpolation of the received comfort noise parameters introduces some delay in the characterisation of the prevailing background noise, thereby introducing some contrast between the actual background noise and the CN.
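The two update methods can be contrasted with a minimal sketch. Linear interpolation over a fixed number of frames is an assumption made for this example, as are the function names; the GSM 06.12 specification defines the actual procedure.

```python
from typing import List


def apply_immediately(previous: List[float], updated: List[float],
                      n_frames: int) -> List[List[float]]:
    """Former method: the updated CN parameters are used from the first frame."""
    return [list(updated) for _ in range(n_frames)]


def interpolate(previous: List[float], updated: List[float],
                n_frames: int) -> List[List[float]]:
    """Latter method: move linearly from previous to updated over n_frames."""
    frames = []
    for k in range(1, n_frames + 1):
        w = k / n_frames  # weight of the updated values, reaches 1.0 at the end
        frames.append([(1 - w) * p + w * u for p, u in zip(previous, updated)])
    return frames


if __name__ == "__main__":
    old, new = [0.0, 10.0], [4.0, 2.0]
    print(apply_immediately(old, new, 4))  # step change in the first frame
    print(interpolate(old, new, 4))        # updated values reached in frame 4
```

In the interpolated case the updated values are reached only in the last frame, which corresponds to the delay mentioned above.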
Details of comfort noise insertion are described in the ETSI specification ETS 300 580-4, “European digital cellular telecommunications system (Phase 2); Comfort noise aspect for full rate speech traffic channels (GSM 06.12)”, September 1994, which is hereinafter called the GSM 06.12 specification.
Bad Frame Handling
Bad frame handling (BFH) refers to a substitution procedure for frames containing errors. The purpose of the frame substitution is to conceal the effect of corrupted frames, since normal decoding of corrupted or lost speech frames would result in very unpleasant noise effects. In order to improve the subjective quality of the received speech, the first lost speech frame is substituted with either a repetition or an extrapolation of the previous good speech frames. Corrupted speech frames are not transmitted to the receiving end. If a number of consecutive frames are lost, the output of the speech decoder is gradually muted in order to indicate to the user that there are problems in the connection. The frame substitution procedure is discussed in the ETSI specification draft prETS 300 580-3, “Digital cellular telecommunications system; Full rate speech; Part 3: Substitution and muting of lost frames for full rate speech channels (GSM 06.11 version 4.0.5)”, November 1997, which is hereinafter called the GSM 06.11 specification.
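The substitution and muting behaviour can be sketched as follows. The frame layout (a single gain value plus a tuple of other parameters) and the attenuation factor of 0.5 per frame are assumptions made for this example and do not reproduce the procedure of the GSM 06.11 specification.

```python
from dataclasses import dataclass, replace
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Frame:
    gain: float              # hypothetical overall level parameter
    params: Tuple[int, ...]  # remaining speech parameters (hypothetical)


def conceal(frames: List[Optional[Frame]],
            attenuation: float = 0.5) -> List[Frame]:
    """Replace lost frames (None) by the last good frame, muted gradually."""
    output = []
    last_good = Frame(gain=0.0, params=())  # silence until a good frame arrives
    lost_run = 0                            # consecutive lost frames so far
    for frame in frames:
        if frame is not None:
            last_good = frame
            lost_run = 0
            output.append(frame)
        else:
            lost_run += 1
            # First lost frame: plain repetition (attenuation ** 0 == 1).
            # Each further lost frame is attenuated once more.
            gain = last_good.gain * attenuation ** (lost_run - 1)
            output.append(replace(last_good, gain=gain))
    return output


if __name__ == "__main__":
    received = [Frame(8.0, (1, 2)), None, None, None, Frame(4.0, (3,))]
    for f in conceal(received):
        print(f)
```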
Mobile to Mobile Calls
In the following, the flow of the speech data during a normal, non-TFO connection is discussed. The case of TFO operation is discussed after that.
The basic block diagram of the mobile to mobile call is illustrated in FIG. 1. In an Uplink Mobile Station (UMS) 100, i.e. the mobile station in the transmitting end, the time-domain waveform is first divided into fixed-length frames and speech encoded in a speech coding block 101, i.e., transformed to speech coding parameters, which are then channel encoded in a channel coding block 102 by inserting redundant information for error correction purposes. These protected speech frames are then transmitted over the air interface (AI).
In an Uplink Base Station (UBS) 110, the channel decoding is performed in the channel decoding block 111, i.e., the channel errors are corrected and the redundant information is removed from the speech coding parameters. The speech coding parameters are transmitted through a serial Uplink Abis interface to an Uplink Transcoder (UTR) 120, where the speech coding parameters are transformed to a digital time-domain speech waveform in a speech decoding block 122. In normal non-TFO mode, the switch 121 is open as shown in FIG. 1; and the speech waveform is passed through a TFO packing block 123 essentially unchanged. The output of the UTR is transmitted through the A-interface to a public switched telephone network (PSTN) or to another mobile telephone network.
In a Downlink Transcoder (DTR) 130, the time-domain waveform is received from the A-interface. In non-TFO operation, the switch 133 connects the output of the speech encoding block 132 to the output of the DTR, and the TFO extracting block 131 passes through the time-domain waveform unchanged. The waveform is transformed to speech coding parameters in the speech encoding block 132. The speech coding parameters are forwarded to the Downlink Abis interface.
In the downlink base station (DBS) 140, the speech parameters received from the Downlink Abis interface are channel encoded in the channel encoding block 141. The channel encoded parameters are transmitted to a Downlink Mobile Station (DMS) 150, i.e. the receiving mobile station. In the DMS, the channel coding is removed in a channel decoding block 151 and the speech coding parameters are transformed back to a time-domain waveform, i.e. decoded speech, in the speech decoding block 152.
The problem in the conventional mode described above is the negative effect of two consecutive encodings on the quality of the transmitted speech signal. Since the encoding of the waveform in the speech encoding block 132 of the Downlink Transcoder (DTR) 130 is the second successive compression to the original input signal, the parameters in the output of the speech encoder 132 of the DTR 130 represent a time-domain waveform which is not a very accurate reproduction of the original speech waveform due to the errors created in two compressions. The tandem-free operation (TFO) was designed to alleviate this problem in at least some cases.
Tandem-Free Operation
In a mobile station to mobile station telephone call utilising a tandem-free mode of operation, hereinafter referred to as TFO, speech is transmitted by sending the parameters representing the time-domain speech waveform from an uplink mobile station speech encoder directly to a downlink mobile station speech decoder, without converting the parameters into a time-domain speech waveform in between the uplink transcoder and the downlink transcoder.
This significantly improves the speech quality because, without TFO, the original speech signal is coded twice with a lossy speech compression algorithm, which degrades the speech quality each time the compression is applied. The difference between single encoding and tandem encoding becomes even more important when the bit-rate of a speech codec is very low. The old high bit-rate speech coding standards, as exemplified by the G.711 standard of 64 kbit/s PCM coding, are very robust to successive coding. However, the state of the art speech coders operating in a range of 4 kbit/s to 16 kbit/s are quite sensitive to more than one successive coding.
The tandem-free operation according to prior art is discussed in the following with reference to FIG. 1. In tandem-free operation, the speech parameters received by the speech decoding block 122 of the uplink transcoder 120 are embedded into the least significant bits of the decoded speech waveform in the TFO packing block 123, which is indicated in FIG. 1 by the closed position of the switch 121. The speech waveform with the embedded speech parameters is then forwarded to the A-interface.
In order to enable the TFO mode, the downlink end of the call must naturally be in a mobile telephone network using the same speech coding standard as the uplink end. However, the call may be forwarded from the A-interface through several digital transmission links to the downlink mobile telephone network.
In the receiving end, the speech waveform with the embedded speech parameters is received from the A-interface by the downlink transcoder 130. The TFO extracting block 131 extracts the embedded speech parameters from the speech waveform. In TFO operation, the switch 133 connects the output of the TFO extracting block to the output of the downlink transcoder. The extracted original parameters are then forwarded to the downlink Abis interface and further via the downlink base station 140 through the air interface to the downlink mobile station, whose speech decoding block 152 then decodes the original speech parameters as encoded by the speech encoding block of the uplink mobile station 100.
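The packing and extracting steps can be illustrated with a minimal sketch in which one parameter bit replaces the least significant bit of each 8-bit PCM sample. The use of exactly one bit per sample, the absence of any framing or synchronisation pattern, and the function names are assumptions made for this example. The actual TFO frame structure is not reproduced here.

```python
from typing import List


def tfo_pack(samples: List[int], bits: List[int]) -> List[int]:
    """Embed parameter bits into the LSBs of PCM samples (values 0..255)."""
    if len(bits) > len(samples):
        raise ValueError("not enough samples to carry all parameter bits")
    packed = list(samples)
    for i, bit in enumerate(bits):
        # Clear the least significant bit, then set it to the parameter bit.
        packed[i] = (packed[i] & ~1) | (bit & 1)
    return packed


def tfo_extract(samples: List[int], n_bits: int) -> List[int]:
    """Recover the embedded parameter bits from the first n_bits samples."""
    return [s & 1 for s in samples[:n_bits]]


if __name__ == "__main__":
    pcm = [200, 13, 77, 254]
    parameter_bits = [1, 0, 1, 1]
    stream = tfo_pack(pcm, parameter_bits)
    print(stream)                  # each sample changes by at most 1
    print(tfo_extract(stream, 4))  # the original parameter bits
```

Because only the least significant bit is altered, each sample changes by at most one quantisation step, so the waveform remains usable by equipment that is unaware of the embedded parameters.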
Sometimes there are detected and undetected errors in the air interface. These errors and the BFH operations can cause some mismatch between the parameters of speech encoder 101 of the transmitting mobile station and speech decoder 152 of the receiving mobile station. Usually these mismatches are diminished after the correct parameters have been received for several consecutive frames.
BFH and CNI Handling in Tandem Free Operation
Usually the functionality for bad frame handling and comfort noise insertion in the transmitting end is located in the speech decoder block 122 of the uplink transcoder 120. These functions are not illustrated in FIG. 1. When any speech frames are corrupted or lost, or DTX transmission pauses occur, the speech decoder block 122 generates speech coding parameters corresponding to these situations as described previously.
As can be observed from FIG. 1, the UMS 100, UBS 110, DBS 140 and the DMS 150 are not involved in the TFO operations concerning the BFH and CNI, but operate transparently as in the non-TFO case. The speech encoder 132 of the DTR operates normally during TFO as well, except that its output is not forwarded to the downlink Abis interface, but is replaced with the speech coding parameters extracted from the A-interface stream instead. The operations concerning the BFH and CNI take place in the speech decoder 122 of the UTR 120.
A more detailed block diagram of the prior art speech decoder 122 realizing the CNI and BFH functions is shown in FIG. 2. The encoded speech parameters, i.e. the parameter quantisation indices, are extracted from the received information stream in parameter extracting blocks 122a. The BFH and CNI operations are performed on these parameter quantisation indices in BFI/CNI blocks 122b prior to the dequantisation (decoding) of the indices in dequantisation blocks 122c. After dequantisation, the parameters are used in speech synthesis in a speech synthesis block 122d to produce the decoded output signal. The BFI and CNI flags are signals produced by the uplink base station 110, which signals inform the decoder 122 about corrupted and DTX frames. The BFI/CNI blocks 122b are controlled by the BFI and CNI flags.
A similar block diagram with prior art TFO functionality is shown in FIG. 3, which shows a diagram of the speech decoder 122 of a UTR 120 as well as the TFO packing block 123. As can be observed from FIG. 3, the CNI and the BFH operations are performed on the parameter quantisation indices in the speech decoder 122. Therefore the tandem free operations in the UTR 120 are simply effected by packing (embedding) the already available parameters from the decoder 122 into the time-domain waveform signal.
BFH operations during tandem free operation are straightforward, and can be effected in the same way as in non-TFO mode. The GSM 06.11 specification contains an example prior art solution of the BFH functionality, which can also be used during tandem free operation. The CNI operations are simple because the quantisations are memoryless, which means that all information during comfort noise generation or in the transitions between active speech and comfort noise is contained in the currently transmitted parameters. There are no problems for example in the resetting of the different parts of the transmission path. The prior art CNI solution is described in the specification GSM 06.12.
In tandem free operation, the parameter information packed into the signal transmitted to the A-interface must include all information needed to produce good speech frames, since the downlink mobile station is not aware of the CNI operation at the uplink end. Due to this requirement, a simple conversion is performed on the comfort noise parameters to convert them to speech parameter frames. This involves storing the most recent comfort noise parameters, and repeatedly forwarding them to the A-interface stream until updated comfort noise parameters are received and stored, or until active speech parameters are received. In case comfort noise parameter interpolation is desired as discussed earlier, this interpolation can be performed prior to forwarding the parameters to the A-interface stream. Since comfort noise parameters do not include all parameters present in a good speech parameter frame, these missing speech parameters need to be created in some way during the conversion process.
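The store-and-repeat conversion can be sketched as follows. The parameter names ("level", "pitch_lag", "pitch_gain"), the frame kinds and the fill values used for the missing speech parameters are hypothetical and chosen only for this example; the source does not specify how the missing parameters are created.

```python
from typing import Dict, Optional

# Assumed fill values for parameters that a CN frame does not carry.
DEFAULT_FILL: Dict[str, int] = {"pitch_lag": 0, "pitch_gain": 0}


class CnToSpeechConverter:
    """Turns every frame slot into a complete speech parameter frame."""

    def __init__(self) -> None:
        self._stored_cn: Optional[Dict[str, int]] = None

    def process(self, kind: str,
                params: Optional[Dict[str, int]] = None) -> Dict[str, int]:
        """kind is "speech", "sid" (CN update) or "none" (transmission off)."""
        if kind == "speech":
            # Active speech parameters are forwarded unchanged.
            return dict(params or {})
        if kind == "sid":
            # Updated CN parameters replace the stored ones.
            self._stored_cn = dict(params or {})
        if self._stored_cn is None:
            raise ValueError("no comfort noise parameters received yet")
        # Repeat the most recent CN parameters, completed with fill values.
        return {**DEFAULT_FILL, **self._stored_cn}


if __name__ == "__main__":
    conv = CnToSpeechConverter()
    print(conv.process("sid", {"level": 3}))
    print(conv.process("none"))              # same frame repeated
    print(conv.process("sid", {"level": 5}))  # updated CN parameters
```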
Problems Inherent in the Prior Art Solutions
FIG. 3 shows a decoder using conventional non-predictive quantisers. When the quantisers of the decoder are non-predictive as in FIG. 3, BFH and CNI processing of the parameters do not create any problems. However, it is predictive quantisers that are used in the state of the art low rate encoders and decoders.
In a state of the art speech codec employing predictive quantisers, comfort noise insertion and bad frame handling operations have to be performed using the dequantised (decoded) parameters in the speech decoder, i.e. after the dequantiser blocks 122c and not before them as shown in FIG. 3. The reason for this is that in predictive quantising and dequantising, the quantised entities (in this case, speech parameters) are not independent. When evaluating (decoding) predictively quantised entities, the evaluation result for each evaluated entity does not depend only on the quantised entity under evaluation, but also on the previous entities. Therefore, simple substitution of corrupted encoded parameters with suitable CN or BFH parameters is not possible. The substitution would have to adjust the substituting CN or BFH parameters according to the previously received good parameters, but since there is no knowledge of the development of the signal during the transmission pause or disturbance, the next good parameters received would depend on a different history from that generated in the decoder, resulting in very annoying sound artifacts at the end of the pause. Therefore, CNI and BFH operations are effected after predictive dequantisation on the decoded speech parameters, and coded speech parameters corresponding to CNI or BFH blocks are not available. Since the coded parameters describing CNI or BFH blocks are not available, they cannot be embedded in the time-domain speech waveform along with the rest of the coded parameters. Because of this problem, CNI and BFH operations are not possible in prior art tandem free operation when the uplink mobile station uses a speech codec with predictive quantisers.
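The dependence on history can be made concrete with a toy example. The first-order predictor with coefficient 0.5 and unit step size is an assumption made for this illustration and is not the quantiser of any actual speech codec. The sketch compares what happens when a single quantisation index is substituted before dequantisation in a memoryless and in a predictive scheme.

```python
from typing import List


def memoryless_decode(indices: List[int], step: float = 1.0) -> List[float]:
    """Each decoded value depends only on its own index."""
    return [step * i for i in indices]


def predictive_decode(indices: List[int], a: float = 0.5,
                      step: float = 1.0) -> List[float]:
    """Each decoded value depends on its index and on the previous value."""
    values = []
    state = 0.0  # decoder memory: the previously decoded value
    for i in indices:
        state = a * state + step * i
        values.append(state)
    return values


def mismatched_frames(reference: List[float], decoded: List[float]) -> List[int]:
    """Frame numbers where the decoded value differs from the reference."""
    return [n for n, (r, d) in enumerate(zip(reference, decoded)) if r != d]


if __name__ == "__main__":
    sent = [4, 2, 1, 3]      # indices produced by the encoder
    received = [4, 0, 1, 3]  # index of frame 1 substituted before decoding
    print(mismatched_frames(memoryless_decode(sent),
                            memoryless_decode(received)))  # only frame 1
    print(mismatched_frames(predictive_decode(sent),
                            predictive_decode(received)))  # frames 1, 2 and 3
```

In the predictive case, frames 2 and 3 are decoded incorrectly although their indices were received intact, because the decoder memory no longer matches the history assumed by the encoder.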