In modern speech encoding techniques, speech codecs process the speech signal in periods, which are called speech frames or just frames. Here the term codec means the arrangement by which speech can be encoded. Preferably it comprises an encoding algorithm and means for applying that algorithm to a speech signal. A typical frame length of a speech codec is 20 ms, which corresponds to 160 samples at a sampling frequency of 8 kHz. Frame lengths generally vary from 10 ms to 30 ms. Each speech frame is processed in a speech encoder, and certain encoding parameters are formed from these frames and transmitted to the decoder. The decoder forms a synthesized speech signal by means of those parameters.
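The relationship between frame length, sampling frequency and sample count stated above can be illustrated with a short sketch. The function names samples_per_frame and split_into_frames are hypothetical and serve only as an illustration; dropping a trailing partial frame is an assumption of this sketch, not something the text specifies.

```python
def samples_per_frame(frame_ms: float, sample_rate_hz: int) -> int:
    """Return the number of samples in one frame of the given duration."""
    return round(frame_ms * sample_rate_hz / 1000)


def split_into_frames(samples, frame_len):
    """Split a sample sequence into consecutive full frames.

    Assumption of this sketch: a trailing partial frame is dropped.
    """
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, frame_len)]
```

For example, samples_per_frame(20, 8000) gives the 160 samples mentioned above, and the 10 ms and 30 ms frame lengths correspond to 80 and 240 samples at the same sampling frequency.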
In digital cellular radiotelephony systems, such as the GSM (Global System for Mobile communications), a discontinuous transmission method (DTX, Discontinuous Transmission), which is also defined in many speech encoding standards, is generally used. The discontinuous transmission method generally means that the transmitter part of the terminal is switched off for most of the time when the user does not speak, i.e. when the terminal has nothing to transmit. The purpose of this is to reduce the average power consumption of the terminal and to improve the utilization of radio frequencies, because transmitting a signal that carries only silence causes unnecessary interference with other simultaneous radio connections. According to some research, only 40% of the data transmitted contains actual speech data. The rest is silence or background noise. Thus a discontinuous transmission method, in which frames that do not contain actual speech are removed, provides many advantages. Firstly, the processing load of the encoder can be reduced, because the “redundant” frames are not encoded at all. Secondly, when the number of frames to be transmitted is reduced, the power consumption of the device is also reduced. Furthermore, the load on the network can be reduced when “redundant” frames are removed from the data to be transmitted.
An operation called Voice Activity Detection (VAD) is used for speech detection in a discontinuous transmission method. For example, a voice activity detector is arranged to examine each frame to be transmitted, and on the basis of the examination it is concluded whether the frame contains speech data or not. The operation of the voice activity detector is based on its internal variables, and the output of the detector is preferably one bit, which is called the VAD flag. Value 1 of the VAD flag corresponds to a situation where there is speech to be processed, and value 0 to a situation where the user is silent. Thus when the flag is set, the frame contains speech data and it can be transmitted. Correspondingly, when the VAD flag is not set, the frame can be entirely removed.
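The one-bit VAD decision described above can be sketched as follows. The text does not specify how the detector reaches its decision, so the mean-energy criterion, the threshold value and all function names here are assumptions made purely for illustration; practical detectors rely on more elaborate internal variables.

```python
def frame_energy(frame):
    """Mean squared amplitude of the samples in one frame."""
    return sum(s * s for s in frame) / len(frame)


def vad_flag(frame, threshold=1000.0):
    """Return 1 if the frame is judged to contain speech, otherwise 0.

    Assumption: a fixed energy threshold stands in for the detector's
    internal variables, which the text does not describe.
    """
    return 1 if frame_energy(frame) > threshold else 0


def select_frames_for_transmission(frames, threshold=1000.0):
    """Keep only the frames whose VAD flag is 1; the others are removed."""
    return [f for f in frames if vad_flag(f, threshold) == 1]
```

In this sketch a frame with VAD flag 0 is simply dropped, which corresponds to the plain removal of non-speech frames described in the paragraph above.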
The discontinuous transmission method has one disadvantage. When the transmission is interrupted, the background noise that is present in the frames that contain speech also disappears. This may cause a very unpleasant effect at the receiving end. In a discontinuous transmission method, the interruption of the transmission may take place quickly and at irregular intervals, whereby the listener at the receiving end experiences the quickly changing sound level as disturbing. Especially when the level of the background noise is high, the interruption of the transmission may even make it more difficult to understand the speech. Therefore it is advantageous to produce in the receiver, even when no frames are transmitted to the receiving end, synthetic noise that resembles the background noise of the transmitter. This synthetic noise is called Comfort Noise (CN).
The production of comfort noise takes place, for example, as follows. When the value of the VAD flag changes from one to zero, the level of the actual background noise is first estimated by means of some frames that contain background noise. The element that decides on the discontinuous transmission mode transmits these few frames to the receiver as speech frames. This period, when the speech burst has ended but the transmission of speech frames has not yet been switched off, is called a hangover period. The frames that are transmitted during the hangover period contain only data caused by background noise, whereby the parameters of the comfort noise can be safely determined by means of these frames. A Silence Descriptor (SID) frame is advantageously used for transmitting the comfort noise parameters to the receiver. The values of the parameters of the SID frames are updated regularly, and at least when the level of the background noise changes. In practice, the SID frame can be used in at least the following two ways. In the first, a SID frame is transmitted immediately after the hangover period, and after this SID frames are transmitted regularly. An arrangement like this is used in the speech codecs of the GSM system, for example. In the second, a SID frame is transmitted immediately after the hangover period, but the next SID frame is transmitted only when the encoder detects a change in the characteristics of the background noise.
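The first of the two SID arrangements described above can be sketched as a simple per-frame schedule derived from the VAD flags. The function name dtx_schedule, the labels "SPEECH", "SID" and "NONE", and the default values of hangover_frames and sid_interval are hypothetical; they are illustrative only and are not taken from any standard.

```python
def dtx_schedule(vad_flags, hangover_frames=3, sid_interval=8):
    """Decide for each frame what is transmitted.

    Returns one label per input flag:
      "SPEECH" - frame sent as a speech frame (speech or hangover period)
      "SID"    - Silence Descriptor frame carrying comfort noise parameters
      "NONE"   - nothing is transmitted
    """
    schedule = []
    silent_run = 0  # consecutive frames with VAD flag 0, including the current one
    for flag in vad_flags:
        if flag == 1:
            silent_run = 0
            schedule.append("SPEECH")
            continue
        silent_run += 1
        if silent_run <= hangover_frames:
            # Hangover period: background noise is still sent as speech frames.
            schedule.append("SPEECH")
        else:
            # First SID immediately after the hangover, then at regular intervals.
            frames_since_hangover = silent_run - hangover_frames - 1
            if frames_since_hangover % sid_interval == 0:
                schedule.append("SID")
            else:
                schedule.append("NONE")
    return schedule
```

The second arrangement would replace the regular interval test with a check for a change in the background noise characteristics, which this sketch does not model.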
In an ideal situation, both the transmitting terminal and the receiving terminal use the same speech encoding method. In a case like this, the encoded speech need not be converted to suit some other encoding method. However, in practice such a conversion is often necessary. In that case, the encoded speech data is re-encoded with a different encoding method by means of a transcoder. The transcoder can be located at any point of the signal path between the transmitter and the receiver.
The prior art transcoders are typically implemented in the manner shown in FIG. 1. The input of the transcoder consists of the input parameters 101 transmitted by the transmitter. The discontinuous transmission reception block 102 of the transcoder has been arranged to estimate whether the parameters received contain speech or comfort noise. Information about the contents of the frame is transmitted to the speech decoder 104 by means of the SP (Speech Present) flag 103, for example. In addition, the frame itself is also transmitted to the speech decoder 104. The decoding method of the frame depends on the value of the SP flag 103. After decoding, the synthesized speech or comfort noise is transferred to the internal buffer circuit 105 of the transcoder. The recoding of the contents of the buffer circuit 105 is started when the buffer circuit 105 contains a sufficient amount of data. When data is recoded, the voice activity detector 106 is first used to examine whether the frame contains speech or background noise. On the basis of the nature of the data contained in the frame, the voice activity detector 106 forms a VAD flag 107 and gives it a value. It also forwards the value of the VAD flag 107, together with the unmodified frame it received, to the speech encoder 108. The value of the VAD flag 107 is also given to the transmitter unit 110 of the transcoder. The speech encoder 108 processes the data coming to it and transmits the parameters 109 of the encoded data to the transmitter unit 110. On the basis of the values of the VAD flags 107 it has received, the transmitter unit 110 checks which frames are to be transmitted to the network and which are not. So that the receiver block of the terminal receiving the signal keeps generating comfort noise, some frames containing comfort noise can also be transmitted to the receiver, and the parameters of these comfort noise frames are updated in the speech encoder 108 when required.
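The data flow of FIG. 1 can be summarized in a minimal sketch. The function transcode_prior_art and its parameters decode, vad and encode are hypothetical placeholders for the speech decoder 104, the voice activity detector 106 and the speech encoder 108. As a simplification, the sketch transmits only frames whose VAD flag is 1 and leaves out the comfort noise frames that may also be sent; it also fills the whole buffer before recoding, whereas the text only requires a sufficient amount of data.

```python
def transcode_prior_art(received_frames, decode, vad, encode):
    """Sketch of the prior art transcoder of FIG. 1.

    received_frames: iterable of (params, sp_flag) pairs, where sp_flag is
    the SP flag 103 formed by the reception block 102.
    decode, vad, encode: caller-supplied callables (assumptions of this sketch).
    """
    # Decoder side: decode every frame according to its SP flag and
    # store the synthesized speech or comfort noise in the buffer 105.
    buffer = [decode(params, sp_flag) for params, sp_flag in received_frames]

    transmitted = []
    for pcm in buffer:
        flag = vad(pcm)              # voice activity detector 106 forms VAD flag 107
        encoded = encode(pcm, flag)  # speech encoder 108 forms parameters 109
        if flag == 1:                # transmitter unit 110 selects frames to send
            transmitted.append(encoded)
    return transmitted
```

The call to vad inside the loop is the point at which voice activity detection runs on the decoded signal within the transcoder, in addition to the detection already performed in the transmitting terminal.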
The problem with the prior art solutions is that voice activity detection is performed twice: first in the encoder circuit of the transmitting terminal and then again in the transcoder. In practice, this means that unnecessary computation is carried out when speech data is transmitted, because in prior art solutions the same voice activity detection procedure is performed twice on the same data flow.