Cellular communication networks are commonplace today. Cellular communication networks typically operate in accordance with a given standard or specification. For example, the standard or specification may define the communication protocols and/or parameters that shall be used for a connection. Examples of the different standards and/or specifications include, without limitation, GSM (Global System for Mobile communications), GSM/EDGE (Enhanced Data rates for GSM Evolution), AMPS (Advanced Mobile Phone System), WCDMA (Wideband Code Division Multiple Access) or 3rd generation (3G) UMTS (Universal Mobile Telecommunications System), IMT 2000 (International Mobile Telecommunications 2000) and so on.
In a cellular communication network, voice data is typically captured as an analogue signal, digitised in an analogue to digital (A/D) converter and then encoded before transmission over the wireless air interface between a user equipment, such as a mobile station, and a base station. The purpose of the encoding is to compress the digitised signal and transmit it over the air interface with the minimum amount of data whilst maintaining an acceptable signal quality level. This is particularly important as radio channel capacity over the wireless air interface is limited in a cellular communication network. The sampling and encoding techniques used are often referred to as speech encoding techniques or speech codecs.
Often speech can be considered as bandlimited to between approximately 200 Hz and 3400 Hz. The typical sampling rate used by an A/D converter to convert an analogue speech signal into a digital signal is either 8 kHz or 16 kHz. The sampled digital signal is then encoded, usually on a frame by frame basis, resulting in a digital data stream with a bit rate that is determined by the speech codec used for encoding. The higher the bit rate, the more data is encoded, which results in a more accurate representation of the input speech frame. The encoded speech can then be decoded and passed through a digital to analogue (D/A) converter to recreate the original speech signal.
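The relationship between sampling rate, frame length and bit rate can be illustrated with a short calculation. The following is a minimal sketch only: the function names are hypothetical, the 20 ms frame length is the typical value mentioned later in this description, and the 16-bit sample width and the 12.2 kbit/s rate are assumptions chosen purely for illustration.

```python
def samples_per_frame(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of digitised samples contained in one speech frame."""
    return sample_rate_hz * frame_ms // 1000


def bits_per_frame(bit_rate_bps: int, frame_ms: int = 20) -> int:
    """Number of encoded bits produced per frame at a given codec bit rate."""
    return bit_rate_bps * frame_ms // 1000


def compression_ratio(sample_rate_hz: int, bits_per_sample: int,
                      bit_rate_bps: int) -> float:
    """Ratio of the raw digitised bit rate to the encoded bit rate."""
    return sample_rate_hz * bits_per_sample / bit_rate_bps


# Illustrative values: 8 kHz sampling, 20 ms frames, an assumed 16-bit
# sample width and an assumed 12.2 kbit/s encoding rate.
print(samples_per_frame(8000))    # samples in one 20 ms frame
print(bits_per_frame(12200))      # encoded bits in one 20 ms frame
print(round(compression_ratio(8000, 16, 12200), 2))
```

Under these assumptions a frame of 160 samples (2560 raw bits) is represented by 244 encoded bits, which indicates the scale of compression involved.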
An ideal speech codec will encode the speech with as few bits as possible thereby optimising channel capacity, while producing decoded speech that sounds as close to the original speech as possible. In practice there is usually a trade-off between the bit rate of the codec and the quality of the decoded speech.
In today's cellular communication networks, speech encoding can be divided roughly into two categories: variable rate and fixed rate encoding.
In variable rate encoding, a source based rate adaptation (SBRA) algorithm is used for classification of active speech. Speech of differing classes is encoded by different speech modes, each operating at a different rate. The speech modes are usually optimised for each speech class. An example of variable rate speech encoding is the enhanced variable rate speech codec (EVRC).
In fixed rate speech encoding, voice activity detection (VAD) and discontinuous transmission (DTX) functionality is utilised, which classifies speech into active speech and silence periods. During detected silence periods, transmission is performed less frequently to save power and increase network capacity. For example, in GSM during active speech every speech frame, typically 20 ms in duration, is transmitted, whereas during silence periods, only every eighth speech frame is transmitted. Typically, active speech is encoded at a fixed bit rate and silence periods with a lower bit rate.
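The transmission pattern described above may be sketched as follows. This is an illustrative sketch rather than the behaviour mandated by any standard: the function name `dtx_schedule` is hypothetical, the per-frame VAD decisions are assumed to be supplied by a separate detector, and the choice to transmit the first frame of each group of eight silence frames is an assumption.

```python
from typing import List


def dtx_schedule(vad_flags: List[bool], silence_interval: int = 8) -> List[bool]:
    """Decide, frame by frame, whether a frame is transmitted.

    vad_flags[i] is True when frame i was classified as active speech.
    Every active frame is transmitted. During a run of silence frames
    only one frame in every `silence_interval` is transmitted
    (assumption: the first frame of each group).
    """
    transmit = []
    silent_run = 0  # number of consecutive silence frames seen so far
    for active in vad_flags:
        if active:
            silent_run = 0
            transmit.append(True)
        else:
            transmit.append(silent_run % silence_interval == 0)
            silent_run += 1
    return transmit


# Three active frames, sixteen silence frames, then two active frames.
flags = [True] * 3 + [False] * 16 + [True] * 2
schedule = dtx_schedule(flags)
print(sum(schedule), "of", len(schedule), "frames transmitted")
```

In this example only two of the sixteen silence frames are transmitted, which is the source of the power and capacity saving.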
Multi-rate speech codecs, such as the adaptive multi-rate (AMR) codec and the adaptive multi-rate wideband (AMR-WB) codec, were developed to include VAD/DTX functionality and are examples of fixed rate speech encoding. The bit rate of the speech encoding, also known as the codec mode, is based on factors such as the network capacity and radio channel conditions of the air interface.
AMR was developed by the 3rd Generation Partnership Project (3GPP) for GSM/EDGE and WCDMA communication networks. It has also been envisaged that AMR will be used in future packet switched networks. AMR is based on Algebraic Code Excited Linear Prediction (ACELP) coding. The AMR and AMR-WB codecs provide 8 and 9 active bit rates respectively and also include VAD/DTX functionality. The sampling rate in the AMR codec is 8 kHz. In the AMR-WB codec the sampling rate is 16 kHz.
ACELP coding operates using a model of how the signal source is generated, and extracts from the signal the parameters of the model. More specifically, ACELP coding is based on a model of the human vocal system, where the throat and mouth are modelled as a linear filter and speech is generated by a periodic vibration of air exciting the filter. The speech is analysed on a frame by frame basis by the encoder and for each frame a set of parameters representing the modelled speech is generated and output by the encoder. The set of parameters may include excitation parameters and the coefficients for the filter as well as other parameters. The output from a speech encoder is often referred to as a parametric representation of the input speech signal. The set of parameters is then used by a suitably configured decoder to regenerate the input speech signal.
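The extraction of filter coefficients from a frame can be illustrated with a generic linear prediction analysis. The sketch below uses the autocorrelation method with the Levinson-Durbin recursion, which is a standard textbook technique; it is not presented as the procedure of any particular codec, it omits windowing, quantisation and the algebraic codebook search, and the function names are hypothetical.

```python
from typing import List, Tuple


def autocorrelation(frame: List[float], max_lag: int) -> List[float]:
    """Autocorrelation of the frame for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - lag] for i in range(lag, n))
            for lag in range(max_lag + 1)]


def levinson_durbin(r: List[float], order: int) -> Tuple[List[float], float]:
    """Solve the normal equations for the predictor coefficients.

    Returns (a, err), where a[k-1] weights sample s[n-k] in the prediction
    s[n] ~ a[0]*s[n-1] + ... + a[order-1]*s[n-order], and err is the
    energy of the remaining prediction residual.
    """
    a = [0.0] * (order + 1)  # a[0] is unused; a[1..order] are the coefficients
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            break  # degenerate (for example all-zero) frame
        # Reflection coefficient for this model order.
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Synthetic test frame: each sample is 0.9 times the previous one, so a
# first-order predictor with coefficient 0.9 describes it almost exactly.
frame = [0.9 ** n for n in range(200)]
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 2), 2)
print(coeffs, residual_energy)
```

The returned coefficients play the role of the filter parameters in the parametric representation; the excitation parameters would be derived separately from the prediction residual.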
Details of the AMR and AMR-WB codecs can be found in the 3GPP TS 26.090 and 3GPP TS 26.190 technical specifications. Further details of the AMR-WB codec and VAD can be found in the 3GPP TS 26.194 technical specification. All the above documents are incorporated herein by reference.
Both AMR and AMR-WB codecs are multi-rate codecs with independent codec modes or bit rates. In both the AMR and AMR-WB codecs, the mode selection is based on the network capacity and radio channel conditions. However, the codecs may also be operated using a variable rate scheme such as SBRA where the codec mode selection is further based on the speech class. The codec mode can then be selected independently for each analysed speech frame (at 20 ms intervals) and may be dependent on the source signal characteristics, average target bit rate and supported set of codec modes. The network in which the codec is used may also limit the performance of SBRA. For example, in GSM, the codec mode can be changed only once every 40 ms.
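The effect of such a network restriction on per-frame mode selection may be sketched as follows. The function name `constrain_mode_changes` is hypothetical, the requested modes are assumed to come from a separate selection algorithm, and the restriction is modelled here as a minimum spacing between mode changes, which is one possible interpretation of the 40 ms limit rather than the exact rule of any network.

```python
from typing import List


def constrain_mode_changes(requested_modes: List[int],
                           frame_ms: int = 20,
                           min_change_interval_ms: int = 40) -> List[int]:
    """Apply a minimum spacing between codec mode changes.

    requested_modes holds the mode (here expressed as a bit rate in bit/s)
    that the rate adaptation would like to use for each frame. A change is
    only accepted if at least min_change_interval_ms has elapsed since the
    previous change; otherwise the current mode is kept.
    """
    hold_frames = min_change_interval_ms // frame_ms
    applied = []
    current = None
    frames_since_change = hold_frames
    for mode in requested_modes:
        if current is None or (mode != current
                               and frames_since_change >= hold_frames):
            current = mode
            frames_since_change = 0
        applied.append(current)
        frames_since_change += 1
    return applied


# Illustrative bit rates only; the request for 7400 arrives one frame
# after the initial selection and is therefore not applied.
requested = [12200, 7400, 12200, 4750, 4750, 12200]
print(constrain_mode_changes(requested))
```

A rate adaptation algorithm operating under this restriction cannot follow the source signal frame by frame, which illustrates how the network may limit the achievable benefit.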
By using SBRA, the average bit rate may be reduced without any noticeable degradation in the decoded speech quality. The advantage of lower average bit rate is lower transmission power and hence higher overall capacity of the network.
Typical SBRA algorithms determine the speech class of the sampled speech signal based on speech characteristics. These speech classes may include low energy, transient, unvoiced and voiced sequences. The subsequent speech encoding is dependent on the speech class. Therefore, the accuracy of the speech classification is important as it determines the speech encoding and associated encoding rate. In previously known systems, the speech class is determined before speech encoding begins.
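A classification of this kind, performed on the sampled signal before encoding, might be sketched as follows. This is a deliberately simplified illustration and not the algorithm of any known system: the use of frame energy and zero-crossing rate as the speech characteristics, the function names and all threshold values are assumptions.

```python
import math
from typing import List, Optional


def frame_energy(frame: List[float]) -> float:
    """Mean squared amplitude of the frame."""
    return sum(s * s for s in frame) / len(frame)


def zero_crossing_rate(frame: List[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def classify_frame(frame: List[float],
                   prev_energy: Optional[float] = None,
                   energy_floor: float = 1e-4,      # hypothetical threshold
                   zcr_unvoiced: float = 0.3,       # hypothetical threshold
                   transient_ratio: float = 8.0) -> str:  # hypothetical
    """Assign one of the four speech classes named in the description."""
    energy = frame_energy(frame)
    if energy < energy_floor:
        return "low_energy"
    # A sharp rise in energy relative to the previous frame is treated
    # as a transient (for example a speech onset).
    if prev_energy is not None and prev_energy > 0 \
            and energy / prev_energy > transient_ratio:
        return "transient"
    # Noise-like unvoiced sounds change sign far more often than
    # periodic voiced sounds.
    if zero_crossing_rate(frame) > zcr_unvoiced:
        return "unvoiced"
    return "voiced"


# One 20 ms frame of a 100 Hz tone sampled at 8 kHz (160 samples).
tone = [0.5 * math.sin(2 * math.pi * 100 * n / 8000) for n in range(160)]
print(classify_frame(tone))
```

Each class would then be mapped to an encoding mode, so any misclassification at this stage directly affects the bit rate and quality of the encoded frame.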
Furthermore, the AMR and AMR-WB codecs may utilise SBRA together with VAD/DTX functionality to lower the bit rate of the transmitted data during silence periods. During periods of normal speech, standard SBRA techniques are used to encode the data. During silence periods, VAD detects the silence and interrupts transmission (DTX) thereby reducing the overall bit rate of the transmission.
Although effective, SBRA algorithms are very complex and require a large amount of memory and resources to implement. As such, their usage has so far been limited due to the substantial overheads.
It is the aim of embodiments of the present invention to provide an improved speech encoding method that at least partly mitigates some of the above problems.