Modern wireless communication systems such as GSM (Global System for Mobile Communications) and UMTS (Universal Mobile Telecommunications System) transfer various types of data over the air interface between network elements such as a base station and a mobile terminal. As the demand for transfer capacity rises continuously, e.g. because new multimedia services become available, and as radio frequencies are nowadays a scarce resource, new and more efficient data compression techniques have to be developed. Data compression is traditionally also used for reducing storage space requirements in computer data systems, for example. Likewise, different methods for picture, video, music and speech coding have been developed during the last few decades.
Data is usually compressed (i.e. compacted) by an encoder and subsequently regenerated by a decoder whenever it is needed. Data coding techniques may be classified in a number of ways. One classification is based on the result the (en)coder produces: a lossless encoder compacts the source data without losing any information, i.e. the decoded data matches the un-encoded data perfectly, whereas a lossy coder produces a compacted representation of the source data whose decoding result no longer corresponds completely to the original. However, data loss is not a problem in situations wherein the user of the data either cannot distinguish the original from the once compacted data, or the differences at least do not cause severe difficulties or objection in exploiting the slightly degraded data. As human senses, including hearing and vision, are somewhat limited, it is possible, for example, to remove unnecessary details from pictures, video or audio signals without considerably disturbing the final sensation.
Source coders often produce fixed-rate output, meaning that the compaction ratio does not depend on the input data. Alternatively, a variable-rate coder takes the statistics of the input signal into account while analysing it and thus outputs compacted data at a variable rate. Variable-rate coding has certain benefits over fixed-rate models. In the field of speech coding, for example, a variable-rate codec (coder-decoder) can maximise the capacity and minimize the average bit-rate for a given speech quality. This originates from the non-stationarity (or quasi-stationarity) of a typical human speech signal: as coders process a certain period of speech at a time, a single speech segment may comprise either a very homogeneous signal (e.g. a periodically repetitive voiced sound) or a strongly fluctuating signal (transitions etc.), which directly affects the minimum number of bits required for a sufficient representation of the segment under analysis. In addition, especially in mobile networks, the savings achieved in source coding may be used for enhancing e.g. channel coding, thus resulting in better tolerance against interference on the radio path. Fixed-rate coders always need to operate at a compromise rate that is low enough to save transmission capacity but high enough to code difficult segments with adequate quality, the compromise rate obviously being unnecessarily high for “easier” speech segments.
Still, as the nature and targeted use of the source data define on a case-by-case basis the optimum means for compacting it, the idea of a generic optimum coder directly applicable to any possible scenario is utopian; the development of source coding has diverged into many directions, each exploiting the data statistics and the imperfections of human senses to the maximum extent in a specialized manner.
In mobile networks, a speech coder is definitely one of the most crucial elements in providing the caller/callee with a satisfactory call experience, in addition to various voice storage and voice message services. Modern speech coders have a common starting point: a compact representation of digitised speech that preserves speech quality. Speech quality is truly a subjective measure concerning e.g. speech intelligibility and naturalness, although it is sometimes also “objectively” measured by utilizing weighted distortion measures. The techniques used in modeling, however, vary greatly. One speech-coding model heavily utilized today is called CELP (Code Excited Linear Prediction). CELP coders like GSM EFR (Enhanced Full Rate), the UMTS adaptive multi-rate coder AMR and TETRA ACELP (Algebraic Code Excited Linear Prediction) belong to the group of AbS (Analysis by Synthesis) coders and produce the speech parameters by modeling the speech signal via minimizing an error between the original and synthesized speech in a loop. CELP coders carry features from both waveform (common PCM etc.) and vocoder techniques.
Vocoders are parametric coders that exploit, for example, a source-filter approach in speech parameterisation. The source models the signal originated by the air flow from the lungs through the glottis, where the vocal cords either vibrate (resulting in voiced sounds) or remain stiff (resulting in unvoiced sounds, with turbulence originating from different shapes within the vocal tract), up to the oral cavities (mouth, throat), to be finally radiated out through the lips.
FIG. 1 discloses a generic sketch of a simplified human speech production model, called an LP (Linear Predictive) model, that is utilized in many contemporary speech coding methods like CELP. The process is called linear prediction since current output S(n) is determined by a weighted sum of previous output values and an input value generated by pulse source 102 or noise source 104 depending on the nature of the speech, which is roughly divided into voiced in the former case and unvoiced in the latter. Pulse source 102 emitting the impulse train imitates the vibration at the glottis with a corresponding fundamental frequency, called a pitch frequency, with a certain pitch period. The source type may be altered during the synthesis process via switch 106. Before the excitation source signal is filtered with all-pole IIR (Infinite Impulse Response) filter 110 modeling the vocal tract, it is multiplied by a proper gain factor in multiplier 108. Therefore, speech synthesis can be performed by first classifying the current speech segment under consideration as either voiced or unvoiced, and then by driving the excitation signal of the selected type through a multiplier and a synthesis filter. More about LP and speech modeling or coding in general can be found in reference [1].
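The synthesis chain of FIG. 1 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `lp_synthesize`, the sign convention s(n) = G·x(n) + Σ a(i)·s(n−i), the unit-amplitude pulse train and the uniform noise source are assumptions made for the example and are not taken from the text.

```python
import random

def lp_synthesize(a, gain, n_samples, voiced, pitch_period=80, seed=0):
    """Source-filter synthesis: excitation -> gain -> all-pole filter.

    Assumed convention: s(n) = gain * x(n) + sum_i a[i-1] * s(n - i).
    """
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        if voiced:
            # Pulse source: one unit impulse per pitch period.
            x = 1.0 if n % pitch_period == 0 else 0.0
        else:
            # Noise source: uniform noise stands in for turbulence.
            x = rng.uniform(-1.0, 1.0)
        s = gain * x
        # All-pole filter: weighted sum of previous output values.
        for i, coeff in enumerate(a, start=1):
            if n - i >= 0:
                s += coeff * out[n - i]
        out.append(s)
    return out
```

For instance, `lp_synthesize([0.5], 2.0, 6, voiced=True, pitch_period=4)` produces a decaying response that restarts at every fourth sample, which is the periodic behaviour the pulse source is meant to imitate.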
A typical CELP coder, presented in FIG. 2, and a corresponding decoder, presented in FIG. 3, comprise several filters for modeling speech generation, namely at least a short-term filter such as an LP(C) synthesis filter used for modeling the spectral envelope (formants; resonances introduced by the vocal tract) and a long-term filter the purpose of which is to model the oscillation of the vocal cords inducing periodicity in the voiced excitation signal, which comprises impulses separated by the current pitch period called a lag. The modeling is substantially targeted to a single speech segment, called a frame hereinafter, at a time. As can be noticed from FIG. 3, the decoder structure resembles the common LP synthesis model with an additional LTP (Long-Term Prediction) filter. The excitation signal is created on the basis of an excitation vector for the respective block. For example, in ACELP coders the excitation consists of a fixed number of non-zero pulses the positions and amplitudes of which are selected by utilizing a search in which a perceptually weighted error term between the original and synthesized speech frame is minimized.
To consider CELP encoding and decoding in more detail, an overview of the codec internals is presented herein. The encoder includes short-term analysis function 204 to form a set of direct form filter coefficients called LP parameters a(i), where i=1, 2, . . . , m (m thus defining the order of the analysis), for example. Parameters a(i) are calculated once for a speech frame of N samples, N corresponding to e.g. a time period of 20 milliseconds. As speech has a quasi-stationary nature, meaning it may be considered stationary if the inspection period is short enough (<=20 ms), optimum filter coefficients can be calculated on a frame-by-frame basis by utilizing standard mathematical means such as Wiener filter theory, which requires signal stationarity. The resulting equation, with its computationally exhaustive matrix inversion, may then be solved efficiently by exploiting e.g. the so-called autocorrelation method and Levinson-Durbin recursion. See reference [2] for further information. LP parameters a(i) are exploited in searching for the lag value that best matches the speech frame under analysis, in calculating a so-called LP residual by filtering the speech with the LPC analysis (or “inverse”) filter A(z), being the inverse of LPC synthesis filter 1/A(z), and naturally as coefficients of LPC synthesis filter 210 while creating synthesized speech signal ss(n). The lag value is calculated in LTP analysis block 202 and used by LTP synthesis filter 208. The long-term predictor, and the corresponding synthesis filter 208 being the inverse thereof, is typically like an LP predictor with a single tap only. The tap may optionally have a gain factor g2 of its own (thus defining the total gain of the one-tap LTP filter). LP parameters are also utilized in the excitation codebook search as described below.
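The autocorrelation method with Levinson-Durbin recursion mentioned above can be sketched as follows. This is a minimal, unoptimized sketch; the function names and the predictor sign convention s(n) ≈ Σ a(i)·s(n−i) are assumptions chosen for the example, and windowing and bandwidth expansion used by practical codecs are left out.

```python
def autocorrelation(frame, order):
    """Autocorrelation values r(0..order) of one speech frame."""
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(order + 1)]

def levinson_durbin(r, order):
    """Solve the LP normal equations without an explicit matrix inversion.

    Assumed convention: s(n) is predicted as sum_i a(i) * s(n - i).
    Returns (a(1..order), final prediction error energy).
    """
    a = [0.0] * (order + 1)          # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        if err == 0.0:               # degenerate (all-zero) frame
            break
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                # reflection coefficient of stage i
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err

def lp_residual(frame, a):
    """LP residual: the frame filtered with analysis filter A(z)."""
    res = []
    for n, s_n in enumerate(frame):
        pred = sum(a_i * frame[n - i]
                   for i, a_i in enumerate(a, start=1) if n - i >= 0)
        res.append(s_n - pred)
    return res
```

Feeding the recursion the autocorrelation values of a first-order process, e.g. `levinson_durbin([1.0, 0.5, 0.25], 2)`, yields a first coefficient of 0.5 and a second coefficient of zero, as expected when a single tap already explains the signal.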
In a basic CELP coder, after proper lag value T and LP parameters a(i) have been defined, iteration for an optimum excitation codebook vector according to the selected error criterion is started. In some advanced coding models it is possible to fine-tune the lag value or even the LP parameters while searching for the optimum excitation vector. During an iteration round, excitation vector c(n) is selected from codebook 206 and filtered through LTP and LPC synthesis filters 208, 210, and the resulting synthesised speech ss(n) is finally compared 218 with original speech signal s(n) in order to determine the difference, error e(n). Weighting filter 212, which is based on the characteristics of human hearing, is used to weight error signal e(n) in order to attenuate frequencies at which the error is less important according to auditory perception, and to correspondingly amplify frequencies that matter more. For example, errors in the areas of “formant valleys” may be emphasized, as errors in the synthesized speech are not so audible at the formant frequencies due to the auditory masking effect. Codebook search controller 214 is used to define index u of the code vector in codebook 206 according to the weighted error term acquired from weighting filter 212. Consequently, index u indicating the excitation vector that leads to the minimum possible weighted error is eventually selected. Controller 214 also provides scaling factor g, by which the code vector under analysis is multiplied 216 before LTP and LPC synthesis filtering. After a frame has been analysed, the parameters describing the frame (a(i), LTP parameters like T and optionally also gain g2, codebook vector index u or another identifier thereof, and codebook scaling factor g) are sent over a transmission channel (air interface, fixed transfer medium etc.) to the speech decoder at the receiving end.
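The iteration described above can be illustrated with a toy analysis-by-synthesis loop. The sketch below makes several simplifying assumptions that are not from the text: the perceptual weighting filter is omitted (a plain squared error is used), the filter memories start from zero, the scaling factor is chosen as the least-squares optimum for each candidate, and all function names are hypothetical.

```python
def ltp_synthesis(x, lag, g2):
    """One-tap long-term synthesis filter: y(n) = x(n) + g2 * y(n - lag)."""
    y = []
    for n, x_n in enumerate(x):
        past = y[n - lag] if n - lag >= 0 else 0.0  # assumption: zero memory
        y.append(x_n + g2 * past)
    return y

def lpc_synthesis(x, a):
    """All-pole filter 1/A(z): y(n) = x(n) + sum_i a(i) * y(n - i)."""
    y = []
    for n, x_n in enumerate(x):
        acc = x_n
        for i, a_i in enumerate(a, start=1):
            if n - i >= 0:
                acc += a_i * y[n - i]
        y.append(acc)
    return y

def search_codebook(target, codebook, a, lag, g2):
    """Return (index u, scaling factor g, squared error) of the best vector."""
    best = None
    for u, c in enumerate(codebook):
        # The filters are linear with zero memory, so scaling by g after
        # filtering equals scaling the code vector before filtering.
        shape = lpc_synthesis(ltp_synthesis(c, lag, g2), a)
        energy = sum(v * v for v in shape)
        if energy == 0.0:
            continue
        g = sum(t * v for t, v in zip(target, shape)) / energy
        err = sum((t - g * v) ** 2 for t, v in zip(target, shape))
        if best is None or err < best[2]:
            best = (u, g, err)
    return best
```

The returned index and scaling factor correspond to the parameters u and g that the encoder would transmit for the frame; a real encoder would weight the error before comparing candidates.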
Referring to FIG. 3, excitation codebook 306 corresponds to the one in the encoder and is used for generating excitation signal c(n) on the basis of received codebook index u. Excitation signal c(n) is then multiplied 312 by scaling factor g and directed to the LTP synthesis filter, which is supplied with necessary parameters T and g2. Finally, the effect of the vocal tract is added to the synthesized speech signal by LPC synthesis filtering 310, which provides decoded speech signal ss(n) as an output.
Considering next fixed codebook vector selection in an ACELP type speech encoder, the pulse positions are determined by minimizing the error between the actual weighted input speech and a synthesized version thereof:
$$e^2 = (s_p - g_2 H v - g H c)^2 \tag{1}$$
where $s_p$ is the perceptually weighted input speech, H is an LP model impulse response matrix utilizing the calculated LP parameters, c is the selected codebook vector and v is a so-called “adaptive codebook” vector explained later in the text. The minimization of the above error is in practice performed by maximizing the term:
$$\frac{(\tilde{s}^T H c_k)^2}{c_k^T H^T H c_k} \tag{2}$$
where $\tilde{s} = s_p - g_2 H v$ is hereinafter called a “target signal”, being equivalent to the perceptually weighted input speech signal from which the contribution of the adaptive codebook has been removed, and k is the index of the fixed codebook vector c under analysis.
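A brute-force version of this maximization can be sketched as below. It is an illustrative sketch under stated assumptions: two unit-amplitude signed pulses are tried at every pair of positions (real ACELP codebooks restrict pulses to interleaved tracks and use fast search strategies), H is built as a lower-triangular Toeplitz matrix from a hypothetical impulse response `h`, and the helper names are invented for the example.

```python
from itertools import combinations, product

def impulse_response_matrix(h, n):
    """Lower-triangular Toeplitz matrix H, so that H c is c filtered by h."""
    return [[h[r - c] if 0 <= r - c < len(h) else 0.0 for c in range(n)]
            for r in range(n)]

def matvec(H, c):
    """Plain matrix-vector product."""
    return [sum(h_rc * c_c for h_rc, c_c in zip(row, c)) for row in H]

def search_two_pulses(target, h):
    """Maximize (target^T H c)^2 / (c^T H^T H c) over two signed pulses.

    Returns (code vector, gain), where the gain is the least-squares
    scale of the winning filtered vector against the target.
    """
    n = len(target)
    H = impulse_response_matrix(h, n)
    best_score, best_c, best_gain = -1.0, None, 0.0
    for p1, p2 in combinations(range(n), 2):
        for s1, s2 in product((1.0, -1.0), repeat=2):
            c = [0.0] * n
            c[p1], c[p2] = s1, s2
            y = matvec(H, c)
            energy = sum(v * v for v in y)          # c^T H^T H c
            if energy == 0.0:
                continue
            corr = sum(t * v for t, v in zip(target, y))  # target^T H c
            score = corr * corr / energy
            if score > best_score:
                best_score, best_c, best_gain = score, c, corr / energy
    return best_c, best_gain
```

Because the criterion squares the correlation, a code vector and its negation score identically; the sign of the returned gain resolves that ambiguity.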
The concept of the adaptive codebook is illustrated in FIG. 4, which discloses the CELP synthesis model in an alternative manner that is quite similar to the common human speech production model of FIG. 1. The main difference, however, lies in the excitation signal generation part: as seen from FIG. 4, in CELP coders the selection between voiced and unvoiced excitation is usually not made at all, and the excitation includes adaptive codebook part 402 and fixed codebook part 404 corresponding to excitation signals v(n) and c(n) respectively, which are first individually weighted by gains g2 and g and then summed 408 together to form final excitation u(n) for LPC synthesis filter 410. Thus the periodicity of the LP residual, presented in FIGS. 2 and 3 with a separate LTP filter connected in series with the LPC synthesis filter, can alternatively be depicted as a feedback loop and adaptive codebook 402 comprising a delay element controlled by lag value T.
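The feedback structure of FIG. 4 can be expressed compactly in code. The sketch below assumes an integer lag T, a past-excitation buffer at least T samples long, and the relation v(n) = u(n − T); the function name `build_excitation` is hypothetical.

```python
def build_excitation(past_excitation, fixed_vector, lag, g2, g):
    """Form u(n) = g2 * v(n) + g * c(n) with v(n) = u(n - lag).

    The adaptive codebook is modeled as a delay line fed back from the
    final excitation, so lags shorter than the vector length also work.
    """
    if lag < 1 or lag > len(past_excitation):
        raise ValueError("lag must lie within the stored excitation history")
    history = list(past_excitation)
    start = len(history)
    for c_n in fixed_vector:
        v_n = history[-lag]                  # adaptive part: u(n - T)
        history.append(g2 * v_n + g * c_n)   # weighted sum feeds the delay line
    return history[start:]
```

The returned samples would then be passed to the LPC synthesis filter and appended to the excitation history for the next sub-frame.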
To concretise the goal of the algebraic fixed codebook search that is performed after the LPC and LTP analysis stages, an imaginary target signal of a single frame that should be modeled with an algebraic codebook to the maximum extent is presented in FIG. 5. If two pulses are to be allocated per frame (bold arrows), the optimum positions for them are near peaks 502, 504 in order to minimize the energy left in the remaining error signal. In this particular example, exactly two pulses with adjustable sign can be included in the frame. In a typical encoder, the number of codebook pulses per frame and the amplitudes thereof are predefined, although the overall amplitude of codebook vector c(n) can be altered via gain factor g. In addition to mere frames, the original signal may be divided into a number of sub-frames (e.g. 1-4) as well, which are then separately parameterised in relation to all or some of the required parameters. For example, the LPC analysis that results in the LPC coefficients may be executed only once per frame, so that a single set of LP parameters covers the whole frame, whereas codebook vectors (fixed algebraic and/or adaptive) can be analysed for each sub-frame.
Gain factor g can be calculated by
$$g = \frac{\tilde{s}^T H c_k}{c_k^T H^T H c_k}. \tag{3}$$
Although contemporary methods for modeling and regenerating an applicable excitation signal for the LP synthesis filter seem to provide somewhat adequate results in many cases, a number of problems still exist therein. Depending on the original input signal, the prediction error may or may not have serious peaks left in its time domain representation. The scenario can vary, and thus the fixed number of corrective pulses per frame may sometimes be enough to raise the modeling accuracy to a moderate level but sometimes not. Occasionally, as with some of the existing speech coders, the modeling result may actually get worse when unnecessary pulses are added to the excitation signal because the codec specifications do not allow the number of pulses in a single frame to be altered. On the other hand, if the number of pulses in a frame, and thus the total output bit-rate, is varied, the modeling process is more flexible but also more complex as regards the reception of variable length frames etc. A variable output bit-rate may also complicate network planning, as the transmission resources required by a single connection for transferring speech parameters are no longer fixed.
FIG. 8A discloses a target signal in a scenario wherein a frame has been divided into four sub-frames. LPC analysis is performed once per frame, and LTP and fixed codebook analysis on a sub-frame basis. The target signal comprises severe fluctuations 802, 804, 806, 808 in sub-frame 3. However, as the algebraic code vectors contain exactly two pulses, the pulses may be placed to cover peaks 802 and 804, but peaks 806 and 808 are left intact, thus degrading the modeling result.
Another defect in prior art coders relates to the so-called closed-loop search of the adaptive codebook vector in connection with the LTP analysis.
Usually an open-loop analysis is executed first in order to find a rough estimate of lag T and gain g2 concerning e.g. a whole frame at a time. During the open-loop search a weighted speech signal is simply correlated with delayed versions of itself, one at a time, in order to locate correlation maxima. The delay values corresponding to the found autocorrelation maxima, in principle especially the one producing the highest maximum, then moderately predict lag term T, as a correlation maximum often results from the periodicity of the speech signal.
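The open-loop search can be sketched as a plain autocorrelation scan. This is a minimal sketch: the function name and the lag range are assumptions, and the normalization or lag weighting that practical codecs apply to avoid favouring short lags or pitch multiples is not included.

```python
def open_loop_lag(signal, min_lag, max_lag):
    """Return the delay whose autocorrelation with the signal is largest."""
    best_lag, best_corr = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # Correlate the signal with a version of itself delayed by `lag`.
        corr = sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag
```

For a pulse train with a period of seven samples, the scan over lags 3 to 20 returns 7, with a second, lower maximum at the double period of 14.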
Thereafter, in a more accurate closed-loop adaptive codebook search, LTP filter lag T and gain g2 values are determined by minimizing the weighted error between the original and synthesized speech as in the algebraic fixed codebook search. This is achieved e.g. in the AMR codec on a sub-frame basis by maximizing the term:
$$R(k) = \frac{\sum_{n=0}^{L} s_p(n)\, y_k(n)}{\sqrt{\sum_{n=0}^{L} y_k(n)\, y_k(n)}} \tag{4}$$
where L is the sub-frame length (e.g. 40 samples) minus one, y(n) = v(n)*h(n) with * denoting convolution, and $y_k$ is thus the past LP synthesis filtered excitation (adaptive codebook vector) at delay k. More details about open/closed loop searches, especially in the case of the AMR codec, can be found in reference [3].
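A direct evaluation of this closed-loop criterion can be sketched as follows. The sketch assumes integer delays that are at least one sub-frame long, so that every candidate vector lies entirely in the stored past excitation, and it normalizes the correlation by the square root of the filtered vector's energy; the function names and the impulse response `h` are hypothetical.

```python
import math

def convolve_truncated(v, h):
    """y(n) = sum_j h(j) * v(n - j), truncated to the length of v."""
    return [sum(h[j] * v[n - j] for j in range(min(n, len(h) - 1) + 1))
            for n in range(len(v))]

def closed_loop_lag(target, past_excitation, h, lags):
    """Return the delay k that maximizes the normalized correlation R(k)."""
    n = len(target)
    best_lag, best_r = None, float("-inf")
    for k in lags:
        if k < n or k > len(past_excitation):
            raise ValueError("this sketch needs sub-frame <= k <= history length")
        start = len(past_excitation) - k
        v = past_excitation[start:start + n]   # excitation delayed by k
        y = convolve_truncated(v, h)           # filtered adaptive vector
        energy = sum(s * s for s in y)
        if energy <= 0.0:
            continue
        r = sum(t * s for t, s in zip(target, y)) / math.sqrt(energy)
        if r > best_r:
            best_lag, best_r = k, r
    return best_lag
```

Practical codecs typically restrict the candidates to a small range around the open-loop estimate and may add fractional delays, both of which are left out here.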
However, as the actual excitation for the span of the current frame is still unknown when the above term is maximised, the current LP residual is used as a substitute in scenarios with short delay values. See FIG. 9A for clarification. If delay k is short enough, i.e. signal yk requires samples from the current sub-frame, no excitation for the current sub-frame is available yet, as the algebraic search is still to be conducted. Therefore, a straightforward solution is to use the already available LP residual (which may initially be calculated even for the whole frame) as a substitute for the missing part of the excitation vector, corresponding to the time period between reference signs 902 and 904. On the other hand, the buffer for previous excitation can usually be made large enough (the three dots emphasize this in the figure) to avoid situations where delay k is too long and the required excitation is no longer available in the buffer.
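The substitution described above can be sketched as follows. This is an illustrative sketch only: it assumes an integer delay, a past-excitation buffer at least as long as the delay, and an LP residual already computed for the current sub-frame; the function name `adaptive_vector` is hypothetical.

```python
def adaptive_vector(past_excitation, lp_residual, lag, subframe_len):
    """Build the adaptive codebook vector v(n) = u(n - lag) for one sub-frame.

    Samples that would fall inside the current sub-frame (n - lag >= 0)
    are not known yet, so the LP residual is used in their place.
    """
    if lag < 1 or lag > len(past_excitation):
        raise ValueError("lag must lie within the stored excitation history")
    v = []
    for n in range(subframe_len):
        idx = n - lag
        if idx < 0:
            v.append(past_excitation[idx])   # counts back from the buffer end
        else:
            v.append(lp_residual[idx])       # substitute for missing excitation
    return v
```

With a delay of two samples and a sub-frame of four, the first two samples come from the buffer and the last two from the residual; with a delay no shorter than the sub-frame, the residual is never touched.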