A source-filter model of speech is illustrated schematically in FIG. 1a. As shown, speech can be modelled as comprising a signal from a source 102 passed through a time-varying filter 104. The source signal represents the immediate vibration of the vocal cords, and the filter represents the acoustic effect of the vocal tract formed by the shape of the throat, mouth and tongue. The effect of the filter is to alter the frequency profile of the source signal so as to emphasise or diminish certain frequencies. Instead of trying to directly represent an actual waveform, speech encoding works by representing the speech using parameters of a source-filter model.
As illustrated schematically in FIG. 1b, the encoded signal will be divided into a plurality of frames 106, with each frame comprising a plurality of subframes 108. For example, speech may be sampled at 16 kHz and processed in frames of 20 ms, with some of the processing done in subframes of 5 ms (four subframes per frame). Each frame comprises a flag 107 by which it is classed according to its respective type. Each frame is thus classed at least as either “voiced” or “unvoiced”, and unvoiced frames are encoded differently than voiced frames. Each subframe 108 then comprises a set of parameters of the source-filter model representative of the sound of the speech in that subframe.
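By way of illustration only, the framing arithmetic of the above example can be sketched as follows. The sketch assumes the example figures given above (16 kHz sampling, 20 ms frames, 5 ms subframes); the function names `samples_in` and `split_into_frames` are hypothetical and chosen purely for this illustration.

```python
SAMPLE_RATE_HZ = 16_000  # sampling rate from the example in the text
FRAME_MS = 20            # frame duration from the example in the text
SUBFRAME_MS = 5          # subframe duration from the example in the text


def samples_in(duration_ms, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples spanned by a duration given in milliseconds."""
    return sample_rate_hz * duration_ms // 1000


def split_into_frames(samples, frame_len, subframe_len):
    """Split samples into whole frames, each a list of subframes.

    A trailing partial frame, if any, is dropped in this sketch.
    """
    frames = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        subframes = [frame[i:i + subframe_len]
                     for i in range(0, frame_len, subframe_len)]
        frames.append(subframes)
    return frames


frame_len = samples_in(FRAME_MS)
subframe_len = samples_in(SUBFRAME_MS)
one_second = list(range(SAMPLE_RATE_HZ))  # placeholder samples, not real speech
frames = split_into_frames(one_second, frame_len, subframe_len)
print(frame_len, subframe_len, len(frames), len(frames[0]))  # 320 80 50 4
```

With these figures, each frame spans 320 samples and each subframe 80 samples, so that one second of speech yields 50 frames of four subframes each.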
For voiced sounds (e.g. vowel sounds), the source signal has a degree of long-term periodicity corresponding to the perceived pitch of the voice. In that case, the source signal can be modelled as comprising a quasi-periodic signal with each period comprising a series of pulses of differing amplitudes. The source signal is said to be “quasi” periodic in that on a timescale of at least one subframe it can be taken to have a single, meaningful period which is approximately constant, whereas over many subframes or frames the period and form of the signal may change. The approximated period at any given point may be referred to as the pitch lag. An example of a modelled source signal 202 is shown schematically in FIG. 2a with a gradually varying period P1, P2, P3, etc., each comprising four pulses which may vary gradually in form and amplitude from one period to the next.
According to many speech coding algorithms, such as those using Linear Predictive Coding (LPC), a short-term filter is used to separate the speech signal into two components: (i) a signal representative of the effect of the time-varying filter 104; and (ii) the remaining signal with the effect of the filter 104 removed, which is representative of the source signal. The signal representative of the effect of the filter 104 may be referred to as the spectral envelope signal, and typically comprises a series of sets of LPC parameters describing the spectral envelope at each stage. FIG. 2b shows a schematic example of a sequence of spectral envelopes 204₁, 204₂, 204₃, etc. varying over time. Once the varying spectral envelope is removed, the remaining signal representative of the source alone may be referred to as the LPC residual signal, as shown schematically in FIG. 2a.
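The separation described above can be sketched, purely by way of example, using the autocorrelation method with the Levinson-Durbin recursion, which is one common way of obtaining LPC parameters and is not asserted to be the method of any particular codec. The toy input is an assumption made for this illustration: a pulse train with a period of 80 samples passed through a fixed second-order all-pole filter with coefficients 1.3 and -0.8. All function names are hypothetical.

```python
def autocorrelation(x, max_lag):
    """r[k] = sum over n of x[n] * x[n - k], for k = 0..max_lag."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x)))
            for k in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a[1..order] such that
    x[n] is approximated by sum over k of a[k] * x[n - k].

    Assumes r[0] > 0 (a non-silent input). Returns (coefficients, error).
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / err
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:], err


def lpc_residual(x, coeffs):
    """Remove the short-term prediction, leaving the LPC residual."""
    residual = []
    for n in range(len(x)):
        predicted = sum(c * x[n - k]
                        for k, c in enumerate(coeffs, start=1) if n - k >= 0)
        residual.append(x[n] - predicted)
    return residual


def synthesise_toy_speech(n_samples=320, period=80):
    """Toy 'speech': a pulse-train source through an all-pole filter."""
    true_coeffs = [1.3, -0.8]  # assumed filter, for illustration only
    x = []
    for n in range(n_samples):
        excitation = 1.0 if n % period == 0 else 0.0
        past = sum(c * x[n - k]
                   for k, c in enumerate(true_coeffs, start=1) if n - k >= 0)
        x.append(excitation + past)
    return x


speech = synthesise_toy_speech()
r = autocorrelation(speech, 2)
coeffs, _ = levinson_durbin(r, 2)
residual = lpc_residual(speech, coeffs)
```

In this toy case the estimated coefficients lie close to the assumed filter coefficients, and the residual carries considerably less energy than the input, being concentrated around the pulses of the source.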
The spectral envelope signal and the source signal are each encoded separately for transmission. In the illustrated example, each subframe 108 would contain: (i) a set of parameters representing the spectral envelope 204; and (ii) a set of parameters representing the pulses of the source signal 202.
Where long-term prediction (LTP) is additionally used to exploit the periodicity of the source signal, each subframe 108 in the illustrated example would comprise: (i) a quantised set of LPC parameters representing the spectral envelope, (ii)(a) a quantised LTP vector related to the correlation between pitch-periods in the source signal, and (ii)(b) a quantised LTP residual signal representative of the source signal with the effects of both the inter-period correlation and the spectral envelope removed.
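The removal of the inter-period correlation can be sketched, by way of example only, with a single-tap long-term predictor. This is a simplification assumed for the illustration: the LTP vector referred to above would in general hold several coefficients, and quantisation is omitted. The toy source signal (four pulses per period, a period of 80 samples, each period scaled by 0.95 relative to the previous one) and all function names are hypothetical.

```python
def toy_source(n_periods=4, period=80, decay=0.95):
    """Quasi-periodic pulse signal whose amplitude varies gradually."""
    pulses = {0: 1.0, 10: -0.6, 25: 0.4, 50: -0.2}  # offset -> amplitude
    x = [0.0] * (n_periods * period)
    for p in range(n_periods):
        for offset, amplitude in pulses.items():
            x[p * period + offset] = amplitude * decay ** p
    return x


def lag_correlation(x, lag):
    """Cross term and energy term between x[n] and x[n - lag]."""
    num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
    den = sum(x[n - lag] ** 2 for n in range(lag, len(x)))
    return num, den


def estimate_pitch_lag(x, min_lag, max_lag):
    """Pick the lag that maximises the normalised correlation num^2 / den."""
    best_lag, best_score = min_lag, 0.0
    for lag in range(min_lag, max_lag + 1):
        num, den = lag_correlation(x, lag)
        if num > 0.0 and den > 0.0:
            score = num * num / den
            if score > best_score:
                best_lag, best_score = lag, score
    return best_lag


def ltp_gain(x, lag):
    """Least-squares gain of a single-tap long-term predictor."""
    num, den = lag_correlation(x, lag)
    return num / den if den > 0.0 else 0.0


def ltp_residual(x, lag, gain):
    """Subtract the prediction from one pitch lag earlier, where available."""
    return [x[n] - gain * x[n - lag] if n >= lag else x[n]
            for n in range(len(x))]


source = toy_source()
lag = estimate_pitch_lag(source, min_lag=32, max_lag=160)
gain = ltp_gain(source, lag)
residual = ltp_residual(source, lag, gain)
```

For this toy signal the estimated pitch lag is 80 samples and the gain is approximately 0.95, so that only the first period, which has no preceding period from which to predict it, remains in the LTP residual.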
The residual signal comprises information present in the original input speech signal that is not represented by the quantized LPC parameters and LTP vector. This information must be encoded and sent with the LPC and LTP parameters in order to allow the encoded speech signal to be accurately synthesized at the decoder.
It is common to provide forward error correction (FEC) when transmitting packetized data over a lossy channel. FEC adds information about the content of a previous packet to the current packet. If that previous packet is received, the primary information it contains is used for decoding an output signal. If, on the other hand, the previous packet was lost, then the FEC information in the current packet can be used to update the state of the decoder and to decode an output signal for the lost packet.
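The receiver behaviour described above can be sketched as follows, by way of example only. The `Packet` structure, its field names and the functions `packetize` and `receive` are hypothetical. For simplicity the sketch carries a verbatim copy of the previous frame as the FEC information, whereas a practical scheme may carry a more compact description, and the update of decoder state is not modelled.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Packet:
    seq: int                    # sequence number of the primary frame
    primary: bytes              # primary information for frame `seq`
    fec_prev: Optional[bytes]   # FEC information about frame `seq - 1`


def packetize(frames: List[bytes]) -> List[Packet]:
    """Attach information about the previous frame to each packet."""
    return [Packet(seq, frame, frames[seq - 1] if seq > 0 else None)
            for seq, frame in enumerate(frames)]


def receive(packets: List[Packet], n_frames: int) -> List[Optional[bytes]]:
    """Prefer primary information; fall back to FEC for lost packets."""
    output: List[Optional[bytes]] = [None] * n_frames
    for packet in packets:
        output[packet.seq] = packet.primary
    for packet in packets:
        prev = packet.seq - 1
        if prev >= 0 and output[prev] is None and packet.fec_prev is not None:
            output[prev] = packet.fec_prev  # recover the lost frame
    return output


frames = [b"frame0", b"frame1", b"frame2", b"frame3"]
sent = packetize(frames)

# Packet 1 is lost; its content is recovered from the FEC data in packet 2.
single_loss = receive([p for p in sent if p.seq != 1], len(frames))

# Packets 1 and 2 are both lost; only frame 2 can be recovered (from packet 3).
burst_loss = receive([p for p in sent if p.seq not in (1, 2)], len(frames))
```

The second case illustrates a limitation of this simple arrangement: when two consecutive packets are lost, the earlier of the two cannot be recovered, because the packet that carried its FEC information was itself lost.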
FEC can roughly be divided into two categories: media-specific (also referred to as media-dependent) FEC and media-independent FEC. Media-independent FEC works by adding redundancy to the bits of two or more payloads. One example of this is simply XORing multiple payloads to create the redundant information. If any one of the payloads is lost, then the XORed information together with the other payloads can be used to recreate the lost payload. Reed-Solomon coding is another example of media-independent FEC. In the case of media-independent FEC, no re-encoding of the signal takes place.
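The XOR example above can be sketched as follows, assuming for simplicity that all payloads have the same length and that at most one payload of the group is lost. The function names are hypothetical.

```python
def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bitwise XOR of two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("payloads must have equal length in this sketch")
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(payloads):
    """Redundant information: the XOR of all payloads in the group."""
    parity = bytes(len(payloads[0]))  # all-zero block
    for payload in payloads:
        parity = xor_bytes(parity, payload)
    return parity


def recover_missing(received, parity):
    """Recreate the single lost payload, marked as None in `received`."""
    missing = [i for i, payload in enumerate(received) if payload is None]
    if len(missing) != 1:
        raise ValueError("exactly one payload must be missing")
    recovered = parity
    for payload in received:
        if payload is not None:
            recovered = xor_bytes(recovered, payload)
    return missing[0], recovered


payloads = [b"AAAA", b"BBBB", b"CCCC"]
parity = make_parity(payloads)
index, recovered = recover_missing([b"AAAA", None, b"CCCC"], parity)
```

Because XORing a value with itself yields zero, XORing the parity block with the payloads that did arrive leaves exactly the lost payload.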
Media-dependent FEC includes methods where a lower bit rate speech coder is used to generate the redundant information through a process of re-encoding the signal. The redundant information is piggybacked onto other packets. This is sometimes also called low bit rate redundancy (LBRR). For example, see IETF RFC 2354 and RFC 2198.
In order for FEC to work it is important that the bit rate can be controlled. For media-independent FEC this can be achieved by increasing the delay and XORing more packets together. However, for real-time communication, increasing the delay is not a desirable solution. Also, in combination with a variable bit rate speech coder, XOR-based FEC has a deficiency because the size of the redundant information block is determined by the largest payload used in the XORing process. Furthermore, the payload length has to be sent as side information, thus creating extra overhead.
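The deficiency noted above for variable bit rate payloads can be illustrated with the following sketch, in which shorter payloads are implicitly zero-padded to the length of the largest payload and the original lengths are carried alongside as side information. The payload contents and function names are hypothetical, and the manner in which the lengths would actually be signalled is not modelled.

```python
def make_parity_block(payloads):
    """XOR payloads of differing lengths, zero-padding the shorter ones.

    Returns the parity block and the list of payload lengths, the latter
    standing in for the side information needed to undo the padding.
    """
    block_len = max(len(p) for p in payloads)
    parity = bytearray(block_len)
    for payload in payloads:
        for i, byte in enumerate(payload):
            parity[i] ^= byte
    return bytes(parity), [len(p) for p in payloads]


def recover_payload(received, parity, lengths):
    """Recreate the single lost payload (None) and strip its padding."""
    missing = [i for i, payload in enumerate(received) if payload is None]
    if len(missing) != 1:
        raise ValueError("exactly one payload must be missing")
    block = bytearray(parity)
    for payload in received:
        if payload is not None:
            for i, byte in enumerate(payload):
                block[i] ^= byte
    return bytes(block[:lengths[missing[0]]])


payloads = [b"abc", b"0123456789", b"hello"]  # 3, 10 and 5 bytes
parity, lengths = make_parity_block(payloads)
recovered = recover_payload([None, b"0123456789", b"hello"], parity, lengths)
```

Here the parity block is 10 bytes long even though two of the three payloads are much shorter, and without the length of the lost payload the receiver could not tell where the recovered data ends and the padding begins.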
When another, lower bit rate speech coder is used to generate the redundant information, the bit rate can be controlled as long as coders operating at different rates are available. The drawback of this solution is that the two encoders need to operate in parallel, which results in a large increase in complexity. Low bit rate speech coders often exploit long-term correlation to encode the signal efficiently, which means that the encoder and decoder states need to be in sync for correct decoding. This in turn means increased complexity on the decoder side, as two decoders are required to operate in parallel.
It is an aim of some embodiments of the present invention to address, or at least mitigate, some of the above identified problems of the prior art.