In digital radio telephone systems the information, i.e. speech, is digitally encoded prior to being transmitted over the air. The encoded speech is then decoded at the receiver. First, an analogue speech signal is digitally encoded using Pulse Code Modulation (PCM) for example. Then speech coding and decoding of the PCM speech (or original speech) is implemented by speech coders and decoders. Due to the increase in use of radio telephone systems the radio spectrum available for such systems is becoming crowded. In order to make the best possible use of the available radio spectrum, radio telephone systems utilise speech coding techniques which require low numbers of bits to encode the speech in order to reduce the bandwidth required for the transmission. Efforts are continually being made to reduce the number of bits required for speech coding to further reduce the bandwidth required for speech transmission.
A known speech coding/decoding method is based on linear predictive coding (LPC) techniques, and utilises analysis-by-synthesis excitation coding. In an encoder utilising such a method, a speech sample is first analysed to derive parameters which represent characteristics such as wave form information (LPC) of the speech sample. These parameters are used as inputs to short-term synthesis filter. The short-term synthesis filter is excited by signals which are derived from a code book of signals. The excitation signals may be random, e.g. a stochastic code book, or may be adaptive or specifically optimised for use in speech coding. Typically, the code book comprises two parts, a fixed code book and the adaptive code book. The excitation outputs of respective code books are combined and the total excitation input to the short term synthesis filter. Each total excitation signal is filtered and the result compared with the original speech sample (PCM coded) to derive an "error" or difference between the synthesised speech sample and the original speech sample. The total excitation which results in the lowest error is selected as the excitation for representing the speech sample. The code book indices, or addresses, of the location of respective partial optimal excitation signals in the fixed and adaptive code book are transmitted to a receiver, together with the LPC parameters or coefficients. A composite code book identical to that at the transmitter is also located at the receiver, and the transmitted code book indices and parameters are used to generate the appropriate total excitation signal from the receiver's code book. This total excitation signal is then fed to a short-term synthesis filter identical to that in the transmitter, and having the transmitted LPC coefficients as respective inputs. The output from the short-term synthesis filter is a synthesised speech frame which is the same as that generated in the transmitter by the analysis-by-synthesis method.
Due to the nature of digital coding, although the synthesised speech is objectively accurate it sounds artificial. Also, degradations, distortions and artifacts are introduced into the synthesised speech due to quantisation effects and other anomalies due to the electronic processing. Such artifacts particularly occur in low bit-rate coding since there is insufficient information to reproduce the original speech signal exactly. Hence there have been attempts to improve the perceptual quality of synthesised speech. This has been attempted by the use of post-filters which operate on the synthesised speech sample to enhance its perceived quality. Known post-filters are located at the output of the decoder and process the synthesised speech signal to emphasise or attenuate what are generally considered to be the most important frequency regions in speech. The importance of respective regions of speech frequencies has been analysed primarily using subjective tests on the quality of the resulting speech signal to the human ear. Speech can be split into two basic parts, the spectral envelope (formant structure) or the spectral harmonic structure (line structure), and typically post-filtering emphasises one or other, or both of these parts of a speech signal. The filter coefficients of the post-filter are adapted depending on the characteristics of the speech signal to match the speech sounds. A filter emphasising or attenuating the harmonic structure is typically referred to as a long-term, or pitch or long delay post filter, and a filter emphasising the spectral envelope structure is typically referred to as a short delay post filter or short-term post filter.
A further known filtering technique for improving the perceptual quality of synthesised speech is disclosed in International Patent Application WO 91/06091. A pitch prefilter is disclosed in WO 91/06091 comprising a pitch enhancement filter, normally disposed at a position after a speech synthesis or LPC filter, moved to a position before the speech synthesis or LPC filter where it filters pitch information contained in the excitation signals input to the speech synthesis or LPC filter.
However, there is still a desire to produce synthesised speech which has even better perceptual quality.