1. Field of the Invention
The present invention relates to speech coding algorithms and, more particularly to a Phase Excited Linear Predictive (PELP) low bit rate speech synthesizer and a pitch detector for a PELP synthesizer.
2. Background of Related Art
Mobile communications are growing at a phenomenal rate due to the success of several different second-generation digital cellular technologies, including GSM, TDMA and CDMA. To improve data throughput and sound quality, considerable effort is being devoted to the development of speech coding algorithms. Indeed, speech coding is applicable to a wide range of applications, including mobile telephony, internet phones, automatic answering machines, secure speech transmission, storing and archiving speech and voice paging networks.
Waveform codecs are capable of providing good quality speech at bit rates down to about 16 kbits/s, but are of limited use at rates lower than 16 kbit/s. Vocoders on the other hand can provide intelligible speech at 2.4 kbits/s and below, but cannot provide natural sounding speech at any bit rate. Hybrid codecs attempt to fill the gap between waveform and source codecs. The most commonly used hybrid codecs are time domain Analysis-by-Synthesis (AbS) codecs. Such codecs use the same linear prediction filter model of the vocal tract as found in Linear Predictive Coding (LPC) vocoders. However, instead of applying a simple two-state, voiced/unvoiced, model to find the necessary filter input, the excitation signal is chosen by matching the reconstructed speech waveform as closely as possible to the original speech waveform.
The distinguishing feature of AbS codecs is how the excitation waveform for the synthesis filter is chosen. AbS codecs split the input speech to be coded into frames, typically about 20 ms long. For each frame, parameters are determined for a synthesis filter, and then the excitation to the synthesis filter is determined by finding the excitation signal which when passed into the synthesis filter minimizes the error between the input speech and the reconstructed speech. Thus, the encoder analyses the input speech by synthesizing many different approximations to the input speech. For each frame, the encoder transmits information representing the synthesis filter parameters and the excitation to the decoder and, at the decoder, the given excitation is passed through the synthesis filter to generate the reconstructed speech. However, the numerical complexity involved in passing every possible excitation signal through the synthesis filter is quite large and thus, must be reduced, but without significantly compromising the performance of the codec.
The synthesis filter is usually an all pole, short-term, linear filter intended to model the correlations introduced into speech by the action of the vocal tract. The synthesis filter may also include a pitch filter to model the long-term periodicities present in voiced speech. Alternatively these long-term periodicities may be exploited by using an adaptive codebook in the excitation generator so that the excitation signal includes a component of the estimated pitch period.
There are various kinds of AbS codecs, such as Multi-Pulse Excited (MPE), Regular-Pulse Excited (RPE), and Code-Excited Linear Predictive (CELP). Generally MPE and RPE codecs will work without a pitch filter, although their performance will be improved if one is included. For CELP codecs a pitch filter is extremely important.
The differences between MPE, RPE and CELP codecs arise from the representation of the excitation signal. In MPE codecs, the excitation signal is given by a fixed number of non-zero pulses for every frame of speech. The positions of these non-zero pulses within the frame and their amplitudes must be determined by the encoder and transmitted to the decoder. In theory it is possible to find the best values for all the pulse positions and amplitudes, but this is not practical due to the excessive complexity required. In practice some sub-optimal method of finding the pulse positions and amplitudes must be used. Typically about 4 pulses per 5 ms can be used for good quality reconstructed speech at a bit-rate of around 10 kbits/s.
Like the MPE codec, the RPE codec uses a number of non-zero pulses to represent the excitation signal. However, the pulses are regularly spaced at a fixed interval, and the encoder only needs to determine the position of the first pulse and the amplitude of all the pulses. Therefore less information needs to be transmitted about pulse positions, so for a given bit rate the RPE codec can use more non-zero pulses than the MPE codec. For example, at a bit rate of about 10 kbits/s around 10 pulses per 5 ms can be used, compared to 4 pulses for MPE codecs. This allows RPE codecs to give slightly better quality reconstructed speech than MPE codecs.
Although MPE and RPE codecs provide good quality speech at rates of around 10 kbits/s and higher, they are not suitable for lower rates due to the large amount of information that must be transmitted about the excitation pulses' positions and amplitudes. If the bit rate is reduced by using fewer pulses or by coarsely quantizing the pulse amplitudes, the reconstructed speech quality deteriorates rapidly.
Currently the most commonly used algorithm for producing good quality speech at rates below 10 kbits/s is CELP. CELP differs from MPE and RPE in that the excitation signal is effectively vector quantized. The excitation signal is given by an entry from a large vector quantizer codebook and a gain term to control its power. The codebook index is represented with about 10 bits and the gain is coded with about 5 bits. Thus, the bit rate necessary to transmit the excitation information is about 15 bits. CELP coding has been used to produce toll quality speech communications at bit rates between 4.8 and 16 kbits/s.
It is an object of the present invention to provide an efficient speech coding algorithm operable at low bit rates yet capable of reproducing high quality speech.