This invention relates to a communication system which comprises an encoder device for encoding a sequence of digital speech signals into a set of excitation pulses and/or a decoder device communicable with the encoder device.
As known in the art, a conventional communication system of the type described is used for transmitting a speech signal at a low transmission bit rate, such as 4.8 kb/s, from a transmitting end to a receiving end. The transmitting and the receiving ends comprise an encoder device and a decoder device, respectively, which are operable to encode and decode the speech signals in the manner which will be described in more detail. A wide variety of such systems have been proposed to improve the speech quality reproduced in the decoder device and to reduce the transmission bit rate.
Among others, a pitch interpolation multi-pulse system has been proposed in Japanese Unexamined Patent Publications Nos. Syo 61-15000 and 62-038500, namely, 15000/1986 and 038500/1987, which may be called first and second references, respectively. In this pitch interpolation multi-pulse system, the encoder device is supplied with a sequence of digital speech signals at every frame of, for example, 20 milliseconds and extracts a spectrum parameter and a pitch parameter which will be called first and second primary parameters, respectively. The spectrum parameter is representative of a spectrum envelope of a speech signal specified by the digital speech signal sequence while the pitch parameter is representative of a pitch of the speech signal. Thereafter, the digital speech signal sequence is classified into a voiced sound and an unvoiced sound which last for voiced and unvoiced durations, respectively. In addition, the digital speech signal sequence is divided at every frame into a plurality of pitch durations, each of which may be referred to as a subframe. Under the circumstances, operation is carried out in the encoder device to calculate a set of excitation pulses representative of a sound source signal specified by the digital speech signal sequence.
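The framing figures above can be made concrete with a little arithmetic. The following is a minimal sketch, not the encoder of the references: the 8 kHz sampling rate, the fixed pitch period of 40 samples, and the function names are assumptions introduced for illustration. Only the 4.8 kb/s bit rate and the 20-millisecond frame come from the description.

```python
def bits_per_frame(bit_rate_bps, frame_ms):
    """Number of bits available to describe one frame."""
    return bit_rate_bps * frame_ms // 1000


def split_into_pitch_durations(frame_len, pitch_period):
    """Divide a frame of frame_len samples into consecutive subframes,
    each one pitch period long (the last one may be shorter)."""
    return [(start, min(start + pitch_period, frame_len))
            for start in range(0, frame_len, pitch_period)]


SAMPLE_RATE_HZ = 8000   # assumption, not stated in the text
FRAME_MS = 20           # frame length given in the description
frame_len = SAMPLE_RATE_HZ * FRAME_MS // 1000    # samples per frame

budget = bits_per_frame(4800, FRAME_MS)          # bits per frame at 4.8 kb/s
subframes = split_into_pitch_durations(frame_len, 40)  # 40 is a hypothetical pitch period
```

Under these assumed values, a frame holds 160 samples and has a budget of 96 bits for the spectrum parameter, the pitch parameter, and the excitation description, which suggests why the excitation pulses are calculated for a single subframe rather than for every subframe.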
More specifically, the sound source signal for the voiced duration is represented by the excitation pulse set which is calculated with respect to a selected one of the pitch durations that may be called a representative duration. From this fact, it should be understood that each set of the excitation pulses is extracted from an intermittent subframe. Subsequently, an amplitude and a location of each excitation pulse of the set are transmitted from the transmitting end to the receiving end along with the spectrum and the pitch parameters. On the other hand, a sound source signal of a single frame for the unvoiced duration is represented by a small number of excitation pulses and a noise signal. Thereafter, an amplitude and a location of each excitation pulse are transmitted for the unvoiced duration together with a gain and an index of the noise signal. At any rate, the amplitudes and the locations of the excitation pulses, the spectrum and the pitch parameters, and the gains and the indices of the noise signals are sent as a sequence of output signals from the transmitting end to the receiving end, which comprises the decoder device.
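The unvoiced representation described above, a few excitation pulses plus a noise signal identified by a gain and an index, can be sketched as follows. This is an illustration under stated assumptions: the text does not describe how the noise signals are stored, so the Gaussian codebook, its size, and the function names make_noise_codebook and unvoiced_excitation are hypothetical.

```python
import random


def make_noise_codebook(n_entries, length, seed=0):
    """Hypothetical codebook: n_entries fixed noise sequences that the
    encoder and the decoder are assumed to share."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)]
            for _ in range(n_entries)]


def unvoiced_excitation(pulses, gain, index, codebook):
    """Rebuild one frame of unvoiced excitation from its transmitted parts.

    pulses: list of (location, amplitude) pairs, location in samples
    gain, index: scale factor and codebook entry of the noise signal
    """
    excitation = [gain * v for v in codebook[index]]
    for location, amplitude in pulses:
        excitation[location] += amplitude
    return excitation


codebook = make_noise_codebook(n_entries=4, length=160)
frame = unvoiced_excitation([(10, 1.0), (75, -0.5)], gain=0.3, index=2,
                            codebook=codebook)
```

Only the pulse amplitudes and locations, the gain, and the index need to be sent in such a scheme, since the noise sequence itself is looked up at the receiving end.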
On the receiving end, the decoder device is supplied with the output signal sequence as a sequence of reception signals which carries information related to sets of excitation pulses extracted from frames, as mentioned above. Consider a current set of excitation pulses extracted from a representative duration of a current frame and a next set of excitation pulses extracted from a representative duration of a next frame following the current frame. In this event, interpolation is carried out for the voiced duration by the use of the amplitudes and the locations of the current and the next sets of the excitation pulses to reconstruct excitation pulses in the remaining subframes except the representative durations and to reproduce a sequence of driving sound source signals for each frame. On the other hand, a sequence of driving sound source signals for each frame is reproduced for an unvoiced duration by the use of indices and gains of the excitation pulses and the noise signals.
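The interpolation step for the voiced duration can be sketched as follows. The description does not give the interpolation rule, so this sketch assumes simple linear interpolation of both amplitude and location, assumes the two pulse sets have the same number of pulses paired in order, and assumes the representative duration is the first subframe of each frame. The function name and parameters are hypothetical.

```python
def interpolate_excitation(current_pulses, next_pulses, n_subframes, pitch_period):
    """Fill every subframe of the current frame with pulses interpolated
    between the current and the next representative durations.

    current_pulses, next_pulses: lists of (location, amplitude) pairs,
    with location measured from the start of the representative duration.
    Returns the driving sound source signal for one frame as a list.
    """
    if len(current_pulses) != len(next_pulses):
        raise ValueError("pulse sets must have the same number of pulses")
    excitation = [0.0] * (n_subframes * pitch_period)
    for k in range(n_subframes):
        w = k / n_subframes  # 0 at the current representative duration
        for (loc_c, amp_c), (loc_n, amp_n) in zip(current_pulses, next_pulses):
            location = round((1 - w) * loc_c + w * loc_n)
            amplitude = (1 - w) * amp_c + w * amp_n
            excitation[k * pitch_period + location] += amplitude
    return excitation


# One pulse per representative duration, four subframes of 40 samples each.
driving = interpolate_excitation([(2, 1.0)], [(6, 3.0)],
                                 n_subframes=4, pitch_period=40)
```

In this example the pulse drifts from location 2 toward location 6 and grows from amplitude 1.0 toward 3.0 across the frame, so the subframes that were never transmitted receive smoothly varying pulses.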
Thereafter, the driving sound source signals thus reproduced are given to a synthesis filter formed by the use of a spectrum parameter and are synthesized into a synthesized sound signal.
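A synthesis filter of this kind is commonly realized as an all-pole filter whose coefficients are the linear prediction coefficients. The sketch below assumes that form and the sign convention s[n] = e[n] + a_1 s[n-1] + ... + a_p s[n-p]; neither the filter structure nor the convention is spelled out in the text, and the function name is hypothetical.

```python
def synthesize(excitation, lpc_coeffs):
    """Pass a driving sound source signal through an all-pole filter.

    lpc_coeffs[k - 1] is the coefficient a_k applied to the output
    sample k steps in the past.
    """
    order = len(lpc_coeffs)
    output = []
    for n, e in enumerate(excitation):
        s = e
        for k in range(1, order + 1):
            if n - k >= 0:
                s += lpc_coeffs[k - 1] * output[n - k]
        output.append(s)
    return output


# Impulse response of a first-order filter with a_1 = 0.5.
response = synthesize([1.0, 0.0, 0.0, 0.0], [0.5])
```

Each excitation pulse therefore rings through the filter, and the spectrum parameter determines how quickly that ringing decays at each frequency.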
With this structure, each set of the excitation pulses is intermittently extracted from each frame in the encoder device and is reproduced into the synthesized sound signal by an interpolation technique in the decoder device. Herein, it is to be noted that the intermittent extraction of the excitation pulses makes it difficult to reproduce the driving sound source signal in the decoder device at a transient portion at which the characteristic of the sound source signal changes. Such a transient portion appears when a vowel is changed to another vowel on concatenation of vowels in the speech signal and when a voiced sound is changed to another voiced sound. In a frame including such a transient portion, the driving sound source signals reproduced by the use of the interpolation technique are markedly different from the actual sound source signals, which results in degradation of the quality of the synthesized sound signal.
Furthermore, the above-mentioned pitch interpolation multi-pulse system conveniently represents the sound source signals when the sound source signals have distinct periodicity. However, in practice, the sound source signals do not have distinct periodicity at a nasal portion within the voiced duration. Therefore, it is difficult to correctly or completely represent the sound source signals at the nasal portion by the pitch interpolation multi-pulse system.
On the other hand, it has been confirmed by a perceptual experiment that the transient portion and the nasal portion are very important for the perception of phonemes and for the perception of naturalness. Under the circumstances, it is readily understood that a natural sound cannot be reproduced for the voiced duration by the conventional pitch interpolation multi-pulse system because of the incomplete reproduction of the transient and the nasal portions.
Moreover, the sound source signals are represented by a combination of the excitation pulses and the noise signals for the unvoiced duration in the above-mentioned system, as described before. It has been known that a sound source of a fricative is also represented by a noise signal during a consonant appearing for the voiced duration. This means that it is difficult to reproduce a synthesized sound signal of a high quality when the speech signals are classified into only two kinds of sounds, namely voiced and unvoiced sounds.
It is mentioned here that the spectrum parameter for a spectrum envelope is generally calculated in an encoder device by analyzing the speech signals by the use of a linear prediction coding (LPC) technique and is used in a decoder device to form a synthesis filter. Thus, the synthesis filter is formed by the spectrum parameter derived by the use of the linear prediction coding technique and has a filter characteristic determined by the spectrum envelope. However, it has been pointed out that, when female speech sounds, in particular "i" and "u", are analyzed by the linear prediction coding technique, the analysis is adversely influenced by the fundamental wave and the harmonic waves of the pitch frequency. Accordingly, the synthesis filter has a band width which is much narrower than a practical band width determined by a spectrum envelope of practical speech signals. Particularly, the band width of the synthesis filter becomes extremely narrow in a frequency band which corresponds to a first formant frequency band. As a result, no periodicity of a pitch appears in a reproduced sound source signal. Therefore, the speech quality of the synthesized sound signal is unfavorably degraded when the sound source signals are represented by the excitation pulses extracted by the use of the interpolation technique on the assumption of the periodicity of the sound source.
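The text does not say which linear prediction method is used. A common choice, assumed here purely for illustration, is the autocorrelation method solved with the Levinson-Durbin recursion. The helper pole_bandwidth_hz applies the standard approximation B = -(fs / pi) ln r for a filter pole of radius r, which shows how a pole close to the unit circle yields the kind of narrow band width described above. All function names and the 8 kHz sampling rate are hypothetical.

```python
import math


def autocorrelation(signal, max_lag):
    """Autocorrelation of a windowed frame for lags 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve for predictor coefficients a_1..a_order such that
    s[n] is predicted as a_1 s[n-1] + ... + a_order s[n-order].
    Returns (coefficients, residual prediction error)."""
    if r[0] <= 0:
        raise ValueError("zero-energy frame")
    a = [0.0] * (order + 1)  # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[k] * r[i - k] for k in range(1, i))
        reflection = acc / err
        updated = a[:]
        updated[i] = reflection
        for k in range(1, i):
            updated[k] = a[k] - reflection * a[i - k]
        a = updated
        err *= 1.0 - reflection * reflection
    return a[1:], err


def pole_bandwidth_hz(radius, sample_rate_hz):
    """Approximate 3 dB band width of the resonance of a pole at this radius."""
    return -math.log(radius) * sample_rate_hz / math.pi


# Autocorrelation values of an ideal first-order process with coefficient 0.5.
coeffs, residual = levinson_durbin([1.0, 0.5, 0.25], order=2)
narrow = pole_bandwidth_hz(0.99, 8000)  # pole near the unit circle
wide = pole_bandwidth_hz(0.95, 8000)
```

When the analysis locks onto a pitch harmonic instead of the formant envelope, the corresponding pole moves toward the unit circle, and the band width computed this way shrinks accordingly.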