This invention relates to a communication system which comprises an encoder device for encoding a sequence of input digital speech signals into a set of excitation multipulses and/or a decoder device communicable with the encoder device.
As known in the art, a conventional communication system of the type described is helpful for transmitting a speech signal at a low transmission bit rate, such as 4.8 kb/s, from a transmitting end to a receiving end. The transmitting and the receiving ends comprise an encoder device and a decoder device, respectively, which are operable to encode and decode the speech signals in the manner which will presently be described in more detail. A wide variety of such systems have been proposed to improve the quality of the speech reproduced in the decoder device and to reduce the transmission bit rate.
Among others, a pitch interpolation multipulse system is known, which has been proposed in Japanese Unexamined Patent Publications Nos. Syo 61-15000 and 62-038500, namely, 15000/1986 and 038500/1987, which may be called first and second references, respectively. In this pitch interpolation multipulse system, the encoder device is supplied with a sequence of input digital speech signals at every frame of, for example, 20 milliseconds and extracts a spectrum parameter and a pitch parameter which will be called first and second primary parameters, respectively. The spectrum parameter is representative of a spectrum envelope of a speech signal specified by the input digital speech signal sequence, while the pitch parameter is representative of a pitch of the speech signal. Thereafter, the input digital speech signal sequence is classified into a voiced sound and an unvoiced sound which last for voiced and unvoiced durations, respectively. In addition, the input digital speech signal sequence is divided in every frame into a plurality of pitch durations, each of which may be referred to as a subframe. Under the circumstances, operation is carried out in the encoder device to calculate a set of excitation multipulses representative of a sound source signal specified by the input digital speech signal sequence.
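The framing and the division into pitch durations described above may be sketched as follows. This is a minimal illustration only. The 8 kHz sampling rate, the use of an autocorrelation peak as the pitch estimate, the lag search range, and the function names are all assumptions made for this sketch and are not taken from the first and second references.

```python
SAMPLE_RATE_HZ = 8000   # assumed sampling rate (not stated in the text)
FRAME_MS = 20           # frame length given in the text as an example
FRAME_LEN = SAMPLE_RATE_HZ * FRAME_MS // 1000  # 160 samples per frame


def estimate_pitch_period(frame, min_lag, max_lag):
    """Return the lag (in samples) with the largest autocorrelation.

    Assumed estimator: an unnormalized autocorrelation peak search.
    Ties are resolved in favor of the shortest lag.
    """
    best_lag = min_lag
    best_val = float("-inf")
    for lag in range(min_lag, max_lag + 1):
        val = sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
        if val > best_val:
            best_lag, best_val = lag, val
    return best_lag


def split_into_subframes(frame, pitch_period):
    """Divide one frame into consecutive subframes of one pitch period each.

    The last subframe is shorter when the frame length is not a
    multiple of the pitch period.
    """
    return [frame[i:i + pitch_period]
            for i in range(0, len(frame), pitch_period)]


# Hypothetical voiced frame: an impulse train with a 40-sample period.
frame = [1.0 if n % 40 == 0 else 0.0 for n in range(FRAME_LEN)]
period = estimate_pitch_period(frame, min_lag=20, max_lag=100)
subframes = split_into_subframes(frame, period)
```

In this sketch, one of the resulting subframes would serve as the representative duration from which the excitation multipulse set is calculated.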
More specifically, the sound source signal is represented for the voiced duration by the excitation multipulse set which is calculated with respect to a selected one of the pitch durations that may be called a representative duration. From this fact, it is understood that each set of the excitation multipulses is extracted from intermittent ones of the subframes. Subsequently, an amplitude and a location of each excitation multipulse of the set are transmitted from the transmitting end to the receiving end along with the spectrum and the pitch parameters. On the other hand, a sound source signal of a single frame is represented for the unvoiced duration by a small number of excitation multipulses and a noise signal. Thereafter, the amplitude and the location of each excitation multipulse are transmitted for the unvoiced duration together with a gain and an index of the noise signal. At any rate, the amplitudes and the locations of the excitation multipulses, the spectrum and the pitch parameters, and the gains and the indices of the noise signals are sent as a sequence of output signals from the transmitting end to the receiving end, which comprises the decoder device.
On the receiving end, the decoder device is supplied with the output signal sequence as a sequence of reception signals which carries information related to the sets of excitation multipulses extracted from the frames, as mentioned above. Consider a current set of the excitation multipulses extracted from a representative duration of a current one of the frames and a next set of the excitation multipulses extracted from a representative duration of a next one of the frames following the current frame. In this event, interpolation is carried out for the voiced duration by the use of the amplitudes and the locations of the current and the next sets of the excitation multipulses to reconstruct excitation multipulses in the remaining subframes other than the representative durations and to reproduce a sequence of driving sound source signals for each frame. On the other hand, a sequence of driving sound source signals for each frame is reproduced for the unvoiced duration by the use of the amplitudes and the locations of the excitation multipulses and the indices and the gains of the noise signals.
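The interpolation carried out for the voiced duration may be illustrated by the following minimal sketch. The text states only that the amplitudes and the locations of the current and the next sets are used. The linear weighting, the pairing of pulses by index, the rounding of locations to whole samples, and the function name are assumptions made here for illustration.

```python
def interpolate_multipulses(current, nxt, num_between):
    """Reconstruct pulse sets for subframes between two representative sets.

    current, nxt: lists of (location, amplitude) pairs, with locations
        measured from the start of the respective subframe. Pulses are
        paired by list index (an assumption).
    num_between: number of subframes to reconstruct between the two
        representative durations.
    Returns a list holding one pulse list per reconstructed subframe.
    """
    if len(current) != len(nxt):
        raise ValueError("both pulse sets must hold the same number of pulses")
    subframes = []
    for k in range(1, num_between + 1):
        w = k / (num_between + 1)  # weight of the next set, 0 < w < 1
        pulses = []
        for (loc_a, amp_a), (loc_b, amp_b) in zip(current, nxt):
            loc = round((1 - w) * loc_a + w * loc_b)   # assumed linear in location
            amp = (1 - w) * amp_a + w * amp_b          # assumed linear in amplitude
            pulses.append((loc, amp))
        subframes.append(pulses)
    return subframes


# Hypothetical pulse sets from two consecutive representative durations.
current_set = [(0, 1.0), (10, -0.5)]
next_set = [(4, 2.0), (14, -1.5)]
reconstructed = interpolate_multipulses(current_set, next_set, num_between=3)
```

Because every reconstructed subframe is a blend of only two representative sets, a change of the sound source inside the frame is smoothed over rather than reproduced.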
Thereafter, the driving sound source signals thus reproduced are given to a synthesis filter formed by the use of a spectrum parameter and are synthesized into a synthesized speech signal.
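The synthesis step may be sketched as an all-pole filter driven by the reproduced excitation. The sketch assumes the recursion s[n] = e[n] - sum over k of a[k] * s[n-k], with a[k] denoting linear prediction coefficients derived from the spectrum parameter. The sign convention, the zero initial filter state, and the function names are assumptions of this illustration.

```python
def pulses_to_excitation(pulses, length):
    """Place (location, amplitude) pulses into a zero-filled excitation buffer."""
    excitation = [0.0] * length
    for location, amplitude in pulses:
        if 0 <= location < length:
            excitation[location] += amplitude
    return excitation


def synthesize(excitation, lpc):
    """Run the excitation through an all-pole synthesis filter.

    lpc: coefficients a[1..p] of A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p
        (assumed convention); the filter realizes 1 / A(z).
    The filter state is assumed to be zero before the first sample.
    """
    out = []
    for n, e in enumerate(excitation):
        acc = e
        for k, a in enumerate(lpc, start=1):
            if n - k >= 0:
                acc -= a * out[n - k]
        out.append(acc)
    return out


# Hypothetical example: two unit pulses and a first-order filter.
driving_signal = pulses_to_excitation([(0, 1.0), (2, 1.0)], length=4)
speech = synthesize(driving_signal, lpc=[-0.5])
```

In a frame-by-frame decoder the filter state would be carried over from the preceding frame instead of being reset to zero.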
With this structure, each set of the excitation multipulses is intermittently extracted from each frame in the encoder device and is reproduced into the synthesized speech signal by an interpolation technique in the decoder device. Herein, it is to be noted that intermittent extraction of the excitation multipulses makes it difficult to reproduce the driving sound source signal in the decoder device at a transient portion at which the sound source signal changes in its characteristic. Such a transient portion appears when a vowel is changed to another vowel on concatenation of vowels in the speech signal and when a voiced sound is changed to another voiced sound. In a frame including such a transient portion, the driving sound source signals reproduced by the use of the interpolation technique are markedly different from the actual sound source signals, which results in degradation of the quality of the synthesized speech signal.
It is mentioned here that the spectrum parameter for a spectrum envelope is generally calculated in an encoder device by analyzing the input digital speech signals by the use of a linear prediction coding (LPC) technique and is used in a decoder device to form a synthesis filter. Thus, the synthesis filter is formed by the spectrum parameter derived by the use of the linear prediction coding technique and has a filter characteristic determined by the spectrum envelope. However, it has been pointed out that, when female sounds, in particular "i" and "u", are analyzed by the linear prediction coding technique, an adverse influence appears in the fundamental wave of the pitch frequency and its harmonic waves. Accordingly, the synthesis filter has a bandwidth which is narrower than the actual bandwidth determined by the spectrum envelope of actual speech signals. In particular, the bandwidth of the synthesis filter becomes extremely narrow in a frequency band which corresponds to a first formant frequency band. As a result, no periodicity of a pitch appears in the sound source signal. Therefore, the speech quality of the synthesized speech signal is unfavorably degraded when the sound source signals are represented by the excitation multipulses extracted by the use of the interpolation technique on the assumption of the periodicity of the sound source.
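For reference, one common way of carrying out linear prediction analysis is the autocorrelation method solved by the Levinson-Durbin recursion, sketched below. The text does not state which analysis method the encoder device uses, so the choice of method, the sign convention A(z) = 1 + a[1] z^-1 + ... + a[p] z^-p, and the function names are assumptions of this illustration.

```python
def autocorrelation(signal, max_lag):
    """Return autocorrelation values r[0..max_lag] of one analysis frame."""
    return [sum(signal[n] * signal[n - lag] for n in range(lag, len(signal)))
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve the LPC normal equations by the Levinson-Durbin recursion.

    r: autocorrelation values r[0..order].
    Returns (a, err) where a holds a[1..order] of the assumed
    A(z) = 1 + sum a[k] z^-k and err is the final prediction error power.
    """
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        if err <= 0.0:
            raise ValueError("prediction error power must stay positive")
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        err *= 1.0 - k * k
    return a[1:], err


# Hypothetical autocorrelation of a first-order process with pole at 0.5.
coeffs, residual_power = levinson_durbin([1.0, 0.5, 0.25], order=2)
```

With this convention, the corresponding synthesis filter is 1 / A(z), and the positions of its poles determine the formant bandwidths discussed above.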