The present invention relates to speech coding systems, and more particularly to a speech coding system used in telephone communication which is carried out in such a manner that a speech signal is converted into a compressed digital signal on the transmitting side and is reproduced from the compressed digital signal on the receiving side, and suitable for processing a speech signal which is generated in a noisy environment.
The waveform of a speech signal is given by a combination of fundamental waveform patterns, each of which appears two to ten times in a time interval of, for example, about 20 msec (hereinafter referred to as a "frame"). In conventional speech analysis-synthesis systems, the transmitting side samples an input speech signal and, at each frame, extracts from the sampled values transmission parameters indicative of the feature and the repetition period (namely, the pitch period) of a fundamental waveform pattern, and the receiving side reproduces the speech signal on the basis of the transmission parameters.
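The division of a sampled signal into frames can be illustrated by the following minimal sketch. The sampling rate of 8 kHz, the function name `split_into_frames`, and the decision to discard a trailing partial frame are assumptions made for illustration only; they are not taken from the systems described above.

```python
def split_into_frames(samples, sample_rate_hz=8000, frame_ms=20):
    """Split a sampled signal into consecutive, non-overlapping frames.

    sample_rate_hz=8000 is an assumed telephone-band sampling rate;
    frame_ms=20 follows the "about 20 msec" frame mentioned in the text.
    """
    # Number of samples per frame, e.g. 8000 * 20 / 1000 = 160.
    frame_len = sample_rate_hz * frame_ms // 1000
    # Assumption: a trailing partial frame is simply dropped.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


frames = split_into_frames(list(range(400)))
print(len(frames), len(frames[0]))
```

Under these assumptions, a fundamental waveform pattern that appears two to ten times per 20 msec frame corresponds to a pitch period of roughly 2 to 10 msec, that is, 16 to 80 samples.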
In the PARCOR (partial auto-correlation) system, which is representative of the conventional speech analysis-synthesis systems, it is judged whether each of the frames formed in analyzing a speech signal is a voiced frame or an unvoiced frame. A reproducing operation is then performed in such a manner that the output of an excitation source for generating white noise is used for the unvoiced frame, and a single pulse, which represents a fundamental waveform pattern and is generated at an interval equal to the pitch period indicated by the transmission parameters, is used for the voiced frame. Because the PARCOR system uses such a simple excitation source, it is advantageous in that a speech signal can be coded at a low bit rate, but disadvantageous in that the quality of the synthesized speech is degraded. The PARCOR system is described in, for example, an article entitled "An audio response unit based upon partial auto correlation" (IEEE Transactions on Communications, Vol. COM-20, pages 792 to 797, Aug. 1972).
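The two-state excitation described above may be sketched as follows. This is an illustrative simplification and not the implementation of the cited article: the function name `make_excitation`, the unit pulse amplitude, the unit-variance Gaussian noise, and the placement of the first pulse at the start of each frame are all assumptions.

```python
import random


def make_excitation(voiced, frame_len, pitch_period=None, rng=None):
    """Return one frame of excitation samples for a simple two-state source.

    voiced=True  -> a single unit pulse every `pitch_period` samples
    voiced=False -> white (Gaussian) noise
    """
    rng = rng or random.Random(0)  # fixed seed only to make the sketch repeatable
    if voiced:
        if not pitch_period or pitch_period <= 0:
            raise ValueError("a voiced frame needs a positive pitch period")
        # Assumption: the first pulse falls on sample 0 of the frame, so
        # pulse phase continuity between frames is ignored here.
        return [1.0 if n % pitch_period == 0 else 0.0 for n in range(frame_len)]
    return [rng.gauss(0.0, 1.0) for _ in range(frame_len)]


voiced_frame = make_excitation(True, frame_len=160, pitch_period=40)
unvoiced_frame = make_excitation(False, frame_len=160)
print([n for n, v in enumerate(voiced_frame) if v != 0.0])
```

In a complete synthesizer, this excitation would drive a synthesis filter controlled by the transmitted parameters; the filter is omitted from the sketch.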
Further, systems for improving the quality of synthesized speech by generating a plurality of pulses representative of a fundamental waveform pattern at an interval equal to the pitch period thereof have been proposed in, for example, an article entitled "A New Model of LPC Excitation for Producing Natural-Sounding Speech at Low Bit Rates" by B. S. Atal and J. R. Remde (Proc. ICASSP 82, Vol. 1, pages 614 to 617, 1982), and an article entitled "A Speech Coding Method Using Thinned-Out Residual" by A. Ichikawa et al. (Proc. ICASSP 85, Vol. 3, pages 961 to 964, 1985).
In the above systems, in order to reduce the number of bits necessary for a coding operation, the pulse train generated for one fundamental waveform pattern is made identical with the pulse train generated for another fundamental waveform pattern in the same frame, each pulse train being generated at an interval equal to the pitch period. In this case, however, information on the position of each pulse is required, and thus the number of pulses that can be generated in one pitch period of a fundamental waveform pattern is limited. Accordingly, the quality of the synthesized speech is not satisfactory.
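The reuse of one pulse train across the pitch periods of a frame may be pictured with the following sketch. It is a hypothetical illustration only and does not reproduce the coding methods of the cited articles: the function name `repeat_pulse_pattern` and the representation of a pulse train as (offset, amplitude) pairs are assumptions.

```python
def repeat_pulse_pattern(pattern, pitch_period, frame_len):
    """Build one frame of excitation by repeating a pulse pattern.

    pattern: list of (offset, amplitude) pairs, where offset is the pulse
             position in samples within a single pitch period.
    The same pattern is placed at the start of every pitch period, so only
    one set of positions and amplitudes has to be coded per frame.
    """
    excitation = [0.0] * frame_len
    for start in range(0, frame_len, pitch_period):
        for offset, amplitude in pattern:
            n = start + offset
            # Ignore pulses outside the pitch period or past the frame end.
            if 0 <= offset < pitch_period and n < frame_len:
                excitation[n] += amplitude
    return excitation


pattern = [(0, 1.0), (3, -0.5)]  # two pulses per pitch period (made-up values)
frame = repeat_pulse_pattern(pattern, pitch_period=40, frame_len=160)
print([(n, v) for n, v in enumerate(frame) if v != 0.0])
```

Each entry of `pattern` carries an explicit offset, which reflects the point made above: every additional pulse adds position information to be coded, so the number of pulses per pitch period is constrained by the bit budget.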
In order to further improve the quality of synthesized speech, a system for synthesizing a fundamental waveform pattern by using a predetermined number of pulses continuous to each other has been proposed in U.S. patent application Ser. No. 878,434 assigned to the assignee of the present invention (corresponding to JP-A-61-296398). In this case, information on the position of each pulse is not required. However, in all of the above-mentioned speech analysis-synthesis systems, no attention is paid to the influence of a noisy environment on telephone conversations, for example, the degradation in speech quality caused by environmental noise such as that from the fan of an air conditioner. According to the conventional speech analysis-synthesis systems, noise which is introduced into the systems through a telephone in a period when the speech pauses is processed in the same manner as the speech. Accordingly, a frame containing only noise is treated as a voiced frame, and transmission parameters extracted from the noise are sent to the receiving side, where a synthesized speech is formed on the basis of those transmission parameters. As a result, a synthesized sound which is different from the input noise and is offensive to the ear reaches the listener during pauses in the speech, and the listener finds it unnatural.