This invention relates to an encoder for use in encoding an input signal into an encoded signal in a data transmission network. The input signal may be either a speech signal or a picture signal, although description will mainly be directed to the speech signal.
It is preferable to reduce the transmission rate in order to reduce the cost of a data transmission network, since higher rates require the network to have a larger memory capacity to handle the large number of information signals derived from an input signal. Recent demand is for a transmission rate of 16 kbits/sec rather than 32 kbits/sec.
In general, each of voiced and unvoiced sounds, such as vowels, nasals, fricatives, and the like, can be represented by a convolution between an impulse generated by a sound source and an impulse response of a vocal tract, as is well known in the art. The impulse is usually represented by the Kronecker delta and includes a pitch pulse generated in response to each voiced sound. In other words, each sound is specified by the impulse and can be reproduced by allowing the impulse to pass through a filter having an impulse response similar to that of the vocal tract.
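The convolution described above can be illustrated by a minimal sketch. The impulse response `h`, the frame length, and the pulse locations and amplitudes below are hypothetical values chosen only for illustration; they are not taken from any reference cited herein.

```python
def convolve(excitation, impulse_response):
    """Full discrete convolution of two finite sequences."""
    out = [0.0] * (len(excitation) + len(impulse_response) - 1)
    for n, e in enumerate(excitation):
        if e == 0.0:
            continue  # only non-zero impulses contribute
        for k, h_k in enumerate(impulse_response):
            out[n + k] += e * h_k
    return out


# Hypothetical, short impulse response standing in for the vocal tract.
h = [1.0, 0.5, 0.25]

# Hypothetical excitation: two impulses (Kronecker deltas scaled by an amplitude).
excitation = [0.0] * 8
excitation[1] = 2.0
excitation[5] = -1.0

# Each impulse places one scaled copy of h at its own location.
speech = convolve(excitation, h)
```

Each non-zero sample of `excitation` contributes one scaled and shifted copy of `h` to `speech`, which is why a sound can be specified by the impulses alone once the filter is known.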
A speech coder of the type described is proposed in an article contributed by Bishnu S. Atal et al of Bell Laboratories to Proc. ICASSP, 1982, pages 614-617, under the title of "A New Model of LPC Excitation for Producing Natural-sounding Speech at Low Bit Rates." According to the Atal et al article, each impulse is derived as an excitation pulse from each discrete speech signal within a frame of, for example, 20 milliseconds, formed by dividing the input signal. Pulse instants or locations of the excitation pulses and amplitudes thereof are determined by a so-called analysis-by-synthesis (A-b-S) method. It is believed that the model of Atal et al is useful in reducing the transmission rate. The model, however, requires a great amount of calculation in determining the pulse instants and the pulse amplitudes.
Meanwhile, a "voice coding system" is disclosed in U.S. patent application Ser. No. 565,804 filed Dec. 27, 1983, U.S. Pat. No. 4,716,592, by Kazunori Ozawa et al for assignment to the present assignee. The voice or speech coding system of the Ozawa et al patent is for coding a discrete speech signal sequence of the type described into an encoded signal.
In the speech coding system of the Ozawa et al patent, the amplitude and the pulse instant of each excitation pulse are determined at each frame with reference to both of an autocorrelation of an impulse response of an analyzer and a cross-correlation between the input signal and the impulse response of the analyzer.
More particularly, the input signal can be synthesized by linear combinations of impulses, such as the pitch pulses, and the impulse responses of the analyzer, respectively, when the analyzer exhibits the same impulse response as that of the vocal tract. For simplicity of description, no further distinction will be made between the impulse response of the analyzer and that of the vocal tract, on the assumption that the analyzer and the vocal tract have the same impulse response.
Under the circumstances, the cross-correlation between the input signal and the impulse response of the analyzer is specified by a sequence of scalar products of the pitch pulses and an autocorrelation of the impulse response and has a succession of peaks corresponding to the pitch pulses. In other words, the above-mentioned cross-correlation can be represented by the autocorrelation of the impulse response and the excitation pulses placed at the peaks with the amplitudes of the excitation pulses identical with those of the peaks, respectively.
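The relation stated above may be written out as follows. The symbols are introduced here only for illustration and are assumptions: s(n) denotes the input signal, h(n) the impulse response, g_i and m_i the amplitude and the location of the i-th pulse, φ(m) the cross-correlation, and R(k) the autocorrelation. Effects at the frame edges are ignored.

$$
s(n) = \sum_{i} g_i \, h(n - m_i), \qquad
R(k) = \sum_{n} h(n) \, h(n + |k|)
$$

$$
\varphi(m) = \sum_{n} s(n) \, h(n - m)
           = \sum_{i} g_i \sum_{n} h(n - m_i) \, h(n - m)
           = \sum_{i} g_i \, R(m - m_i)
$$

Since R(k) takes its maximum at k = 0, φ(m) exhibits a peak at each location m_i. When the pulses are well separated, the height of the peak at m_i is g_i R(0), which equals the pulse amplitude g_i itself if the impulse response is normalized so that R(0) = 1.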
In practice, one of the excitation pulses is determined in each frame by searching for the maximum one of the peaks. The pulse so determined is multiplied by the autocorrelation at each lag to calculate one of the products, and the calculated product is subtracted from the cross-correlation. The remaining cross-correlation is thereafter subjected to similar processing to successively determine the remaining excitation pulses.
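The successive search described above can be sketched as follows. This is a simplified illustration rather than the actual procedure of the Ozawa et al patent: the function names, the impulse response, and the test frame are hypothetical, and dividing the peak value by the zero-lag autocorrelation to obtain the amplitude is an assumption that reduces to "amplitude equals peak" when the impulse response has unit energy.

```python
def autocorrelation(h, n_lags):
    """R[k] = sum_n h[n] * h[n + k] for k = 0 .. n_lags - 1."""
    return [sum(h[n] * h[n + k] for n in range(max(0, len(h) - k)))
            for k in range(n_lags)]


def cross_correlation(frame, h):
    """phi[m] = sum_n frame[n] * h[n - m], evaluated inside the frame only."""
    return [sum(frame[m + k] * h[k] for k in range(len(h)) if m + k < len(frame))
            for m in range(len(frame))]


def search_pulses(phi, R, n_pulses):
    """Pick the largest peak, subtract its contribution, and repeat."""
    phi = list(phi)  # work on a copy; this is the "remaining" cross-correlation
    pulses = []
    for _ in range(n_pulses):
        m = max(range(len(phi)), key=lambda i: abs(phi[i]))  # maximum peak
        g = phi[m] / R[0]  # assumed normalization by the zero-lag autocorrelation
        pulses.append((m, g))
        for i in range(len(phi)):
            phi[i] -= g * R[abs(i - m)]  # remove this pulse's product
    return pulses


# Hypothetical frame: two pulses passed through a short impulse response.
h = [1.0, 0.5, 0.25]
frame = [0.0] * 16
for location, amplitude in [(2, 2.0), (9, -1.0)]:
    for k, h_k in enumerate(h):
        frame[location + k] += amplitude * h_k

R = autocorrelation(h, len(frame))
phi = cross_correlation(frame, h)
pulses = search_pulses(phi, R, n_pulses=2)
```

Each pulse costs one maximum search and one subtraction over the frame, with no re-synthesis of the signal, which is the source of the reduced amount of calculation compared with an analysis-by-synthesis search.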
With the system according to the Ozawa et al patent, instants of the respective excitation pulses and amplitudes thereof are determined or calculated with a drastically reduced amount of calculation. The system is, however, not adequate for encoding actual original speech signals because no consideration is given to interaction between two adjacent frames.
More particularly, the actual original speech signals run continuously through a plurality of frames. This means that any one of the pitch pulses may be produced at an end of a current one of the frames, wherein the current frame is succeeded by a following one of the frames. In this event, an impulse response which results from the pitch pulse remains largely within the following frame as a remnant impulse response. Inasmuch as the excitation pulses are determined and calculated at every frame in the speech coding system mentioned above, the remnant impulse response may cause undesired excitation pulses to occur in the following frame. Accordingly, such undesired excitation pulses may be added to desired excitation pulses in the following frame.
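The effect of the remnant impulse response can be illustrated by a small sketch. All values are hypothetical: a single pulse is placed at the last sample of the current frame, and the following frame, which contains no pulse of its own, is then searched in isolation. Normalizing the peak by the energy of the impulse response is an assumption made for the illustration.

```python
# Hypothetical impulse response and frame length.
h = [1.0, 0.5, 0.25]
frame_len = 8

# One pulse at the very end of the current frame (sample 7 of 0..7).
signal = [0.0] * (2 * frame_len)
pulse_at, gain = frame_len - 1, 2.0
for k, h_k in enumerate(h):
    signal[pulse_at + k] += gain * h_k

current = signal[:frame_len]
following = signal[frame_len:]  # holds only the remnant impulse response


def cross_correlation(frame, h):
    """phi[m] = sum_n frame[n] * h[n - m], evaluated inside the frame only."""
    return [sum(frame[m + k] * h[k] for k in range(len(h)) if m + k < len(frame))
            for m in range(len(frame))]


# Searching the following frame by itself finds a peak caused by the remnant.
phi = cross_correlation(following, h)
spurious_at = max(range(frame_len), key=lambda m: abs(phi[m]))
spurious_gain = phi[spurious_at] / sum(x * x for x in h)  # assumed normalization
```

Although no pulse was generated in the following frame, the search reports a non-zero pulse at its first sample, which corresponds to the undesired excitation pulse mentioned above.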
Inasmuch as the remnant impulse response usually lasts for a significant portion of each frame, the quality of the reproduced voice or speech is inevitably degraded by the occurrence of the undesired excitation pulses.