The present invention relates to a method and apparatus for the analysis and synthesis of voice signals.
Known in the art is a band-division type voice analysis and synthesis system (i.e., a sub-band coding system, hereinafter referred to as an "SBC system"), which is described in the Bell System Technical Journal, 55 [8], 1976-10, USA. This SBC system divides the frequency band of voice signals into several sub-bands (normally, 4 to 8) of the type shown in FIG. 4 (where these sub-bands are designated by reference numerals 1, 2, 3 and 4), and the output of each sub-band channel is then separately coded and decoded.
A basic configuration of the SBC system is shown in the block diagram of FIG. 5 while FIGS. 6A to 6E explain the operation of various circuits. The SBC system will be further described with reference to the above-mentioned FIGS. 5 and 6A to 6E.
First, the operation of an analyzer will be considered. An analog voice signal obtained from a microphone (not shown) or a similar source is passed through a low-pass filter (not shown) that filters out the frequency components exceeding 1/2 of a predetermined sampling frequency. The signal is then converted by an A/D converter (not shown) from analog form into a digital signal $S(n)$ at the predetermined sampling frequency, where $n$ is the sample number. This digitized input signal $S(n)$ is supplied to a band-pass filter 50, which extracts the specific band component ($W_{1k}$ to $W_{2k}$) shown in FIG. 6A. The output signal of band-pass filter 50 is subjected to cosine modulation by being multiplied in a multiplier 51 by a cosine wave of frequency $W_{1k}$, shown in FIG. 6B. The signal is thereby shifted to the basic band (0 to $W_k$) shown in FIG. 6C. The unwanted frequency components $R_k(\omega)$ which are formed in this process and exceed $2W_{1k}$ (e.g., the components shown by broken lines in FIG. 6C) are removed by passing the signal through a low-pass filter 52. Because the signal $r_k(n)$ obtained after filter 52 should contain only components below $W_k$, sampling at a frequency of $2W_k$ yields the necessary and sufficient information. Therefore, where necessary, decimation is performed by a decimator 53, which drops the high sampling frequency to the rate $2W_k$ (a high sampling frequency may be required, e.g., in the case of low-pass translation). The decimated signals are coded by a coder 54, and the coded signals are transmitted to a synthesizer.
Because the synthesizer processes the signals in the reverse order of the analyzer, the signals obtained from the analyzer are first decoded. More specifically, after the coded signals are decoded by a decoder 55, interpolation is performed by an interpolator 56 to return the decimated signals to their initial sampling frequency. The output signals of interpolator 56 are demodulated by being multiplied in a multiplier 57 by a cosine wave of frequency $W_{1k}$, shown in FIG. 6D, and are returned from the basic band (0 to $W_k$) to the initial frequency band ($W_{1k}$ to $W_{2k}$), as shown in FIG. 6E. Then all components of the signal other than those in the frequency band ($W_{1k}$ to $W_{2k}$) are removed by passing the signal through a band-pass filter 58.
The output from the synthesizer comprises signal $S_k(n)$.
The above-described chain of operations is performed for each sub-band (channel), and finally the outputs of all of the channels are summed into an output voice signal.
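The single-channel chain of FIG. 5 can be sketched roughly as follows. This is only an illustration under stated assumptions, not the circuit of the reference: the filters are Hamming-windowed sinc FIR designs, the sub-band width is taken as the difference between the upper and lower band edges, the ratio of the sampling frequency to twice that width is assumed to be an integer, a gain of 2 is applied at each multiplier to compensate for the halving caused by cosine modulation, and the coder 54 and decoder 55 are omitted. All function names are hypothetical.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, numtaps=201):
    """Hamming-windowed sinc low-pass FIR, normalized to unity gain at DC."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs * n) * np.hamming(numtaps)
    return h / h.sum()

def bandpass_fir(f1, f2, fs, numtaps=201):
    """Band-pass FIR formed as the difference of two low-pass filters."""
    return lowpass_fir(f2, fs, numtaps) - lowpass_fir(f1, fs, numtaps)

def fir_filter(x, h):
    """Apply an odd-length symmetric FIR; mode='same' compensates its delay.
    Assumes x is longer than h."""
    return np.convolve(x, h, mode="same")

def analyze_band(s, fs, f1, f2):
    """Band-pass 50 -> multiplier 51 -> low-pass 52 -> decimator 53."""
    wk = f2 - f1                    # assumed sub-band width
    m = int(round(fs / (2 * wk)))   # decimation factor down to the rate 2*wk
    n = np.arange(len(s))
    band = fir_filter(s, bandpass_fir(f1, f2, fs))
    # Cosine modulation at the lower band edge; the gain of 2 is an assumption.
    shifted = 2.0 * band * np.cos(2 * np.pi * f1 * n / fs)
    r = fir_filter(shifted, lowpass_fir(wk, fs))
    return r[::m], m

def synthesize_band(dec, m, n_out, fs, f1, f2):
    """Interpolator 56 -> multiplier 57 -> band-pass 58."""
    wk = f2 - f1
    up = np.zeros(n_out)
    up[::m] = dec                   # zero insertion
    r = m * fir_filter(up, lowpass_fir(wk, fs))  # remove interpolation images
    n = np.arange(n_out)
    shifted = 2.0 * r * np.cos(2 * np.pi * f1 * n / fs)
    return fir_filter(shifted, bandpass_fir(f1, f2, fs))
```

With an 8 kHz input and a hypothetical 1000 to 2000 Hz sub-band, `analyze_band` keeps one sample in four, and `synthesize_band` restores a mid-band tone closely, apart from the filter transients at the ends of the signal.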
A modification of the SBC system is shown in FIG. 7. This system in general is similar to that of FIG. 5, but in order to reduce the number of circuits, it is realized without band-pass filters 50 and 58.
The circuit shown in FIG. 7 operates in the following manner:
In an analyzer, a digitized input signal $S(n)$ is modulated by a complex signal $e^{j\omega_k n}$ [where $\omega_k = (W_{1k} + W_{2k})/2$]. This complex modulation is performed in a multiplier 61a by cosine modulation (modulation wave $\cos\omega_k n$) and in a multiplier 61b by sine modulation (modulation wave $\sin\omega_k n$). The output signals of multipliers 61a and 61b are filtered through low-pass filters 62a and 62b with bandwidths (0 to $\omega_k/2$). The resulting signal from low-pass filter 62a corresponds to the real part $a_k(n)$ of the complex signal $a_k(n) + jb_k(n)$, and the resulting signal from low-pass filter 62b corresponds to the imaginary part $b_k(n)$. The signals $a_k(n)$ and $b_k(n)$ are decimated to frequency $W_k$ by decimators 63a and 63b, respectively, coded by a coder 64, and transmitted to a synthesizer. In the synthesizer, the coded signals are decoded by a decoder 65, returned to their initial sampling frequency by interpolators 66a and 66b, and then filtered by low-pass filters 67a and 67b having a (0 to $\omega_k/2$) bandwidth. The signals are then demodulated in a multiplier 68a by being multiplied by the cosine wave, and in a multiplier 68b by the sine wave. The cosine components and sine components of the signals are added to each other in an adder 69, and the signal of the above-mentioned sub-band is thus synthesized.
The above-described processing is repeated for each sub-band (channel). Finally, the output signals of all channels are summed, and output voice signals are obtained.
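The FIG. 7 arrangement for one channel can be sketched in the same spirit. The assumptions here are not taken from the reference: the low-pass cutoff is set to half the sub-band width, the filters are Hamming-windowed sinc FIR designs, the ratio of the sampling frequency to the sub-band width is an integer, the output is scaled by 2 to undo the halving caused by modulation, and coder 64 and decoder 65 are omitted. The function names are hypothetical.

```python
import numpy as np

def lowpass_fir(cutoff_hz, fs, numtaps=201):
    """Hamming-windowed sinc low-pass FIR, normalized to unity gain at DC."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    h = np.sinc(2.0 * cutoff_hz / fs * n) * np.hamming(numtaps)
    return h / h.sum()

def fir_filter(x, h):
    """Apply an odd-length symmetric FIR; mode='same' compensates its delay.
    Assumes x is longer than h."""
    return np.convolve(x, h, mode="same")

def analyze_complex(s, fs, f1, f2):
    """Multipliers 61a/61b -> low-pass 62a/62b -> decimators 63a/63b."""
    fc = 0.5 * (f1 + f2)          # centre frequency of the sub-band
    wk = f2 - f1                  # assumed sub-band width
    m = int(round(fs / wk))       # decimation factor down to the rate wk
    n = np.arange(len(s))
    h = lowpass_fir(wk / 2, fs)   # assumed cutoff: half the sub-band width
    a = fir_filter(s * np.cos(2 * np.pi * fc * n / fs), h)  # cosine branch
    b = fir_filter(s * np.sin(2 * np.pi * fc * n / fs), h)  # sine branch
    return a[::m], b[::m], m

def synthesize_complex(a_dec, b_dec, m, n_out, fs, f1, f2):
    """Interpolators 66a/66b -> low-pass 67a/67b -> multipliers 68a/68b -> adder 69."""
    fc = 0.5 * (f1 + f2)
    wk = f2 - f1
    h = lowpass_fir(wk / 2, fs)
    n = np.arange(n_out)
    branches = []
    for dec in (a_dec, b_dec):
        up = np.zeros(n_out)
        up[::m] = dec             # zero insertion
        branches.append(m * fir_filter(up, h))
    a, b = branches
    # The gain of 2 compensates the factor 1/2 introduced by modulation.
    return 2.0 * (a * np.cos(2 * np.pi * fc * n / fs)
                  + b * np.sin(2 * np.pi * fc * n / fs))
```

No band-pass filter appears in either function: the two low-pass branches alone select the sub-band, which is the circuit saving the modification aims at.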
As compared to a system coding a voice signal itself, the SBC system, which operates on the above principle, has the following advantages:
The quantization error of each channel is similar to white noise and spreads over the entire width of the frequency spectrum, but because the noise outside of each individual channel does not fall in the particular channel, the quantization noise can be reduced. Furthermore, the quantization error of each channel is related only to signals within the frequency band of this particular channel, and in such signals as voice, with strong low-frequency components and weak high-frequency components, the errors in the channels of the high-frequency bands are extremely small as compared to the signal as a whole. In addition, the high-frequency components of the voice signal are mainly components of the noise, and the error in this band only slightly affects hearing.
By setting an appropriate division of the speech spectrum and appropriate quantization bit numbers for the signals of the respective channels, it becomes possible to reduce the required quantity of information to about one half, as compared to a system based on direct coding of the voice signals. For example, in the case of PCM voice signals sampled at 8 kHz, direct coding, e.g., ADPCM coding, requires a quantity of information corresponding approximately to 30 kb/s, whereas in the SBC system, synthesized sound of almost the same perceived quality can be obtained at about 16 kb/s.
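The arithmetic behind such figures can be illustrated with a small calculation. The band layout and bit allocation below are purely hypothetical and are not taken from the reference; the sketch assumes each sub-band of a given width is sampled at twice that width, and that direct coding uses 4 bits per sample.

```python
def sbc_bit_rate(bands):
    """Total rate in bit/s for (width_hz, bits_per_sample) sub-bands,
    each assumed to be sampled at twice its width."""
    return sum(2 * width_hz * bits for width_hz, bits in bands)

def direct_bit_rate(fs_hz, bits_per_sample):
    """Rate in bit/s when every input sample is coded directly."""
    return fs_hz * bits_per_sample

# Hypothetical layout: four 1000 Hz sub-bands, more bits for the low bands.
example_bands = [(1000, 3), (1000, 2), (1000, 2), (1000, 1)]
sbc_rate = sbc_bit_rate(example_bands)      # 2000 * (3 + 2 + 2 + 1)
direct_rate = direct_bit_rate(8000, 4)      # 8000 * 4
```

Under these assumed numbers the sub-band total comes to 16 kb/s against 32 kb/s for direct coding, which is in line with the roughly one-half reduction described above.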
It is desired that sound of high quality be synthesized using a smaller amount of information. Because in general the SBC system is basically a wave-form coding system, information compression in this system is limited to 10 kb/s. As the quantization bit number in this range appears to be insufficient, "roughness" of the synthesized sound is noticeable because of quantization error, or the quality of the sound is lowered because of insufficiency of the band.
As is well known, however, conventional telephone voice signals contain a considerable quantity of silence intervals. These are, of course, pauses between turns in conversation, respiration pauses during continuous speech, and the closure intervals which accompany plosive (bursting) sounds. In total, the silence signals occupy about 20% of the time, and this time, which carries no information, is processed in the same manner as the voice intervals which do carry information. In addition, in systems with sub-bands, such as SBC systems, some channels may have significant amplitude while others have almost none. The human ear distinguishes sounds by the position and magnitude of a peak (formant) in the spectrum of the voice. Those parts which lie in the "valley" portions of the spectrum carry information of relatively low importance. Furthermore, it often happens that sounds with a low voice signal level are almost below the noise level. From a practical viewpoint, these portions can also be treated as silence signals, almost without any loss of the phonetic properties of the speech. In voice analysis and synthesis systems which do not subdivide the frequency band into sub-bands, however, silence compression relies on a sound/silence judgment made over the entire band. With a high slice level for this judgment, low-power sound signals such as fricative sounds can be taken for silence signals and lost; with a low slice level, pure noise intervals can be taken for sound, and effective compression of information cannot be achieved.
Because, unlike the noise spectrum, the spectrum of the voice has specific deviations characteristic of the phonetic (vocal) properties of the voice sounds, it is possible to subdivide the voice signals into several sub-bands and to make a silence judgment in each separate sub-band. With such an arrangement, even when the voice power is low over the entire band, the components of the sub-band in which the power is concentrated are preserved, while the remaining information of the bands containing only noise components is removed. As a result, the phonetic properties of the voice are preserved, while effective information compression is achieved.
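A per-sub-band silence judgment of this kind might be sketched as follows. This is only an illustration under assumptions that do not come from the text: the power of each sub-band is estimated from an FFT of one frame rather than from filter-bank outputs, a single slice level is shared by all sub-bands, and the band edges, frame length and slice level are hypothetical values.

```python
import numpy as np

def band_powers(frame, fs, edges):
    """Mean-square power of one frame in each sub-band defined by edges (Hz)."""
    n = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2 / n**2
    power[1:] *= 2.0              # one-sided spectrum: double all bins but DC
    if n % 2 == 0:
        power[-1] /= 2.0          # the Nyquist bin must not be doubled
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Map each FFT bin to a sub-band index; the top edge joins the last band.
    idx = np.clip(np.digitize(freqs, edges) - 1, 0, len(edges) - 2)
    return np.bincount(idx, weights=power, minlength=len(edges) - 1)

def judge_subbands(powers, slice_level):
    """True for sub-bands judged as sound, False for those judged as silence."""
    return powers > slice_level
```

For a weak frame whose energy lies entirely in a hypothetical 3000 to 4000 Hz sub-band, only that sub-band is judged as sound, so the other three need not be coded for that frame.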