1. Field of the Invention
The present invention relates to a signal encoding system for encoding digital signals such as voice or sound signals with a high efficiency and a signal decoding system for decoding these encoded signals.
2. Description of the Prior Art
In signal encoding, which compresses voice or sound signals into a smaller amount of information, it is normal practice to select codes so that a preset distortion measure is minimized. It is desirable that this distortion measure match the human auditory sense. When a voice signal to be encoded has a noise signal superimposed on it, it is desirable to use a system capable of suppressing the noise component.
It is known that the human auditory system has a non-linear frequency response, with finer frequency discrimination at lower frequencies and coarser discrimination at higher frequencies. The bandwidth that characterizes this discrimination is called the critical band width, and the frequency scale that reflects this response is called the bark scale.
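As an illustration only, the non-linear frequency axis can be sketched with the commonly cited Zwicker-Terhardt approximation of the bark scale. The formula and the function name hz_to_bark are assumptions introduced here and are not taken from the cited references.

```python
import math

def hz_to_bark(freq_hz: float) -> float:
    """Approximate critical-band rate in bark for a frequency in Hz.

    Assumption: the Zwicker-Terhardt closed-form approximation is used.
    """
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

# The same 100 Hz step covers more of the bark scale at low frequencies
# than at high frequencies, i.e. discrimination is finer at low frequencies.
low_step = hz_to_bark(200.0) - hz_to_bark(100.0)
high_step = hz_to_bark(8100.0) - hz_to_bark(8000.0)
print(low_step, high_step)
```

Under this approximation, a step of 100 Hz near 100 Hz spans roughly one bark, while the same step near 8 kHz spans only a small fraction of a bark.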
It is also known that the human auditory system has a certain sensitivity to the level of sound, that is, loudness, which is not linearly proportional to the signal power. The signal powers that provide equal loudness differ slightly from one another depending on the frequency. If the signal power is relatively large, the loudness is approximately calculated as a power function of the signal power (the signal power raised to a fixed exponent) multiplied by a coefficient that differs slightly for each frequency.
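The loudness approximation described above may be sketched as follows. This is a minimal illustration that assumes a Stevens-type power law with an exponent of about 0.3 for signal power; the exponent, the function name loudness, and the per-frequency coefficients in BAND_COEFFICIENTS are hypothetical values chosen for the example, not values from the cited references.

```python
# Hypothetical per-frequency coefficients (Hz -> coefficient); illustrative only.
BAND_COEFFICIENTS = {250: 0.8, 1000: 1.0, 4000: 1.1}

def loudness(power: float, coefficient: float = 1.0, exponent: float = 0.3) -> float:
    """Approximate loudness as coefficient * power ** exponent.

    Assumption: an exponent of 0.3 makes a tenfold increase in power
    (10 dB) roughly double the loudness.
    """
    return coefficient * power ** exponent

for freq_hz, coeff in BAND_COEFFICIENTS.items():
    print(freq_hz, loudness(100.0, coeff))
```

With this exponent, loudness is clearly not proportional to signal power: increasing the power tenfold only about doubles the computed loudness.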
It is further known that one characteristic of the human auditory system is the masking effect. In the masking effect, the presence of a disturbing sound raises the minimum audible level at which other signals can be perceived. The magnitude of the masking effect increases as the frequency of the signal approaches the frequency of the disturbing sound, and varies depending on the frequency difference measured along the bark scale.
The details of such characteristics and their modeling in the human auditory system are described in Eberhard Zwicker, "Psychologic Acoustics", pp. 161-174, translated by YAMADA Yukiko and published by HISHIMURA SHOTEN, 1992.
Some signal encoding systems using a distortion scale that matches these auditory characteristics well are described, for example, in Japanese Patent Laid-Open Nos. Hei 4-55899, Hei 5-268098 and Hei 5-158495.
Japanese Patent Laid-Open No. Hei 4-55899 introduces a distortion measure that is well matched to these auditory characteristics when the spectral parameters of voice signals are encoded. The spectral envelope of the voice signals is first approximated by an all pole model, and certain parameters are then extracted as spectral parameters. The spectral parameters are subjected to a non-linear transform, such as conversion into mel-scale, and then encoded using a square-law distance (squared error) as the distortion scale. The non-linearity of the frequency response of the human auditory system is thus introduced by the conversion to the mel-scale.
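The idea of measuring a square-law distance after a mel-scale conversion can be sketched as follows. The sketch assumes that the spectral parameters are frequency-like values in Hz and uses the widely known formula mel = 2595 log10(1 + f/700); both the parameter representation and the function names are assumptions for illustration and are not taken from the cited publication.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mel using a common closed-form approximation."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_squared_distance(params_a_hz, params_b_hz) -> float:
    """Square-law distance between two parameter sets after mel conversion."""
    return sum((hz_to_mel(a) - hz_to_mel(b)) ** 2
               for a, b in zip(params_a_hz, params_b_hz))

# The same 100 Hz error is penalized more heavily at low frequencies.
print(mel_squared_distance([300.0], [400.0]))
print(mel_squared_distance([3000.0], [3100.0]))
```

In this sketch an error of 100 Hz around 300 Hz produces a much larger distance than the same error around 3 kHz, which is how the conversion reflects the non-linear frequency response.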
Japanese Patent Laid-Open No. Hei 5-268098 introduces a bark scale for the case where the spectral shape of the voice signals is substantially removed through short- and long-term prediction and the residual signals are then encoded. The residual signals are converted into the frequency domain. All the frequency components thus obtained are divided into a plurality of groups, each of which is represented only by a grouped amplitude, and the grouped amplitudes are spaced at regular intervals on the bark scale. These grouped amplitudes are finally encoded. Introducing grouped amplitudes has the advantage that the frequency axis is approximately converted into a bark scale, which improves the match between the distortion incurred in encoding the grouped amplitudes and the auditory characteristics.
Japanese Patent Laid-Open No. Hei 5-158495 executes a plurality of voice encodings through auditory weighting filters having different characteristics and selects the auditory weighting filter that provides the minimum sense of noise. One method of evaluating the sense of noise is described, which calculates the error between an input voice signal and a synthesized signal and determines the loudness of that error relative to the input voice signal, that is, the noise loudness. The calculation of loudness also uses the critical band width and the masking effect.
Another method of using a distortion scale well matched to the auditory characteristics is disclosed in S. Wang, A. Sekey and A. Gersho, "Auditory Distortion Measure for Speech Coding" (Proc. ICASSP '91, pp. 493-496, May 1991).
The S. Wang et al. method uses a parameter called a bark spectrum, which is obtained by integrating the amplitude of the frequency spectrum over each critical band, applying pre-emphasis for equal-loudness compensation, and converting the result into loudness in sones. The bark spectra of the input voice signal and of the synthesized signal are calculated, and the simple squared error between these two bark spectra is used to evaluate the distortion between the input voice and synthesized signals. The critical-band integration models the non-linearity of the frequency axis in the auditory characteristics as well as the masking effect. The pre-emphasis and sone conversion model the characteristics relating to loudness in the auditory characteristics.
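The three steps named above (critical-band integration, equal-loudness pre-emphasis, sone conversion) can be sketched roughly as follows. This is a simplified illustration under several assumptions that are not from the cited paper: rectangular, non-overlapping one-bark bands stand in for the critical-band integration, the pre-emphasis is a hypothetical per-band gain list that defaults to 1.0, the band level in dB is used as a stand-in for loudness level in phons, and all function names are invented for the example.

```python
import math

def hz_to_bark(freq_hz):
    # Zwicker-Terhardt approximation of the bark scale (assumed here).
    return (13.0 * math.atan(0.00076 * freq_hz)
            + 3.5 * math.atan((freq_hz / 7500.0) ** 2))

def bark_spectrum(power_spectrum, sample_rate_hz, band_gains=None):
    """Map a power spectrum (bins from 0 Hz to Nyquist) to per-band loudness."""
    nyquist = sample_rate_hz / 2.0
    num_bands = int(hz_to_bark(nyquist)) + 1
    bin_step = nyquist / (len(power_spectrum) - 1)
    # Step 1: sum the power of all bins falling into each one-bark band.
    band_power = [0.0] * num_bands
    for k, power in enumerate(power_spectrum):
        band = min(int(hz_to_bark(k * bin_step)), num_bands - 1)
        band_power[band] += power
    # Step 2: equal-loudness pre-emphasis (hypothetical gains, default 1.0).
    if band_gains is None:
        band_gains = [1.0] * num_bands
    # Step 3: sone conversion, using the band level in dB in place of phons.
    loudness = []
    for gain, power in zip(band_gains, band_power):
        level_db = 10.0 * math.log10(gain * power + 1e-12)
        loudness.append(2.0 ** ((level_db - 40.0) / 10.0))
    return loudness

def bark_spectrum_distortion(spectrum_a, spectrum_b, sample_rate_hz):
    """Simple squared error between the two bark spectra."""
    a = bark_spectrum(spectrum_a, sample_rate_hz)
    b = bark_spectrum(spectrum_b, sample_rate_hz)
    return sum((x - y) ** 2 for x, y in zip(a, b))
```

For example, calling bark_spectrum_distortion on the power spectra of an input frame and a synthesized frame sampled at 8 kHz returns zero when the two spectra are identical and a positive value otherwise.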
A method of suppressing noise superimposed on voice signals is also known from S. F. Boll, "Suppression of Acoustic Noise in Speech Using Spectral Subtraction" (IEEE Trans. on Acoustics, Speech and Signal Processing, Vol. ASSP-27, No. 2, pp. 113-120, April 1979).
The S. F. Boll method estimates the spectral shape of the noise from non-speech sections and subtracts it from the spectra of all sections to suppress the noise components, in the following manner.
First, the input signal is cut into segments at regular time intervals with a Hanning window, and each segment is converted into a frequency spectrum through the Fast Fourier Transform (FFT). The power of each frequency spectral component is then calculated to determine a power spectrum. The power spectra determined in sections judged to be non-speech sections are averaged to estimate an average power spectrum of the noise. The power spectrum of the noise, multiplied by a given gain, is then subtracted from the power spectra of all the sections. However, the subtraction may instead leave fluctuating noise components that increase the sense of noise. Therefore, components reduced to very small values by the subtraction are leveled toward the post-subtraction values in the previous and next sections. A time-domain signal is then recovered by applying the inverse FFT to a frequency spectrum whose phase spectrum equals that of the frequency spectrum of the input signal and whose power spectrum equals the power spectrum after the leveling step. Finally, the output signal is reconstructed from the resulting segments over the given time intervals.
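The subtraction and leveling steps described above can be sketched on precomputed power spectra as follows. This is a minimal illustration, not the published algorithm: windowing, the FFT, the inverse FFT and the final reconstruction are omitted, the speech/non-speech decision is supplied by the caller, negative results are clamped to zero, and the leveling rule (taking the minimum over the previous, current and next sections for components that fall below the noise estimate) is an assumption. The function names are hypothetical.

```python
def estimate_noise_power(frames, is_speech):
    """Average the power spectra of the sections judged to be non-speech."""
    noise_frames = [f for f, speech in zip(frames, is_speech) if not speech]
    num_bins = len(frames[0])
    return [sum(f[k] for f in noise_frames) / len(noise_frames)
            for k in range(num_bins)]

def spectral_subtraction(frames, is_speech, gain=1.0):
    """Subtract gain * noise power from every section, then level small values."""
    noise = estimate_noise_power(frames, is_speech)
    # Subtraction over all sections, clamped at zero (assumed floor).
    subtracted = [[max(p - gain * n, 0.0) for p, n in zip(frame, noise)]
                  for frame in frames]
    leveled = []
    last = len(subtracted) - 1
    for t, frame in enumerate(subtracted):
        prev_frame = subtracted[max(t - 1, 0)]
        next_frame = subtracted[min(t + 1, last)]
        # Assumed leveling rule: a component below the noise estimate is
        # replaced by the minimum over the neighboring sections.
        leveled.append([
            min(prev_frame[k], value, next_frame[k]) if value < noise[k] else value
            for k, value in enumerate(frame)
        ])
    return leveled

frames = [[1.0, 1.0], [1.2, 0.8], [10.0, 5.0], [1.0, 1.0]]
is_speech = [False, False, True, False]
print(spectral_subtraction(frames, is_speech))
```

In this toy input the third section is the only speech section; its components stay well above the noise estimate and are kept, while the small residual components in the noise-only sections are leveled down.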
However, the methods of the prior art have the following problems:
In Japanese Patent Laid-Open No. Hei 4-55899, the spectral envelope of voice signals is approximated by the all pole model, which is based on the voice signal generating mechanism. The optimum parameter order of the all pole model depends on the vowel, consonant and/or speaker, so a good approximation is not necessarily obtained. To address this problem, a system that estimates and determines the optimum parameter order has been proposed, but it is rarely used because of its complicated analysis and synthesis. Voice signals with background or other noise superimposed raise another problem in that they cannot be approximated well by the all pole model. This method cannot overcome these problems, since it merely applies a non-linear conversion to the parameters based on the all pole model so that the frequency scale matches the auditory characteristics. Since other factors of the auditory characteristics, such as loudness and the masking effect, are not included, the resulting parameters do not sufficiently match the auditory characteristics. Furthermore, this prior art method cannot be applied to encoding general sound signals in a manner matching the auditory characteristics, since the all pole model does not conform to general audio signals other than voice signals.
In place of the conversion into mel-scale, the parameters based on the all pole model may be temporarily converted into a frequency spectrum, which is in turn converted into a bark spectrum. In that case, the distortion scale used to encode the parameters based on the all pole model can be a bark spectrum distortion. Since such a conversion requires a very large amount of processing, however, it can be used only for vector quantization in which the conversion of all the codes has been made in advance. The all pole model has further problems which are not expected to be improved in the near future.
Japanese Patent Laid-Open No. Hei 5-268098 uses the bark scale in encoding the residual signals. The bark scale only relates to the non-linearity of the frequency axis among the auditory characteristics and does not contain the other factors, such as loudness and/or masking effect, of the auditory characteristics. Therefore, the bark scale does not sufficiently match the auditory characteristics. An auditory model becomes significant only when it is applied to signals inputted into a person's ears. When the auditory model is applied to the residual signals as in the prior art, it cannot introduce the factors of the auditory characteristics other than the non-linearity of the frequency axis.
Japanese Patent Laid-Open No. Hei 5-158495 uses the noise loudness as a distortion scale for selecting the auditory weighting filter. The noise loudness is used only to select the auditory weighting filter, and is not used as the distortion scale in encoding voice signals. In encoding, the distortion scale is instead the signal distortion after the auditory weighting filter, which, based on the all pole model, weights the distortion created by the encoding along the frequency axis so that it is hardly audible. The auditory weighting filter is thus determined empirically and does not fully use the bark scale, loudness and masking in the auditory characteristics. In addition, the auditory weighting filter does not adapt to general audio signals other than voice signals, since it is derived from the parameters of the all pole model.
To improve this prior art method, one might introduce the noise loudness as the distortion scale used in encoding. However, this would require generating decoded signals for all 2^B different codes (B: the number of bits of the codes) and calculating the noise loudness for every decoded signal. This requires a huge amount of processing and cannot be realized in practice.
The method of S. Wang et al. calculates a bark spectrum as a parameter based on an auditory model. However, its object is to evaluate various encoding systems through the bark spectrum distortions of their decoded signals, and it does not consider using the bark spectrum distortion as a distortion scale in encoding. If decoded signals could be generated for all 2^B codes (B: the number of bits of the codes) and bark spectra could be calculated for all the decoded signals, one could determine the codeword having the minimum bark spectrum distortion. However, this also requires a huge amount of processing and cannot be realized in practice.
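The cost of such an exhaustive search can be illustrated with a small sketch. The decoder and distortion function below are hypothetical placeholders (a toy scalar decoder and a squared error); in the scenario described above each candidate would instead require generating a decoded signal and computing its noise loudness or bark spectrum.

```python
def exhaustive_search(target, decode, distortion, num_bits):
    """Try every one of the 2**num_bits codes and keep the best one."""
    best_code, best_cost = None, float("inf")
    for code in range(2 ** num_bits):
        # One decode plus one distortion evaluation per candidate code.
        cost = distortion(target, decode(code))
        if cost < best_cost:
            best_code, best_cost = code, cost
    return best_code, best_cost

# Toy example: 4-bit codes decoded to evenly spaced values in [0, 1].
code, cost = exhaustive_search(
    0.2,
    decode=lambda c: c / 15,
    distortion=lambda a, b: (a - b) ** 2,
    num_bits=4,
)
print(code, cost)

# Number of candidate evaluations for a few code lengths.
for bits in (8, 16, 24):
    print(bits, 2 ** bits)
```

The loop body runs 2^B times, so each additional bit doubles the number of decoded signals and distortion calculations, which is why the approach becomes impractical when each evaluation involves an auditory model.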
The method of S. F. Boll cuts the input voice with a Hanning window at regular time intervals in order to suppress noise. Because of the FFT, the length of the Hanning window and the time interval are powers of two. Although a voice encoding system also cuts the input voice at regular time intervals, its time interval is not necessarily equal to that of the noise suppression. The voice therefore has to be encoded independently after the noise suppression has been completed, which requires a large amount of processing and memory as well as complicated buffering of signals. Even if these time intervals coincide, additional calculation and memory at least proportional to the number of points in the FFT (256, 512, 1024, etc.) are required.
Although the method of S. F. Boll reduces the noise components through the subtraction of noise, the resulting fluctuations can actually increase the auditory sense of noise. To mitigate this problem, the S. F. Boll method simply levels the spectra, which is insufficient for certain forms of noise.