There have been proposed a variety of audio signal compression methods of this type and, hereinafter, one example of those methods will be described.
Initially, a time series of an input audio signal is transformed into a frequency characteristic signal sequence for each length of a specific period (frame) by MDCT (modified discrete cosine transform), FFT (fast Fourier transform) or the like. Further, the input audio signal is subjected to linear predictive analysis (LPC analysis), frame by frame, to extract LPC coefficients (linear predictive coefficients), LSP coefficients (line spectrum pair coefficients), PARCOR coefficients (partial auto-correlation coefficients) or the like, and an LPC spectrum envelope is obtained from these coefficients. Next, the frequency characteristic is flattened by dividing the calculated frequency characteristic signal sequence by the LPC spectrum envelope, thereby normalizing it, and then the power is normalized using the maximum value or the mean value of the power.
In the following description, output coefficients at the power normalization are called "residual signals". Further, the flattened residual signals are vector-quantized using the spectrum envelope as a weight.
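The flattening and power normalization described above may be sketched as follows. This is only an illustrative sketch under stated assumptions: the function names (`lpc_envelope`, `flatten`, `normalize_power`), the sign convention A(z) = 1 + a[0]z^-1 + ... and the use of gain/|A| as the envelope are assumptions made for this example and are not taken from the cited methods.

```python
import cmath
import math

def lpc_envelope(a, gain, n_bins):
    """Magnitude envelope gain / |A(e^{jw})| at n_bins frequencies in [0, pi).

    Assumes A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p (sign convention is an assumption).
    """
    env = []
    for i in range(n_bins):
        w = math.pi * i / n_bins
        A = 1.0 + sum(ak * cmath.exp(-1j * w * (k + 1)) for k, ak in enumerate(a))
        env.append(gain / abs(A))
    return env

def flatten(spectrum, envelope):
    """Divide each frequency coefficient by the envelope value at the same bin."""
    return [s / e for s, e in zip(spectrum, envelope)]

def normalize_power(coeffs, mode="mean"):
    """Scale the coefficients so that the mean (or maximum) power becomes 1."""
    powers = [c * c for c in coeffs]
    ref = max(powers) if mode == "max" else sum(powers) / len(powers)
    scale = math.sqrt(ref)
    return [c / scale for c in coeffs], scale

# Hypothetical frame: a spectrum that is exactly twice a first-order envelope.
env = lpc_envelope([-0.9], 1.0, 8)
spectrum = [2.0 * e for e in env]
residual, scale = normalize_power(flatten(spectrum, env))
```

In this toy case the flattened sequence is constant, so the normalized residual signals all have unit magnitude and the factor 2 is carried by `scale`.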
As an example of such an audio signal compression method, there is TwinVQ (Iwagami, Moriya, Miki: "Audio Coding by Frequency-Weighted Interleave Vector Quantization (TwinVQ)" Anthology of Lectured Papers of Acoustic Society, 1-P-1, pp.339-340, 1994).
Next, a speech signal compression method according to a prior art will be described.
First of all, a time series of an input speech signal is subjected to LPC analysis for each frame, whereby it is divided into LPC spectrum envelope components, such as LPC coefficients, LSP coefficients, or PARCOR coefficients, and residual signals, the frequency characteristic of which is flattened. The LPC spectrum envelope components are scalar-quantized, and the flattened residual signals are quantized according to a previously prepared sound source code book, whereby the components and the signals are transformed into digital signals, respectively.
As an example of such a speech signal compression method, there is CELP (M. R. Schroeder and B. S. Atal, "Code-excited Linear Prediction (CELP) High Quality Speech at Very Low Rates", Proc. ICASSP-85, March 1985).
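The separation of a frame into envelope components and flattened residual signals may be illustrated by the following sketch. It assumes the autocorrelation method with the Levinson-Durbin recursion, which is one common way of carrying out LPC analysis. The function names, the sign convention A(z) = 1 + a[0]z^-1 + ... and the sign of the PARCOR (reflection) coefficients are assumptions of this example, and quantization is omitted.

```python
def autocorrelation(x, max_lag):
    """Short-time autocorrelation r[0..max_lag] of one frame."""
    n = len(x)
    return [sum(x[i] * x[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values.

    Returns (a, parcor, err), where A(z) = 1 + a[0] z^-1 + ... + a[order-1] z^-order,
    parcor holds the reflection coefficients of each stage, and err is the
    final prediction error power.
    """
    a, parcor, err = [], [], r[0]
    for m in range(order):
        acc = r[m + 1] + sum(a[k] * r[m - k] for k in range(m))
        k_m = -acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[k] + k_m * a[m - 1 - k] for k in range(m)] + [k_m]
        parcor.append(k_m)
        err *= 1.0 - k_m * k_m
    return a, parcor, err

def residual(x, a):
    """Inverse-filter the frame with A(z) to obtain the prediction residual."""
    return [x[n] + sum(a[k] * x[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
            for n in range(len(x))]

# Hypothetical autocorrelation of a first-order process with correlation 0.5.
a, parcor, err = levinson_durbin([1.0, 0.5, 0.25], 2)
# A decaying sequence that the first-order predictor removes completely.
e = residual([1.0, 0.5, 0.25, 0.125], [-0.5])
```

Here the second-order analysis finds that only the first coefficient is needed, and the residual of the decaying sequence reduces to a single impulse.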
Further, a speech recognition method according to a prior art will be described.
Generally, in a speech recognition apparatus, speech recognition is performed as follows. A standard model for each phoneme or word is formed in advance by using speech data as a base, and a parameter corresponding to a spectrum envelope is obtained from an input speech. Then, the similarity between the time series of the input speech and the standard model is calculated, and a phoneme or word corresponding to the standard model having the highest similarity is found. In this case, hidden Markov model (HMM) or the time series itself of a representative parameter is used as the standard model (Seiici Nakagawa "Speech Recognition by Probability Model", Edited by Electronics Information and Communication Society, pp.18-80.)
Conventionally, recognition is performed using, as a time series of a parameter obtained from an input speech, the following cepstrum coefficients: LPC cepstrum coefficients which are obtained by transforming a time series of an input speech into LPC coefficients for each length of a specific period (frame) by LPC analysis and then subjecting the resulting LPC coefficients to cepstrum transformation ("Digital Signal Processing of Speech and Audio Information", by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp.10-16), or cepstrum coefficients which are obtained by transforming an input speech into power spectrums for each length of a specific period (frame) by DFT or a band-pass filter bank and then subjecting the resulting power spectrums to cepstrum transformation.
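The transformation from LPC coefficients to LPC cepstrum coefficients may be sketched with the well-known recursion for an all-pole model 1/A(z). The function name `lpc_to_cepstrum` and the sign convention A(z) = 1 + a[0]z^-1 + ... are assumptions of this example, and the gain term c[0] is omitted.

```python
def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c[1..n_ceps] of the all-pole model 1/A(z).

    Assumes A(z) = 1 + a[0] z^-1 + ... + a[p-1] z^-p. Uses the recursion
    c_n = -a_n - (1/n) * sum_{k} k * c_k * a_{n-k}, with a_n = 0 for n > p.
    """
    p = len(a)
    c = [0.0] * (n_ceps + 1)  # c[0] (log gain) is not computed in this sketch
    for n in range(1, n_ceps + 1):
        acc = sum(k * c[k] * a[n - k - 1] for k in range(max(1, n - p), n))
        c[n] = -(a[n - 1] if n <= p else 0.0) - acc / n
    return c[1:]

# For A(z) = 1 - 0.5 z^-1 the exact cepstrum is c_n = 0.5**n / n.
ceps = lpc_to_cepstrum([-0.5], 3)
```

The recursion needs only the LPC coefficients themselves, so no DFT is involved in this path.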
In the prior art audio signal compression method, residual signals are obtained by dividing a frequency characteristic signal sequence calculated by MDCT or FFT by an LPC spectrum envelope, and normalizing the result.
On the other hand, in the prior art speech signal compression method, an input audio signal is separated into an LPC spectrum envelope calculated by LPC analysis and residual signals. The prior art audio signal compression method and the prior art speech signal compression method are similar in that spectrum envelope components are removed from the input signal by the standard LPC analysis, i.e., residual signals are obtained by normalizing (flattening) the input signal by the spectrum envelope. Therefore, if the performance of this LPC analysis is improved or the estimation precision of the spectrum envelope obtained by the LPC analysis is increased, it is possible to compress information more efficiently than the prior art methods while maintaining a high sound quality.
In the standard LPC analysis, an envelope is estimated with a frequency resolution of the same precision for each frequency band. Therefore, in order to increase the frequency resolution for a low frequency band which is auditively important, i.e., in order to obtain a spectrum envelope of a low frequency band precisely, the analysis order must be increased, resulting in an increased amount of information.
Further, increasing the analysis order results in an unnecessary increase in resolution for a high frequency band which is not auditively very important. In this case, calculation of a spectrum envelope having a peak in a high frequency band might be required, thereby degrading the sound quality.
Furthermore, in the prior art audio signal compression method, when vector quantization is performed, weighting is carried out on the basis of a spectrum envelope alone. Therefore, efficient quantization utilizing human auditory characteristics is impossible in the standard LPC analysis.
In the prior art speech recognition method, if LPC cepstrum coefficients obtained by the standard LPC analysis are used for the recognition, sufficient recognition performance might not be achieved because the LPC analysis is not based on human auditory characteristics.
It is well known that the human hearing fundamentally has a tendency to regard low-band frequency components as important and regard high-band frequency components as less important than the low-band components.
There is proposed a recognition method based on such tendency wherein recognition is performed using LPC mel-cepstrum coefficients which are obtained by subjecting the LPC cepstrum coefficients to mel-transformation ("Digital Signal Processing of Speech and Audio Information", by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp.39-40). However, in the LPC analysis for producing LPC cepstrum coefficients, human auditory characteristics are not sufficiently considered and, therefore, low-band information which is auditively important is not sufficiently reflected in LPC mel-cepstrum coefficients obtained by subjecting the cepstrum coefficients to mel-transformation.
The mel-frequency scale is a scale obtained from pitch perceptivity characteristics of human beings. It is well known that the pitch depends on the intensity of sound as well as the frequency. So, a pure sound of 1000 Hz and 40 dB SPL is used as a reference sound of 1000 mel, and sounds perceived as double and half in pitch are measured by magnitude measurement or the like and decided as 2000 mel and 500 mel, respectively. However, since human auditory characteristics are not sufficiently considered in the LPC analysis for producing the LPC cepstrum coefficients as described above, substantial improvement of the recognition performance cannot be expected even if mel-transformation is performed.
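For illustration only, the relation between frequency and mel may be approximated by a widely used analytic formula, m = 2595 log10(1 + f/700). This formula is an assumption introduced for this example; the mel scale itself is defined by the perceptual measurements described above, not by this formula. Under this approximation a 1000 Hz tone maps to roughly 1000 mel.

```python
import math

def hz_to_mel(f_hz):
    """Approximate mel value of a frequency in Hz (analytic fit, not the measured scale)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# Frequencies that this approximation assigns to half, reference and double pitch.
for mel in (500.0, 1000.0, 2000.0):
    print(f"{mel:6.0f} mel ~ {mel_to_hz(mel):8.1f} Hz")
```

The mapping is nearly linear at low frequencies and compressive at high frequencies, which corresponds to the finer auditory resolution in the low band.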
Further, in the standard LPC analysis, a spectrum envelope is estimated with the same frequency resolution for each frequency band. Therefore, in order to increase the frequency resolution for a low frequency band which is auditively important, i.e., to obtain a spectrum envelope of a low frequency band precisely, the analysis order must be increased, resulting in increased parameters and increased throughput for recognition. Furthermore, increasing the analysis order results in an unnecessary increase in resolution for a high frequency band and, thereby, the high frequency band may have an unnecessary feature, degrading the recognition performance.
There is another speech recognition method wherein speech recognition is performed using cepstrum coefficients or mel-cepstrum coefficients as parameters. In this method, however, the computational complexity of the DFT or the band-pass filter bank is higher than that of the LPC analysis.
The present invention is made to solve the above-described problems, in view of the fact that the speech recognition performance can be improved by using the following coefficients: mel-LPC coefficients obtained as a result of an LPC analysis of improved performance, i.e., based on human auditory characteristics (hereinafter referred to as "mel-LPC analysis"); mel-PARCOR coefficients obtained from mel-LPC coefficients by a well-known method similar to the method of obtaining PARCOR coefficients from standard LPC coefficients; mel-LSP coefficients obtained from mel-LPC coefficients by a well-known method similar to the method of obtaining LSP coefficients from standard LPC coefficients; or mel-LPC cepstrum coefficients obtained by subjecting mel-LPC coefficients to cepstrum transformation.
Improving the audio or speech signal compression performance or the speech recognition performance by using these mel-coefficients has conventionally been contemplated, but it has never been actually carried out because of the enormous amount of computation.
In the prior art, an infinite number of operations is required to calculate these coefficients and, if the number of operations is limited, errors result. The inventors found, as the result of intensive studies in view of the existing state, that there is a brand-new operation that can provide a result equivalent to that of the infinite operation without any error, by only performing the new operation a prescribed number of times.