The present invention relates to a method and an apparatus for compressing an audio signal obtained by transforming music into an electric signal, and a method and an apparatus for compressing a speech signal obtained by transforming speech into an electric signal, which are capable of compressing the audio signal or the speech signal more efficiently than conventional methods and apparatuses while maintaining a high sound quality, in particular, when compressing the audio signal or the speech signal using a weighting function on frequency based on human auditory characteristics, in order to enable information transmission of the audio signal or the speech signal by a transmission line of a small capacity and efficient storage of the audio signal or the speech signal into recording media.
The present invention further relates to a method and an apparatus for recognizing speech, which are capable of providing a higher recognition rate than conventional methods and apparatuses, in particular, when performing recognition using parameters having different resolutions for different frequencies, which parameters are obtained by a linear prediction coding analysis utilizing human auditory characteristics.
There have been proposed a variety of audio signal compression methods of this type and, hereinafter, one example of those methods will be described.
Initially, a time series of an input audio signal is transformed into a frequency characteristic signal sequence for each length of a specific period (frame) by MDCT (modified discrete cosine transform), FFT (fast Fourier transform) or the like. Further, the input audio signal is subjected to linear predictive analysis (LPC analysis), frame by frame, to extract LPC coefficients (linear predictive coefficients), LSP coefficients (line spectrum pair coefficients), PARCOR coefficients (partial auto-correlation coefficients) or the like, and an LPC spectrum envelope is obtained from these coefficients. Next, the frequency characteristic is flattened by dividing the calculated frequency characteristic signal sequence by the LPC spectrum envelope and normalizing it, and then the power is normalized using the maximum value or the mean value of the power.
In the following description, the output coefficients of the power normalization are called "residual signals". Further, the flattened residual signals are vector-quantized using the spectrum envelope as a weight.
As an example of such an audio signal compression method, there is TwinVQ (Iwagami, Moriya, Miki: "Audio Coding by Frequency-Weighted Interleave Vector Quantization (TwinVQ)", Anthology of Lectured Papers of Acoustic Society, 1-P-1, pp. 339-340, 1994).
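The flattening and power normalization steps described above can be illustrated with a minimal sketch. It is not the TwinVQ implementation: it assumes an FFT rather than an MDCT, an autocorrelation-method LPC analysis, and normalization by the mean power, and the function names `lpc_coefficients` and `flatten_frame` are hypothetical.

```python
import numpy as np

def lpc_coefficients(frame, order):
    """Autocorrelation-method LPC (assumed variant).

    Returns [1, a_1, ..., a_p], the coefficients of the inverse filter A(z).
    """
    n = len(frame)
    # Autocorrelation values r[0..order] of the frame.
    r = np.array([np.dot(frame[:n - k], frame[k:]) for k in range(order + 1)])
    # Normal equations: sum_j a_j * r[|i - j|] = -r[i], i = 1..order.
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, -r[1:])
    return np.concatenate(([1.0], a))

def flatten_frame(frame, order):
    """Divide the frame spectrum by its LPC envelope, then normalize the power."""
    frame = np.asarray(frame, dtype=float)
    spectrum = np.fft.rfft(frame)                  # frequency characteristic sequence
    a = lpc_coefficients(frame, order)
    envelope = 1.0 / np.abs(np.fft.rfft(a, n=len(frame)))  # LPC envelope 1/|A|
    flattened = spectrum / envelope                # flattening by the envelope
    power = np.mean(np.abs(flattened) ** 2)        # mean power of the flattened frame
    residual = flattened / np.sqrt(power)          # power-normalized residual signal
    return residual, envelope, power
```

In this sketch the returned `residual` has unit mean power, and `envelope` and `power` are the side information a coder would quantize alongside it.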
Next, a speech signal compression method according to a prior art will be described.
First of all, a time series of an input speech signal is subjected to LPC analysis for each frame, whereby it is divided into LPC spectrum envelope components, such as LPC coefficients, LSP coefficients, or PARCOR coefficients, and residual signals, the frequency characteristic of which is flattened. The LPC spectrum envelope components are scalar-quantized, and the flattened residual signals are quantized according to a previously prepared sound source code book, whereby the components and the signals are transformed into digital signals, respectively.
As an example of such a speech signal compression method, there is CELP (M. R. Schroeder and B. S. Atal, "Code-excited Linear Prediction (CELP) High Quality Speech at Very Low Rates", Proc. ICASSP-85, March 1985).
Further, a speech recognition method according to a prior art will be described.
Generally, in a speech recognition apparatus, speech recognition is performed as follows. A standard model for each phoneme or word is formed in advance by using speech data as a base, and a parameter corresponding to a spectrum envelope is obtained from an input speech. Then, the similarity between the time series of the input speech and the standard model is calculated, and a phoneme or word corresponding to the standard model having the highest similarity is found. In this case, a hidden Markov model (HMM) or the time series itself of a representative parameter is used as the standard model (Seiichi Nakagawa, "Speech Recognition by Probability Model", edited by Electronics Information and Communication Society, pp. 18-20).
Conventionally, recognition is performed using, as a time series of a parameter obtained from an input speech, the following cepstrum coefficients: LPC cepstrum coefficients which are obtained by transforming a time series of an input speech into LPC coefficients for each length of a specific period (frame) by LPC analysis and then subjecting the resulting LPC coefficients to cepstrum transformation ("Digital Signal Processing of Speech and Audio Information", by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp. 10-16), or cepstrum coefficients which are obtained by transforming an input speech into power spectra for each length of a specific period (frame) by DFT or a band-pass filter bank and then subjecting the resulting power spectra to cepstrum transformation.
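The cepstrum transformation of LPC coefficients mentioned above is commonly computed with a short recursion. The following is a minimal sketch of that standard recursion, not text from the cited reference; it assumes the all-pole model H(z) = 1/A(z) with A(z) = 1 + a_1 z^-1 + ... + a_p z^-p, omits the gain term, and the function name `lpc_to_cepstrum` is hypothetical.

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Cepstrum c_1..c_n_ceps of H(z) = 1/A(z), A(z) = 1 + sum_k a[k-1] z^-k.

    Uses the recursion c_n = -a_n - (1/n) * sum_{k=1}^{n-1} k * c_k * a_{n-k},
    with a_m treated as zero for m > p.
    """
    a = np.asarray(a, dtype=float)
    p = len(a)
    c = np.zeros(n_ceps + 1)            # c[0] is left unused (gain term omitted)
    for n in range(1, n_ceps + 1):
        acc = -a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:              # a_{n-k} exists only up to order p
                acc -= (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]
```

For a single-pole model with a_1 = -0.5, the recursion reproduces the known series c_n = 0.5^n / n, which is a convenient sanity check.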
In the prior art audio signal compression method, residual signals are obtained by dividing a frequency characteristic signal sequence calculated by MDCT or FFT by an LPC spectrum envelope, and normalizing the result.
On the other hand, in the prior art speech signal compression method, an input speech signal is separated into an LPC spectrum envelope calculated by LPC analysis and residual signals. The prior art audio signal compression method and the prior art speech signal compression method are similar in that spectrum envelope components are removed from the input signal by the standard LPC analysis, i.e., residual signals are obtained by normalizing (flattening) the input signal by the spectrum envelope. Therefore, if the performance of this LPC analysis is improved or the estimation precision of the spectrum envelope obtained by the LPC analysis is increased, it is possible to compress information more efficiently than the prior art methods while maintaining a high sound quality.
In the standard LPC analysis, an envelope is estimated with a frequency resolution of the same precision for each frequency band. Therefore, in order to increase the frequency resolution for a low frequency band which is auditively important, i.e., in order to obtain a spectrum envelope of a low frequency band precisely, the analysis order must be increased, resulting in an increased amount of information.
Further, increasing the analysis order results in an unnecessary increase in resolution for a high frequency band which is not auditively very important. In this case, calculation of a spectrum envelope having a peak in a high frequency band might be required, thereby degrading the sound quality.
Furthermore, in the prior art audio signal compression method, when vector quantization is performed, weighting is carried out on the basis of a spectrum envelope alone. Therefore, efficient quantization utilizing human auditory characteristics is impossible in the standard LPC analysis.
In the prior art speech recognition method, if LPC cepstrum coefficients obtained by the standard LPC analysis are used for the recognition, sufficient recognition performance might not be achieved because the LPC analysis is not based on human auditory characteristics.
It is well known that human hearing fundamentally has a tendency to regard low-band frequency components as important and to regard high-band frequency components as less important than the low-band components.
There is proposed a recognition method based on such tendency wherein recognition is performed using mel-LPC coefficients which are obtained by subjecting the LPC cepstrum coefficients to mel-transformation ("Digital Signal Processing of Speech and Audio Information", by Kiyohiro Sikano, Satosi Nakamura, Siro Ise, Shyokodo, pp. 39-40). However, in the LPC analysis for producing LPC cepstrum coefficients, human auditory characteristics are not sufficiently considered and, therefore, low-band information which is auditively important is not sufficiently reflected in LPC mel-cepstrum coefficients obtained by subjecting the cepstrum coefficients to mel transformation.
The mel-frequency scale is a scale obtained from the pitch perception characteristics of human beings. It is well known that pitch depends on the intensity of a sound as well as its frequency. Accordingly, a pure tone of 1000 Hz at 40 dB SPL is used as a reference sound of 1000 mel, and sounds perceived as double and half in pitch are measured by magnitude measurement or the like and defined as 2000 mel and 500 mel, respectively. However, since human auditory characteristics are not sufficiently considered in the LPC analysis for producing the LPC cepstrum coefficients as described above, a substantial improvement in recognition performance cannot be expected even if mel-transformation is performed.
Further, in the standard LPC analysis, a spectrum envelope is estimated with the same frequency resolution for each frequency band. Therefore, in order to increase the frequency resolution for a low frequency band which is auditively important, i.e., to obtain a spectrum envelope of a low frequency band precisely, the analysis order must be increased, resulting in more parameters and a larger processing load for recognition. Furthermore, increasing the analysis order results in an unnecessary increase in resolution for a high frequency band and, thereby, the high frequency band may have an unnecessary feature, degrading the recognition performance.
There is another speech recognition method wherein speech recognition is performed using cepstrum coefficients or mel-cepstrum coefficients as parameters. In this method, however, the computational complexity of the DFT or band-pass filter bank is higher than that of the LPC analysis.
The present invention is made to solve the above-described problems, in view of the fact that the speech recognition performance can be improved by using the following coefficients: mel-LPC coefficients obtained as a result of an LPC analysis of improved performance, i.e., based on human auditory characteristics (hereinafter referred to as "mel-LPC analysis"); mel-PARCOR coefficients obtained from mel-LPC coefficients by a well-known method similar to the method of obtaining PARCOR coefficients from standard LPC coefficients; mel-LSP coefficients obtained from mel-LPC coefficients by a well-known method similar to the method of obtaining LSP coefficients from standard LPC coefficients; or mel-LPC cepstrum coefficients obtained by subjecting mel-LPC coefficients to cepstrum transformation.
Improving the audio or speech signal compression performance or the speech recognition performance by using these mel-coefficients has been contemplated conventionally, but it has never actually been carried out because of the enormous amount of computation required.
In the prior art, an infinite operation is required to calculate these coefficients and, if the operation is truncated, errors are introduced. As a result of intensive study in view of this situation, the inventors found a new operation that gives a result equivalent to the infinite operation, without any error, when performed only a prescribed number of times.
It is an object of the present invention to provide an audio signal compression method, an audio signal compression apparatus, a speech signal compression method, a speech signal compression apparatus, a speech recognition method, and a speech recognition apparatus, which realize improvement of compression performance of audio and speech signals and improvement of speech recognition performance by performing weighting of frequency based on human auditory characteristics by using the new operation described above.
In other words, it is an object of the present invention to provide an audio signal compression method, an audio signal compression apparatus, a speech signal compression method, and a speech signal compression apparatus, which can compress audio or speech signals more efficiently than the prior art methods and apparatus while maintaining a high sound quality by improving the performance of LPC analysis using a spectrum envelope based on a weighting function of frequency adapted to human auditory characteristics or by increasing the precision in estimation of a spectrum envelope obtained by LPC analysis.
It is another object of the present invention to provide a speech recognition method and a speech recognition apparatus, which can recognize the feature of a spectrum envelope efficiently even with fewer parameters since parameters corresponding to the spectrum envelope are obtained by mel-LPC analysis using a weighting function of frequency based on human auditory characteristics, and realize high recognition performance with a smaller processing amount than that of the prior art methods and apparatus, by using the parameters.
According to a first aspect of the present invention, an audio signal compression method for compressively coding an input audio signal includes the steps of: calculating a spectrum envelope having different resolutions for different frequencies, from the input audio signal, using a weighting function of frequency based on human auditory characteristics; and flattening the input audio signal for each frame using the calculated spectrum envelope.
According to a second aspect of the present invention, an audio signal compression method for compressively coding an input audio signal includes the steps of: transforming the input signal into a frequency-warped signal with an all-pass filter, using a weighting function of frequency based on human auditory characteristics; obtaining a spectrum envelope having different resolutions for different frequencies, by performing linear predictive analysis of the frequency-warped signal; and flattening the input audio signal for each frame using the spectrum envelope.
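The frequency warping produced by an all-pass filter can be illustrated with a short sketch. It assumes a first-order all-pass section (z^-1 - alpha) / (1 - alpha z^-1), which this passage does not specify, and an illustrative value alpha = 0.4; the function name `warped_frequency` is hypothetical. For 0 < alpha < 1 the phase of this section maps each normalized frequency to a higher warped frequency, so the low band occupies more of the warped axis and is analyzed with finer resolution.

```python
import numpy as np

def warped_frequency(omega, alpha):
    """Warped frequency of a first-order all-pass section (assumed form).

    omega: normalized angular frequency in [0, pi]; alpha: warping factor, |alpha| < 1.
    The value returned is minus the phase of (z^-1 - alpha) / (1 - alpha z^-1)
    evaluated at z = exp(j * omega).
    """
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan2(alpha * np.sin(omega),
                                    1.0 - alpha * np.cos(omega))

if __name__ == "__main__":
    omega = np.linspace(0.0, np.pi, 9)
    # Print pairs (omega, warped omega); the end points 0 and pi map to themselves.
    for w, ww in zip(omega, warped_frequency(omega, 0.4)):
        print(f"{w:.3f} -> {ww:.3f}")
```

Setting alpha = 0 gives the identity mapping, in which case the analysis reduces to standard linear prediction on a linear frequency axis.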
According to a third aspect of the present invention, an audio signal compression method for compressively coding an input audio signal includes the steps of: performing mel-linear predictive analysis including frequency warping in a prediction model, thereby obtaining a spectrum envelope having different resolutions for different frequencies, from the input audio signal, using a weighting function of frequency based on human auditory characteristics; and flattening the input audio signal for each frame using the spectrum envelope.
According to a fourth aspect of the present invention, there is provided an audio signal compression method for compressively coding an input audio signal, which method has the step of performing mel-linear predictive analysis including frequency warping in a prediction model, thereby calculating a spectrum envelope having different resolutions for different frequencies, from the input audio signal, using a weighting function of frequency based on human auditory characteristics. The mel-linear predictive analysis comprises the steps of: cutting out an input signal of a specific time length from the input audio signal, and filtering the signal of the time length using multiple stages of all-pass filters to obtain output signals from the respective filters; obtaining a correlation function on a mel-frequency axis by performing a product-sum operation between the input signal and the output signal from each filter, which product-sum operation is performed within a range restricted to the time length of the input signal as represented by the following formula,
$$\phi(i,j) = \sum_{n=0}^{N-1} x[n] \cdot y_{(i-j)}[n]$$
wherein $\phi(i,j)$ is the correlation function, $x[n]$ is the input signal, and $y_{(i-j)}[n]$ is the output signal from each filter; obtaining mel-linear predictive coefficients from the correlation function on the mel-frequency axis; and using the mel-linear predictive coefficients as a spectrum envelope, or obtaining a spectrum envelope from the mel-linear predictive coefficients.
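The product-sum formula above can be tried out with a minimal sketch. Several details are assumptions not stated in this passage: each stage is taken to be the first-order all-pass section (z^-1 - alpha) / (1 - alpha z^-1), the index (i-j) is taken as |i-j| by symmetry, the mel-linear predictive coefficients are obtained from normal equations by analogy with standard LPC, and the names `allpass_chain`, `mel_correlation` and `mel_lpc` are hypothetical.

```python
import numpy as np

def allpass_chain(x, order, alpha, length=None):
    """Outputs y_0..y_order of cascaded first-order all-pass filters.

    y_0 is the input (zero-padded or truncated to `length`); each stage applies
    y_i[n] = alpha * (y_i[n-1] - y_{i-1}[n]) + y_{i-1}[n-1].
    """
    x = np.asarray(x, dtype=float)
    n_out = len(x) if length is None else length
    y = np.zeros((order + 1, n_out))
    y[0, :min(len(x), n_out)] = x[:n_out]
    for i in range(1, order + 1):
        prev_in = 0.0    # y_{i-1}[n-1]
        prev_out = 0.0   # y_i[n-1]
        for n in range(n_out):
            cur_in = y[i - 1, n]
            y[i, n] = alpha * (prev_out - cur_in) + prev_in
            prev_in, prev_out = cur_in, y[i, n]
    return y

def mel_correlation(x, order, alpha):
    """phi(i, j) = sum_{n=0}^{N-1} x[n] * y_{|i-j|}[n], using only N samples."""
    x = np.asarray(x, dtype=float)
    y = allpass_chain(x, order, alpha)
    r = y @ x                                   # r[k] = sum_n x[n] * y_k[n]
    k = np.arange(order + 1)
    return r[np.abs(np.subtract.outer(k, k))]   # matrix with entries r[|i - j|]

def mel_lpc(x, order, alpha):
    """Solve sum_j phi(i, j) * a_j = -phi(i, 0), i = 1..order (assumed form)."""
    phi = mel_correlation(x, order, alpha)
    return np.linalg.solve(phi[1:, 1:], -phi[1:, 0])
```

Because an all-pass filter preserves energy and the input frame has finite length, the finite sum in `mel_correlation` matches the correlation of the filter outputs accumulated over a very long horizon, which can be checked numerically by comparing against `allpass_chain(x, order, alpha, length=2048)`. With alpha = 0 each stage is a unit delay and the result is the ordinary autocorrelation.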
According to a fifth aspect of the present invention, an audio signal compression apparatus for compressively coding an input audio signal comprises: time-to-frequency transformation means for transforming the input audio signal to a frequency domain signal; spectrum envelope calculation means for calculating a spectrum envelope having different resolutions for different frequencies, from the input audio signal, using a weighting function of frequency based on human auditory characteristics; normalization means for normalizing the frequency domain signal with the spectrum envelope to obtain a residual signal; power normalization means for normalizing the residual signal with the power; auditory weighting calculation means for calculating weighting coefficients of frequency, based on the spectrum of the input audio signal and human auditory characteristics; and multi-stage quantization means having plural stages of vector quantizers connected in series, to which the normalized residual signal is input, at least one of the vector quantizers quantizing the residual signal using the weighting coefficients.
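The multi-stage quantization means can be pictured with a minimal sketch of vector quantizers connected in series, each stage quantizing the error left by the previous one under a weighted squared-error measure. The weighted distance, the code books, and the function names `weighted_vq` and `multistage_vq` are illustrative assumptions; this sketch applies the same weights at every stage, whereas the text only requires that at least one stage use them.

```python
import numpy as np

def weighted_vq(vector, codebook, weights):
    """Return the index and code vector minimizing the weighted squared error."""
    errors = np.sum(weights * (codebook - vector) ** 2, axis=1)
    index = int(np.argmin(errors))
    return index, codebook[index]

def multistage_vq(vector, codebooks, weights):
    """Quantize `vector` with code books in series; each stage codes the remainder."""
    residual = np.asarray(vector, dtype=float)
    weights = np.asarray(weights, dtype=float)
    indices = []
    for codebook in codebooks:
        index, code = weighted_vq(residual, np.asarray(codebook, dtype=float), weights)
        indices.append(index)
        residual = residual - code      # quantization error passed to the next stage
    return indices, residual

if __name__ == "__main__":
    stage1 = [[1.0, 0.0], [0.0, 1.0]]
    stage2 = [[0.0, 0.25], [0.0, -0.25]]
    indices, error = multistage_vq([1.0, 0.2], [stage1, stage2], [1.0, 1.0])
    print(indices, error)
```

Changing the weights changes which code vector is selected: with the vector [0.6, 0.55] and `stage1` above, equal weights select index 0, while weights [0.1, 1.0] select index 1.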
According to a sixth aspect of the present invention, an audio signal compression apparatus for compressively coding an input audio signal comprises: mel-parameter calculation means for calculating mel-linear predictive coefficients on a mel-frequency axis which represents a spectrum envelope having different resolutions for different frequencies, from the input audio signal, using a weighting function of frequency based on human auditory characteristics; parameter transformation means for transforming the mel-linear predictive coefficients to parameters representing a spectrum envelope, such as linear predictive coefficients on a linear frequency axis; envelope normalization means for normalizing the input audio signal by inversely filtering it with the parameters representing the spectrum envelope, to obtain a residual signal; power normalization means for normalizing the residual signal using the maximum value or mean value of the power to obtain a normalized residual signal; and vector quantization means for vector-quantizing the normalized residual signal using a residual code book to transform the residual signal into residual codes.
According to a seventh aspect of the present invention, there is provided a speech signal compression method for compressively coding an input speech signal, which method has the step of performing mel-linear predictive analysis including frequency warping in a prediction model, thereby calculating a spectrum envelope having different resolutions for different frequencies, from the input speech signal, using a weighting function of frequency based on human auditory characteristics. The mel-linear predictive analysis comprises the steps of: cutting out an input signal of a specific time length from the input speech signal, and filtering the signal of the time length using multiple stages of all-pass filters to obtain output signals from the respective filters; obtaining a correlation function on a mel-frequency axis by performing a product-sum operation between the input signal and the output signal from each filter, which product-sum operation is performed within a range restricted to the time length of the input signal as represented by the following formula,
$$\phi(i,j) = \sum_{n=0}^{N-1} x[n] \cdot y_{(i-j)}[n]$$
wherein $\phi(i,j)$ is the correlation function, $x[n]$ is the input signal, and $y_{(i-j)}[n]$ is the output signal from each filter; obtaining mel-linear predictive coefficients from the correlation function on the mel-frequency axis; and using the mel-linear predictive coefficients as a spectrum envelope, or obtaining a spectrum envelope from the mel-linear predictive coefficients.
According to an eighth aspect of the present invention, a speech signal compression apparatus for compressively coding an input speech signal comprises: mel-parameter calculation means for calculating mel-linear predictive coefficients on a mel-frequency axis which represents a spectrum envelope having different resolutions for different frequencies, from the input speech signal, using a weighting function of frequency based on human auditory characteristics; parameter transformation means for transforming the mel-linear predictive coefficients to parameters representing a spectrum envelope, such as linear predictive coefficients on a linear frequency axis; envelope normalization means for normalizing the input signal by inversely filtering it with the parameters representing the spectrum envelope, to obtain a residual signal; power normalization means for normalizing the residual signal using the maximum value or mean value of the power to obtain a normalized residual signal; and vector quantization means for vector-quantizing the normalized residual signal using a residual code book to transform the residual signal into residual codes.
According to a ninth aspect of the present invention, there is provided a speech recognition method wherein parameters corresponding to a spectrum envelope are calculated from an input speech, by a linear predictive analysis method for calculating a spectrum envelope having different resolutions for different frequencies, using a weighting function of frequency based on human auditory characteristics; and the input speech is recognized using the parameters.
According to a tenth aspect of the present invention, a speech recognition method includes a method for obtaining a spectrum envelope based on human auditory characteristics from an input speech, which method comprises the steps of: transforming the input speech into a frequency-warped speech signal using an all-pass filter; and subjecting the frequency-warped speech signal to linear predictive analysis to obtain parameters corresponding to a spectrum envelope having different resolutions for different frequencies, and the input speech is recognized using the parameters so obtained.
According to an eleventh aspect of the present invention, a speech recognition method employs a mel-linear predictive analysis method including frequency warping in a prediction model as a method for obtaining parameters corresponding to a spectrum envelope based on human auditory characteristics from an input speech, and recognizes the input speech using the parameters.
According to a twelfth aspect of the present invention, a speech recognition method employs the following steps as a method for obtaining parameters corresponding to a spectrum envelope based on human auditory characteristics from an input speech: cutting out an input signal of a specific time length from an input speech, and filtering the signal of the time length using multiple stages of all-pass filters to obtain output signals from the respective filters; obtaining a correlation function on a mel-frequency axis by performing a product-sum operation between the input signal and the output signal from each filter, which product-sum operation is performed within a range restricted to the time length of the input signal as represented by the following formula,
$$\phi(i,j) = \sum_{n=0}^{N-1} x[n] \cdot y_{(i-j)}[n]$$
wherein $\phi(i,j)$ is the correlation function, $x[n]$ is the input signal, and $y_{(i-j)}[n]$ is the output signal from each filter; and obtaining mel-linear predictive coefficients from the correlation function on the mel-frequency axis; and the input speech is recognized using the mel-linear predictive coefficients, or cepstrum coefficients obtained from the mel-linear predictive coefficients.
According to a thirteenth aspect of the present invention, a speech recognition apparatus comprises: mel-linear predictive analysis means for calculating mel-linear predictive coefficients corresponding to a spectrum envelope having different resolutions for different frequencies, from an input speech, using a weighting function of frequency based on human auditory characteristics; cepstrum coefficient calculation means for calculating cepstrum coefficients from the mel-linear predictive coefficients obtained by the mel-linear predictive analysis means; and a speech recognition means for calculating distances between plural frames of the cepstrum coefficients and plural standard models or plural standard patterns, and deciding which one of the standard models or patterns is similar to the input speech.
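The final matching step, in which frames of cepstrum coefficients are compared with plural standard patterns, can be sketched as follows. The text does not fix the distance measure or the alignment method, so this example swaps in a plain dynamic time warping (DTW) with Euclidean frame distances as one possible choice; the function names `dtw_distance` and `recognize` and the toy patterns are hypothetical.

```python
import numpy as np

def dtw_distance(frames, template):
    """Accumulated Euclidean distance along the best DTW alignment path."""
    frames = np.asarray(frames, dtype=float)
    template = np.asarray(template, dtype=float)
    n, m = len(frames), len(template)
    # local[i, j] = distance between input frame i and template frame j.
    local = np.linalg.norm(frames[:, None, :] - template[None, :, :], axis=2)
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = local[i - 1, j - 1] + min(acc[i - 1, j],
                                                  acc[i, j - 1],
                                                  acc[i - 1, j - 1])
    return acc[n, m]

def recognize(frames, templates):
    """Return the label of the closest standard pattern and all distances."""
    distances = {label: dtw_distance(frames, t) for label, t in templates.items()}
    return min(distances, key=distances.get), distances

if __name__ == "__main__":
    templates = {
        "a": [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]],
        "b": [[5.0, 5.0], [5.0, 5.0]],
    }
    spoken = [[0.0, 0.0], [0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
    label, distances = recognize(spoken, templates)
    print(label, distances)
```

In the toy example the input is a time-stretched copy of pattern "a", so the alignment absorbs the repeated frame and "a" is selected. A hidden Markov model could be used in place of the stored patterns, as the background section notes.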