Terms used in this specification are defined as follows.
A “sound pressure of a speech” is defined as the rise in ambient pressure that occurs when the speech is present. The sound pressure is expressed in units of [N/m²]. This quantity is proportional to the square root of the energy of the speech and to the amplitude of the speech waveform.
A “sound pressure level” is defined to be a logarithmic measure indicating a ratio of the sound pressure of a target speech relative to a reference sound pressure. The sound pressure level is expressed in units of [dB]. Specifically, the sound pressure level is expressed by the following Expression (1):
    Sound Pressure Level = 20 log₁₀(Sound Pressure of Target Speech / Reference Sound Pressure)  (1)
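As a minimal illustration of Expression (1), the following sketch estimates the sound pressure of a sampled waveform as its root-mean-square amplitude (in line with the proportionality to the square root of energy noted above) and converts it to a level in dB. The function names and the reference value of 20 micropascals (the conventional reference for sound in air) are assumptions made for illustration and are not taken from this specification.

```python
import math

REFERENCE_PRESSURE = 20e-6  # assumed reference sound pressure in N/m^2 (20 micropascals)

def rms_pressure(samples):
    """Root-mean-square amplitude of a waveform given as pressure samples in N/m^2."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def sound_pressure_level(pressure, reference=REFERENCE_PRESSURE):
    """Expression (1): 20 * log10(target pressure / reference pressure), in dB."""
    return 20.0 * math.log10(pressure / reference)

# A 1 kHz sine with 0.02 N/m^2 peak amplitude, sampled at 16 kHz for 10 ms.
wave = [0.02 * math.sin(2 * math.pi * 1000 * n / 16000) for n in range(160)]
level_db = sound_pressure_level(rms_pressure(wave))
```

Because the measure is logarithmic, a tenfold increase in sound pressure corresponds to an increase of 20 dB in sound pressure level.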
A “gain” is defined to be a ratio between the sound pressure of an output signal and the sound pressure of an input signal. Specifically, the gain is defined by the following Expression (2):
    Gain = Sound Pressure of Output Signal / Sound Pressure of Input Signal  (2)
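Expression (2) defines the gain as a linear ratio. Combined with Expression (1), a gain g shifts the sound pressure level by 20 log10(g) dB. The sketch below shows both views; the helper names and the numeric values are hypothetical and serve only as an example.

```python
import math

def gain(output_pressure, input_pressure):
    """Expression (2): ratio of output sound pressure to input sound pressure."""
    return output_pressure / input_pressure

def gain_in_db(g):
    """Level change, in dB, produced by a linear gain g (follows from Expression (1))."""
    return 20.0 * math.log10(g)

def apply_gain(samples, g):
    """Scale every waveform sample by the gain g."""
    return [g * s for s in samples]

g = gain(output_pressure=0.2, input_pressure=0.05)  # linear gain of about 4
shift_db = gain_in_db(g)                            # about 12.04 dB
```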
The purpose of a gain control apparatus used for normal audio communication or audio recording is to transform an input signal to a sound pressure that is easy for a human being to hear and then output the transformed signal.
In contrast, the purpose of a gain control apparatus used for speech recognition is to match the sound pressure of an input signal to the sound pressure of a speech model prepared in advance by training.
As described in Non-patent Document 1, a speech model is obtained by converting a speech spectrum into a feature, and the probability of the feature is expressed by a probability model such as an HMM (Hidden Markov Model) or a GMM (Gaussian Mixture Model).
FIG. 9 is a diagram showing a configuration of a gain control apparatus used for audio communication or audio recording, described in Patent Document 1. The gain control apparatus in FIG. 9 includes: an input signal acquisition unit 1 to which an audio signal is supplied; a plurality of band division filter means 11 that pass only signals of mutually different frequency bands that have been set in advance; absolute value converting means 12 that convert the respective signals output from the band division filter means 11 into absolute values; weighting data storage means 13 for storing therein weighting data for each frequency band divided by the band division filter means 11; a multiplier unit 14 that multiplies the absolute values obtained by the conversion by the weighting data; an adder unit 15 that sums the weighted values; gain compensation means 16 for compensating the summed value; threshold level storage means 17 that stores therein threshold level data to be compared with the compensated value; comparison means 18 for comparing the compensated value with the threshold level data; gain generation means 19 for generating a gain value based on the result of comparison by the comparison means 18; envelope generation means 20 for smoothing a variation of the gain value; and a sound pressure compensation unit 7 that multiplies the input signal by the gain value output from the envelope generation means 20, thereby performing sound pressure level compensation. With this arrangement, the input signal can be weighted for each frequency band that is highly likely to contain speech, and gain control is performed based on the weighted signals.
In the configuration in FIG. 9, however, a constant sound pressure is output without consideration of the sound pressure difference between phonemes, as shown in FIG. 10. Thus, unnatural speech may be produced.
For example, vowels generally have large sound pressures, while consonants generally have small sound pressures. The configuration in FIG. 9 does not take the difference between the sound pressures of vowels and consonants into consideration, and outputs a constant sound pressure. Thus, speech is output in which the consonants are excessively emphasized.
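The effect can be seen in a toy calculation. The sketch below assumes a hypothetical normalizer that scales every segment to the same target sound pressure; this is a deliberate simplification of the FIG. 9 apparatus, not a reproduction of it, and the segment labels and pressure values are invented for illustration.

```python
import math

TARGET_PRESSURE = 0.05  # hypothetical constant output sound pressure, N/m^2

# Hypothetical input segments: (label, sound pressure in N/m^2).
segments = [("consonant /s/", 0.005), ("vowel /a/", 0.05)]

def level_difference_db(p1, p2):
    """Difference in sound pressure level between two pressures, in dB."""
    return 20.0 * math.log10(p1 / p2)

def constant_output_gain(pressure, target=TARGET_PRESSURE):
    """Gain that forces a segment to the target pressure, ignoring which phoneme it is."""
    return target / pressure

gains = {label: constant_output_gain(p) for label, p in segments}
outputs = {label: p * gains[label] for label, p in segments}

# Level of the vowel relative to the consonant before and after normalization,
# and the additional gain the consonant received compared with the vowel.
before_db = level_difference_db(segments[1][1], segments[0][1])
after_db = level_difference_db(outputs["vowel /a/"], outputs["consonant /s/"])
extra_consonant_gain_db = 20.0 * math.log10(gains["consonant /s/"] / gains["vowel /a/"])
```

With these assumed values the vowel starts about 20 dB above the consonant and ends level with it, which means the consonant has been boosted by about 20 dB relative to the vowel.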
When the gain control apparatus is combined with a speech recognition apparatus that needs sound pressure information, this constant sound pressure output becomes a serious problem, leading to deterioration in recognition performance.
Further, when the frequency band of noise overlaps the frequency band of a target speech, there is also a problem that the noise is emphasized.
Next, a gain control apparatus used for speech recognition will be described. In normal speech recognition, to make recognition robust to variations in sound pressure, recognition is performed without using the zeroth cepstrum component or a power feature, both of which depend on the sound pressure.
However, in an approach that adapts a speech model to noise, such as the PMC (Parallel Model Combination) method, which is known as a speech recognition approach effective in noisy environments, information on the zeroth cepstrum component, which depends on the sound pressure, becomes necessary. Thus, a gain control method is needed (refer to Non-patent Document 2).
FIG. 11 illustrates the PMC method, which is an example of synthesizing a noise-adapted model from a clean acoustic model (clean speech model) and a noise model.
First, the speech model, which has been trained in advance on clean speech in the cepstral domain, is transformed into the spectral domain by applying an inverse cosine transform and an exponential transform. A clean speech spectrum is thereby obtained.
Likewise, the noise model, which has been trained on a silent segment preceding the utterance, is transformed into the spectral domain by applying the inverse cosine transform and the exponential transform. A noise spectrum is thereby obtained.
Next, the clean speech spectrum is multiplied by a level adjustment coefficient g (also referred to as a “level compensation coefficient”) and then added to the noise spectrum, thereby deriving a noise-adaptive speech spectrum.
Next, logarithmic conversion and a cosine transform are applied to the noise-adaptive speech spectrum, thereby obtaining a noise-adaptive speech model.
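The transformation steps above can be sketched for a single mean vector as follows. This is only an illustrative simplification: it combines mean vectors alone and ignores the variances of the model distributions, it uses an orthonormal DCT as the cosine transform, and it applies g directly to the spectrum obtained after the exponential transform. All of these choices, the function names, and the numeric values are assumptions and are not taken from Non-patent Document 2.

```python
import math

def dct(x):
    """Orthonormal DCT-II: log spectrum -> cepstrum."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct(): cepstrum -> log spectrum."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def cepstrum_to_spectrum(cep):
    """Inverse cosine transform followed by an exponential transform."""
    return [math.exp(v) for v in idct(cep)]

def spectrum_to_cepstrum(spec):
    """Logarithmic conversion followed by a cosine transform."""
    return dct([math.log(v) for v in spec])

def pmc_combine(clean_cep, noise_cep, g):
    """Noise-adapted cepstral mean from (g * clean spectrum + noise spectrum)."""
    clean_spec = cepstrum_to_spectrum(clean_cep)
    noise_spec = cepstrum_to_spectrum(noise_cep)
    adapted_spec = [g * s + n for s, n in zip(clean_spec, noise_spec)]
    return spectrum_to_cepstrum(adapted_spec)

# Hypothetical 4-dimensional cepstral means for one model state.
clean_mean = [2.0, 0.5, -0.3, 0.1]
noise_mean = [0.5, 0.0, 0.0, 0.0]
adapted = pmc_combine(clean_mean, noise_mean, g=1.0)
```

In this sketch, when the noise is negligible, changing g only shifts the zeroth cepstral component, which shows why that component carries the sound pressure information.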
At recognition time, recognition is performed by comparing an input signal with the noise-adaptive speech model.
In the PMC method, multiplying by the level adjustment coefficient g adjusts the mixture ratio between the sound pressure of the speech model and the sound pressure of the noise model computed from the input signal.
Accordingly, multiplication by the level adjustment coefficient g may be considered to be a kind of gain control.
In Non-patent Document 3 and Non-patent Document 4, the level adjustment coefficient g is estimated based on a likelihood maximization criterion.
Specifically, the following methods are provided:
    (A) a method of preparing a plurality of speech models for different sound pressures, and selecting the speech model whose likelihood is maximum; and
    (B) a method of regarding a gain value as a variable, and performing estimation repetitively so that the likelihood is maximized for each Gaussian distribution that constitutes a speech model.
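The idea behind method (A) can be sketched in a heavily reduced form. The example below replaces the speech model with a single one-dimensional Gaussian over frame level in dB, builds one shifted copy of that model per candidate gain, and keeps the candidate with the highest likelihood. The single-Gaussian model, the function names, and the data are assumptions for illustration; the methods of Non-patent Documents 3 and 4 operate on full HMM-based models and are not reproduced here.

```python
import math

def gaussian_log_likelihood(x, mean, var):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def select_gain(observed_levels_db, model_mean_db, model_var, candidate_gains):
    """Method (A), simplified: one model per candidate gain, keep the most likely one."""
    best_gain, best_ll = None, -math.inf
    for g in candidate_gains:
        # Model as it would look if trained at this sound pressure.
        shifted_mean = model_mean_db + 20.0 * math.log10(g)
        ll = sum(gaussian_log_likelihood(x, shifted_mean, model_var)
                 for x in observed_levels_db)
        if ll > best_ll:
            best_gain, best_ll = g, ll
    return best_gain, best_ll

# Hypothetical data: the model was trained at 60 dB, the input arrives near 66 dB.
observed = [65.5, 66.2, 66.0, 65.8]
candidates = [0.25, 0.5, 1.0, 2.0, 4.0]
best_gain, _ = select_gain(observed, model_mean_db=60.0, model_var=4.0,
                           candidate_gains=candidates)
```

Here the candidate g = 2 (about +6 dB) is selected. The resolution of the estimate is limited by the spacing of the candidates, so a finer estimate requires more prepared models.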
In both methods described above, gain control is performed based on the sound pressure of the speech at the time of training. Thus, gain control that reflects the sound pressure difference between phonemes can be performed.
However, in the method (A), in which speech models for different sound pressures are prepared in advance, accurate estimation requires preparing a large number of speech models in which the sound pressure is varied for every phoneme. Thus, this method is costly in terms of storage capacity and amount of computation.
In the method (B), in which the gain is regarded as a variable and estimation is performed repetitively, there are two problems: the repetitive estimation incurs a high computation cost, and sound pressure matching is performed on a completely different basis when the initially set value of the gain differs.
Patent Document 1: JP Patent Kokai Publication No. JP-P-2004-15125A
Non-patent Document 1: Guorong Xuan, Wei Zhang, Peiqi Chai, “EM Algorithms of Gaussian Mixture Model and Hidden Markov Model”, IEEE International Conference on Image Processing ICIP 2001, vol. 1, pp. 145-148, 2001
Non-patent Document 2: M. J. F. Gales and S. J. Young, “Robust Continuous Speech Recognition Using Parallel Model Combination”, IEEE Trans. SAP-4, No. 5, pp. 352-359, September 1996
Non-patent Document 3: Y. Minami and S. Furui, “A Maximum Likelihood Procedure for a Universal Adaptation Method Based on HMM Composition”, IEEE ICASSP'95, pp. 129-132, 1995
Non-patent Document 4: Kenji Takada and Jun Toyama, “Word Recognition Using the HMM Composition Method Which Suits a Signal-to-Noise Ratio Automatically”, IEICE Technical Report, SP2002-97 pp. 19-24, 2002
Non-patent Document 5: Richard O. Duda, Peter E. Hart, David G. Stork, supervised/translated by Morio Onoue, “Pattern Classification”, John Wiley & Sons. Singijutu Communications, pp. 528-529
Non-patent Document 6: “Suppression of Acoustic Noise in Speech Using Spectral Subtraction,” IEEE Trans. ASSP 27, pp. 113-120, 1979