Field of the Invention
The present invention relates to a speech processing device, a speech processing method, and a speech processing program.
Description of Related Art
A sound emitted in a room is repeatedly reflected by walls and installed objects, which causes reverberation. When reverberation is added, the frequency characteristics differ from those of the original speech, and thus the speech recognition rate may decrease. In addition, since previously-uttered speech overlaps with currently-uttered speech, the articulation rate may decrease. Therefore, reverberation reduction techniques for reducing reverberation components in speech recorded in reverberant environments have been developed.
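The overlap of previously-uttered speech with currently-uttered speech can be illustrated with a minimal sketch that models reverberation as convolution of a dry signal with a decaying room impulse response. The function name add_reverberation and the sample values are hypothetical and chosen only for illustration; they are not taken from any cited document.

```python
def add_reverberation(dry, impulse_response):
    """Linearly convolve a dry signal with a room impulse response.

    Each input sample is spread over later output samples, so the tail of an
    earlier sample is added on top of later samples.
    """
    out = [0.0] * (len(dry) + len(impulse_response) - 1)
    for t, sample in enumerate(dry):
        for m, tap in enumerate(impulse_response):
            out[t + m] += sample * tap
    return out


# Hypothetical values: two unit pulses two samples apart and a short decaying response.
dry = [1.0, 0.0, 1.0]
impulse_response = [1.0, 0.5, 0.25]
reverberant = add_reverberation(dry, impulse_response)
# reverberant[2] holds the second pulse plus the decayed tail of the first pulse.
```

In this sketch the third output sample is larger than the dry pulse because the tail of the first pulse is still present when the second pulse arrives.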
For example, Japanese Patent Publication No. 4396449 (Patent Document 1) describes a dereverberation method of acquiring a transfer function of a reverberation space using an impulse response of a feedback path adaptively identified by an inverse filter processing unit, and reconstructing a sound source signal by dividing a reverberant speech signal by the magnitude of the transfer function. In the dereverberation method described in Patent Document 1, the impulse response of the reverberation is estimated; however, since the reverberation time is relatively long, ranging from 0.2 to 2.0 seconds, the computational load becomes excessive and the processing delay becomes significant. Accordingly, the method has not been widely applied to speech recognition.
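The division by the magnitude of the transfer function can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of Patent Document 1: the impulse response is assumed to be already known rather than adaptively identified, circular convolution is used so that the discrete Fourier relation holds exactly, only the magnitude spectrum is recovered, and all function names and sample values are hypothetical.

```python
import cmath


def dft(signal):
    """Naive O(N^2) discrete Fourier transform, adequate for a short illustration."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def circular_convolve(source, impulse_response):
    """Simulate a reverberant signal by circular convolution (an assumption of this sketch)."""
    n = len(source)
    return [
        sum(impulse_response[m] * source[(t - m) % n]
            for m in range(len(impulse_response)))
        for t in range(n)
    ]


def divide_by_transfer_magnitude(reverberant, impulse_response, floor=1e-8):
    """Divide each spectral magnitude of the reverberant signal by |H(k)|.

    The floor (hypothetical parameter) guards against division by a near-zero
    transfer function magnitude.
    """
    n = len(reverberant)
    padded = list(impulse_response) + [0.0] * (n - len(impulse_response))
    transfer = dft(padded)        # transfer function from the impulse response
    spectrum = dft(reverberant)   # spectrum of the reverberant signal
    return [abs(spectrum[k]) / max(abs(transfer[k]), floor) for k in range(n)]


# Hypothetical source signal and impulse response.
source = [1.0, 0.5, -0.3, 0.2, 0.0, 0.0, 0.0, 0.0]
impulse_response = [1.0, 0.0, 0.4, 0.0, 0.1]
reverberant = circular_convolve(source, impulse_response)
recovered = divide_by_transfer_magnitude(reverberant, impulse_response)
```

Under these assumptions, recovered matches the magnitude spectrum of source. The work in both the transform and the division grows with the length of the impulse response, which is consistent with the load concern for long reverberation times.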
R. Gomez and T. Kawahara, “Optimization of Dereverberation Parameters based on Likelihood of Speech Recognizer”, INTERSPEECH, Speech & Language Processing, International Speech Communication Association, 2009, 1223-1226 (Non-patent Document 1) and R. Gomez and T. Kawahara, “Robust Speech Recognition based on Dereverberation Parameter Optimization using Acoustic Model Likelihood”, IEEE Transactions on Audio, Speech & Language Processing, IEEE, 2010, 18(7), 1708-1716 (Non-patent Document 2) describe methods of calculating a correction coefficient for each frequency band based on likelihoods calculated using an acoustic model, and of training the acoustic model. In these methods, the frequency band components of speech recorded in reverberant environments are corrected using the calculated correction coefficients, and speech recognition is performed using the trained acoustic model.
However, in the methods described in Non-patent Documents 1 and 2, when the positional relationship between a sound source and a sound collection unit differs from that used to determine the correction coefficients or the acoustic model, the reverberation component cannot be appropriately estimated from the recorded speech, and thus the reverberation reduction accuracy may decrease. For example, when the sound source is an utterer, the sound volume of the speech recorded by the sound collection unit varies as the utterer moves, and thus the estimation accuracy of the reverberation component may decrease.