1. Field of the Invention
The present invention generally relates to a technology for speech processing, and specifically relates to speech processing in an environment with background noise.
2. Description of the Related Art
In speech recognition under a noisy environment, a difference between the noise environment at the time of training and the noise environment at the time of recognition causes a mismatch of the speech model, which degrades recognition performance. One effective method for coping with this problem is the stereo-based piecewise linear compensation for environments (SPLICE) method proposed in Li Deng, Alex Acero, Li Jiang, Jasha Droppo and Xuedong Huang, “High-performance robust speech recognition using stereo training data”, Proceedings of 2001 International Conference on Acoustics, Speech, and Signal Processing, pp. 301-304.
The SPLICE method obtains a compensation vector in advance from pairs of clean speech data and noisy speech data, the noisy speech data being the clean speech data with noise superimposed on it. At the time of speech recognition, the method uses the compensation vector to bring the feature vector of the input speech close to the feature vector of the clean speech. The SPLICE method can therefore also be viewed as a method of noise reduction.
With such a compensation process, it has been reported that a high recognition rate can be achieved even under a mismatch between training conditions and recognition conditions.
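The compensation described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the method of the cited paper: the noisy feature space is partitioned by hard assignment to the nearest of a few fixed centroids (the function names, the centroids, and the toy two-dimensional data are all hypothetical), and the compensation vector for each region is taken as the average difference between the paired clean and noisy feature vectors.

```python
def nearest(centroids, y):
    """Index of the centroid closest to feature vector y (squared Euclidean)."""
    return min(range(len(centroids)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(centroids[k], y)))

def train_compensation(clean, noisy, centroids):
    """Average (clean - noisy) difference for each region of the noisy space."""
    dim = len(centroids[0])
    sums = [[0.0] * dim for _ in centroids]
    counts = [0] * len(centroids)
    for x, y in zip(clean, noisy):          # stereo pair: same frame, clean and noisy
        k = nearest(centroids, y)
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += x[d] - y[d]
    # Regions that received no training frames get a zero compensation vector.
    return [[s / counts[k] if counts[k] else 0.0 for s in sums[k]]
            for k in range(len(centroids))]

def compensate(y, centroids, comp):
    """Shift a noisy feature vector by the compensation vector of its region."""
    k = nearest(centroids, y)
    return [a + b for a, b in zip(y, comp[k])]

# Hypothetical toy data: three stereo frames and two regions.
clean = [[1.0, 2.0], [1.5, 2.5], [5.0, 6.0]]
noisy = [[2.0, 3.0], [2.5, 3.5], [5.5, 6.5]]
centroids = [[2.0, 3.0], [6.0, 7.0]]

comp = train_compensation(clean, noisy, centroids)
restored = compensate([2.0, 3.0], centroids, comp)
```

In this sketch the compensation vectors are computed once from the stereo data, and at recognition time only the region lookup and a vector addition are performed for each frame.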
However, because the conventional SPLICE method selects the noise environment for each frame, which is as short as 10 to 20 milliseconds, a different environment may be selected for each frame even when the same environment persists for a certain period of time, resulting in a degradation of the recognition performance.
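The frame-by-frame selection problem can be illustrated with a small sketch. The scores below are hypothetical, and the sketch assumes that the environment with the highest per-frame score is selected; the selection criterion is an assumption made only for illustration.

```python
def select_per_frame(frame_scores):
    """For each frame, pick the index of the environment with the highest score."""
    return [max(range(len(s)), key=lambda e: s[e]) for s in frame_scores]

def count_switches(selection):
    """Number of times the selected environment changes between adjacent frames."""
    return sum(1 for a, b in zip(selection, selection[1:]) if a != b)

# Hypothetical scores for two environments over six consecutive frames.
# The true environment is assumed to be environment 0 throughout, but the
# per-frame scores are close enough that the maximum flips back and forth.
frame_scores = [
    [0.60, 0.40],
    [0.55, 0.45],
    [0.48, 0.52],
    [0.58, 0.42],
    [0.49, 0.51],
    [0.61, 0.39],
]

selection = select_per_frame(frame_scores)
switches = count_switches(selection)
```

With these scores the selected environment changes several times within a span of six frames, even though the underlying environment is assumed to be constant.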
Furthermore, the conventional SPLICE method compensates the feature vector, frame by frame, for only a single noise environment selected from a number of noise environments designed in advance. Because a noise environment designed in advance does not necessarily match the noise environment at the time of the speech recognition, the recognition performance may be degraded by a mismatch of the acoustic model.