1. Field of the Invention
The present invention relates to techniques for extracting features from speech signals. More particularly, the present invention relates to a technique for extracting delta and delta-delta features robust to reverberation, noise, and the like.
2. Description of Related Art
The noise robustness and reverberation robustness of speech recognition apparatuses have been continuously improved. However, recognition accuracy under adverse conditions is still insufficient. Regarding noise robustness, the recognition rate is known to be extremely low when the S/N ratio is extremely low, for example when driving a vehicle at high speed with a window open, and under non-stationary noise such as music and the bustle of crowds. Regarding reverberation robustness, the recognition rate is known to be extremely low in places where strong sound reflection and reverberation occur, such as a concrete corridor or an elevator hall, even when there is little noise.
The solutions to these problems examined so far can be classified into the following four types: (1) a front end method for removing reverberation, noise, and the like by preprocessing observed signals (for example, refer to Japanese Unexamined Patent Application Publication No. 2009-58708 and Japanese Unexamined Patent Application Publication No. 2004-347956), (2) a multi-style training method in which an acoustic model is trained using sounds that include reverberation, noise, and the like (for example, refer to Japanese Unexamined Patent Application Publication No. 2007-72481), (3) an adaptation method for transforming features or an acoustic model so that observed sounds match the acoustic model (for example, refer to Japanese Unexamined Patent Application Publication No. 2007-279444), and (4) a feature extraction method in which features robust to reverberation, noise, and the like are used (for example, refer to Takashi Fukuda, Osamu Ichikawa, Masafumi Nishimura, “Short- and Long-term Dynamic Features for Robust Speech Recognition”, Proc. of the 10th International Conference on Spoken Language Processing (ICSLP 2008/Interspeech 2008), pp. 2262-2265, September 2008, Brisbane, Australia).
Each of the aforementioned methods can be combined with the others. For example, methods (2), (3), and (4) may be combined as follows: linear discriminant analysis (LDA) is used for feature extraction, an acoustic model is created by multi-style training, and adaptation is then performed by maximum likelihood linear regression (MLLR). Thus, it is important to improve each of the methods (1) to (4), not just one of them.