There is a known technique for recognizing a speaker's emotion and the like from the speaker's voice.
In relation to this technique, there is an utterance modified speech recognition device that achieves high recognition performance even when only a small amount of speech data is available for learning an utterance modification model. The device learns an utterance modification model that represents the modification of a phoneme spectrum occurring in voice having an utterance modification. The device then applies a spectrum modifying process, based on the utterance modification model, to a standard voice model without utterance modifications, and outputs the result as a standard modified voice model. Next, the device performs a sound analysis on an input voice signal to obtain an utterance modified voice feature vector time series, and performs a recognizing process on that time series using both the standard modified voice model and the standard voice model without utterance modifications.
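The flow described above can be illustrated with a minimal sketch. It is not the method of the cited device. It assumes, purely for illustration, that each phoneme model is a single mean spectrum vector, that the utterance modification is one additive shift shared by all phonemes, and that recognition is a nearest-mean search over both model sets. The names learn_shift, apply_shift, and recognize are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def learn_shift(standard: Dict[str, Vector],
                modified_samples: Dict[str, Vector]) -> Vector:
    """Assumed modification model: the average per-dimension difference
    between modified spectra and the matching standard spectra."""
    dims = len(next(iter(standard.values())))
    shift = [0.0] * dims
    for phoneme, sample in modified_samples.items():
        for i in range(dims):
            shift[i] += sample[i] - standard[phoneme][i]
    return [s / len(modified_samples) for s in shift]


def apply_shift(standard: Dict[str, Vector], shift: Vector) -> Dict[str, Vector]:
    """Build the standard modified voice model by shifting every phoneme,
    including phonemes that had no modified training data."""
    return {p: [m + s for m, s in zip(mean, shift)]
            for p, mean in standard.items()}


def recognize(frame: Vector,
              standard: Dict[str, Vector],
              modified: Dict[str, Vector]) -> Tuple[str, str]:
    """Return (phoneme, model label) of the closest mean across both models."""
    best = None
    for label, model in (("standard", standard), ("modified", modified)):
        for phoneme, mean in model.items():
            dist = sum((x - m) ** 2 for x, m in zip(frame, mean))
            if best is None or dist < best[0]:
                best = (dist, phoneme, label)
    return best[1], best[2]


# Toy data: only phoneme "a" has modified speech available for learning.
standard_model = {"a": [1.0, 2.0], "i": [4.0, 0.0]}
shift = learn_shift(standard_model, {"a": [2.0, 2.5]})
modified_model = apply_shift(standard_model, shift)
```

Because the shift learned from "a" is also applied to "i", a modified utterance of "i" can be matched even though no modified sample of "i" was used in learning, which is the sense in which a small amount of speech data can suffice under these assumptions.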
Furthermore, there is a known speech recognition system that recognizes the level of a speaker's emotion. The speech recognition system includes, for example, a voice analysis unit, a dictionary unit, an acoustic model unit, an utterance modifying emotion model unit, and a voice-emotion recognition unit. The dictionary unit stores words for speech recognition. The acoustic model unit stores a model for use in the speech recognition. Specifically, it stores an acoustic model indicating the correspondence between the characters used in the dictionary unit and phonemes. The utterance modifying emotion model unit stores an utterance modifying emotion model indicating the correspondence between the characters used in the dictionary unit and phonemes when the emotion has changed. The voice-emotion recognition unit stores words expressed in phoneme units together with levels indicating the strength of the emotion.
The speech recognition system then compares the voice analysis result of the input voice, produced by the voice analysis unit, against the acoustic model and the dictionary connected in phoneme units by a model connecting unit, and picks the most likely word among those registered in the dictionary unit. Furthermore, the speech recognition system selects, from the voice-emotion recognition unit, the level indicating the strength of the emotion expressed by the input voice for the picked word.
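The word selection and emotion level lookup described above can be sketched as follows. This is an illustrative assumption, not the cited system: the lexicon entries, the use of phoneme edit distance as the likelihood measure, and the names edit_distance, LEXICON, and pick_word are all hypothetical.

```python
from typing import Dict, List, Tuple


def edit_distance(a: List[str], b: List[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # substitute or match
        prev = cur
    return prev[-1]


# Hypothetical table: (word, emotion level) -> phoneme sequence.
# Level 0 stands for the neutral pronunciation; higher levels stand for
# pronunciations altered by stronger emotion.
LEXICON: Dict[Tuple[str, int], List[str]] = {
    ("no", 0): ["n", "o"],
    ("no", 2): ["n", "o", "o"],
    ("now", 0): ["n", "a", "u"],
}


def pick_word(observed: List[str],
              lexicon: Dict[Tuple[str, int], List[str]]) -> Tuple[str, int]:
    """Return the (word, emotion level) whose phoneme sequence is closest
    to the observed phoneme sequence."""
    return min(lexicon, key=lambda key: edit_distance(observed, lexicon[key]))
```

In this sketch the word and the emotion level are selected in a single lookup, whereas the system described above first picks the word and then selects the level from the voice-emotion recognition unit.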
In addition, among speech recognition devices that recognize voice by comparing a synthetic voice model, to which noise adaptation and speaker adaptation have been applied, with a feature vector sequence obtained from the uttered voice during the utterance, a speech recognition device capable of reducing the computational complexity of performing noise adaptation, speaker adaptation, and the like on an initial voice model is well known.
[Patent Document 1] Japanese Laid-open Patent Publication No. 08-211887
[Patent Document 2] Japanese Laid-open Patent Publication No. 11-119791
[Patent Document 3] Japanese Laid-open Patent Publication No. 2004-109464
[Non-patent Document 1] “Speech recognition System” by Kiyohiro Kano, Katsunobu Ito, Tatsuya Kawahara, Kazuya Takeda, and Mikio Yamamoto, and published by Ohmsha
[Non-patent Document 2] “Introduction to Cluster Analysis” by Sadaaki Miyamoto, and published by Morikita Publication
[Non-patent Document 3] Douglas A. Reynolds/Richard C. Rose, “Robust text-independent speaker identification using Gaussian mixture speaker models”, IEEE Trans. on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83 1995
[Non-patent Document 4] Douglas A. Reynolds/Thomas F. Quatieri/Robert B. Dunn, “Speaker verification using adapted Gaussian Mixture models”, Digital Signal Processing, vol. 10, pp. 19-41 2000