In recent years, fraud and other malicious solicitations conducted over the telephone with the aim of defrauding people of money have become a social problem. To address this, techniques have been proposed for estimating a speaker's state of mind by analyzing the speaker's voice produced, for example, during a conversation conducted over a telephone line (refer, for example, to Japanese Laid-open Patent Publication No. 2011-242755).
Such techniques are based on the premise that the speech signal to be analyzed contains speech uttered by only the one particular speaker whose state of mind is to be estimated. However, a speech signal produced by recording a conversation contains speech from two or more speakers. To estimate the state of mind of one particular speaker with good accuracy from such a signal, it is necessary to identify, within the speech signal, the speech segments in which that speaker is speaking. In view of this, speaker indexing techniques have been proposed that append speaker identification information to the speech segment of each speaker in a monaural speech signal containing speech from a plurality of speakers (refer, for example, to Japanese Laid-open Patent Publication No. 2008-175955 and to Fredouille et al., “The LIA-EURECOM RT'09 Speaker Diarization System”, NIST paper of “The rich transcription 2009 Meeting recognition evaluation workshop”, 2009 (hereinafter referred to as non-patent document 1)).
For example, the indexing device disclosed in Japanese Laid-open Patent Publication No. 2008-175955 computes the degree of similarity between acoustic models created from speech features extracted at predetermined intervals of time, and creates an acoustic model from the speech features within each range where the degree of similarity is equal to or greater than a predetermined value. Then, using these acoustic models and the speech features within the range, the indexing device derives, for each second segment, a feature vector that characterizes the speech signal, and classifies the feature vectors on a speaker-by-speaker basis.
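The similarity-based grouping described above can be sketched roughly as follows. This is a minimal illustration using numpy, not the actual device of the publication: a single diagonal Gaussian per interval stands in for the acoustic model, a negated symmetric Kullback-Leibler divergence serves as the similarity measure, and the function names, interval length, and threshold are illustrative assumptions.

```python
import numpy as np

def gaussian_similarity(m1, v1, m2, v2):
    # Negated symmetric KL divergence between two diagonal Gaussians:
    # higher (closer to zero) means the two interval models are more alike.
    kl12 = 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1)
    kl21 = 0.5 * np.sum(np.log(v1 / v2) + (v2 + (m2 - m1) ** 2) / v1 - 1)
    return -(kl12 + kl21)

def index_speakers(features, frames_per_interval=50, sim_threshold=-5.0):
    # Fit a diagonal-Gaussian "acoustic model" to each fixed-length interval
    # of the feature sequence, then greedily group intervals whose model
    # similarity is equal to or greater than the threshold.
    n = len(features) // frames_per_interval
    models = []
    for i in range(n):
        seg = features[i * frames_per_interval:(i + 1) * frames_per_interval]
        models.append((seg.mean(axis=0), seg.var(axis=0) + 1e-6))

    labels = [-1] * n          # -1 marks an interval not yet assigned a speaker
    next_label = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = next_label
        for j in range(i + 1, n):
            if labels[j] == -1 and \
               gaussian_similarity(*models[i], *models[j]) >= sim_threshold:
                labels[j] = next_label
        next_label += 1
    return labels
```

In practice the features would be spectral vectors such as MFCCs extracted from the speech signal; here any real-valued feature matrix of shape (frames, dimensions) works.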
On the other hand, the speaker diarization system disclosed in non-patent document 1 first trains a Gaussian mixture model, treated as a single-state hidden Markov model (HMM), for every speech segment contained in a speech signal. Then, for each speech segment, the system repeats speaker labeling and retraining by adding the features of speech segments in which the same speaker is highly likely to be speaking to the corresponding HMM state.
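The label-and-retrain loop described above can be sketched roughly as follows, assuming numpy. A single diagonal Gaussian stands in for each state's GMM, and the initialization and likelihood details are simplifying assumptions rather than the actual system of non-patent document 1: each segment is labeled with the state of highest average log-likelihood, and each state is then retrained on the features of all segments carrying its label.

```python
import numpy as np

def diarize(segments, n_speakers=2, n_iters=10):
    # segments: list of (frames, dim) feature matrices, one per speech segment.
    dim = segments[0].shape[1]
    # Initialize each state's Gaussian from one of the first segments
    # (a simplifying assumption; the real system initializes differently).
    means = np.array([seg.mean(axis=0) for seg in segments[:n_speakers]])
    vars_ = np.array([seg.var(axis=0) + 1e-6 for seg in segments[:n_speakers]])
    labels = np.zeros(len(segments), dtype=int)

    for _ in range(n_iters):
        # 1) Labeling: assign each segment to the state whose Gaussian
        #    gives the highest average per-frame log-likelihood.
        for i, seg in enumerate(segments):
            ll = [np.mean(-0.5 * (np.log(2 * np.pi * vars_[k])
                                  + (seg - means[k]) ** 2 / vars_[k]).sum(axis=1))
                  for k in range(n_speakers)]
            labels[i] = int(np.argmax(ll))
        # 2) Retraining: refit each state on the pooled features of the
        #    segments labeled as that state, i.e. segments in which the
        #    same speaker is highly likely to be speaking.
        for k in range(n_speakers):
            if np.any(labels == k):
                feats = np.vstack([s for s, l in zip(segments, labels) if l == k])
                means[k] = feats.mean(axis=0)
                vars_[k] = feats.var(axis=0) + 1e-6
    return labels
```

Alternating these two steps lets segments that initially received different labels migrate toward a common state as its model absorbs more of that speaker's features.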