To support the preparation of meeting minutes, the utterances contained in speech recorded at a meeting must be clustered by speaker. Many techniques that cluster utterances by speaker using acoustic features extracted from the meeting speech have already been reported. For example, the similarity between the acoustic feature of each utterance and each of many previously trained speaker models is calculated, and each utterance is assigned to a speaker based on the pattern of these similarities.
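The similarity-pattern approach described above can be illustrated with a minimal sketch. It is not the implementation of any particular reported technique. It assumes each utterance is already summarized as a fixed-length feature vector, that each trained speaker model can be stood in for by a reference vector, that cosine similarity is the similarity measure, and that utterances are grouped greedily when their similarity patterns agree. The function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_pattern(feature, speaker_models):
    """Similarity of one utterance's feature to every speaker model."""
    return [cosine(feature, model) for model in speaker_models]


def cluster_utterances(features, speaker_models, threshold=0.95):
    """Greedily group utterances whose similarity patterns are alike.

    Assumption: two utterances belong to the same speaker when the cosine
    similarity between their patterns is at least `threshold`.
    Returns one cluster label per utterance, in input order.
    """
    representatives = []  # pattern of the first utterance seen in each cluster
    labels = []
    for feature in features:
        pattern = similarity_pattern(feature, speaker_models)
        for index, representative in enumerate(representatives):
            if cosine(pattern, representative) >= threshold:
                labels.append(index)
                break
        else:
            # No existing cluster matched, so this utterance starts a new one.
            representatives.append(pattern)
            labels.append(len(representatives) - 1)
    return labels


if __name__ == "__main__":
    # Toy data: reference vectors standing in for four trained speaker models.
    models = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]]
    # Four utterances alternating between two speakers.
    utterances = [[0.9, 0.1, 0.2], [0.1, 0.2, 0.9],
                  [1.0, 0.05, 0.25], [0.05, 0.3, 1.0]]
    print(cluster_utterances(utterances, models))
```

Note that the speaker models here need not correspond to the actual meeting participants. Only the pattern of similarities across the models is compared, which is why utterances can be grouped without enrolling each participant in advance.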
However, with the above-mentioned methods, which rely only on acoustic features, utterances cannot be clustered correctly when the quality of the meeting speech is degraded, for example by background noise.