As to signal clustering technique, an acoustic signal is finely divided into each segment, and segments having similar feature are clustered as the same class. By using this technique, in a meeting or a broadcast program including a plurality of participates, an acoustic signal (acquired from the meeting or the broadcast program) is clustered for each speaker. Furthermore, in a video (such as a home video), by distinguishing a background sound at a place where the video is captured, the acoustic signal is clustered for each event or each scene. Hereinafter, one unit including an utterance of the speaker or a specific event is called “a scene”.
As to a conventional technique, in order to characterize each segment divided from an acoustic signal, a plurality of reference models is generated from the acoustic signal to be processed. Then, an observation probability (Hereinafter, it is called “a likelihood”) between each segment and each reference model is calculated. In this case, the reference model is represented by an acoustic feature. Especially, segments (divided signals) belonging to the same scene have a high likelihood for a specific reference model, i.e., a similar feature.
In this conventional technique, when reference models are generated from an acoustic signal comprising scenes having various durations, the number of reference models (representing each scene) depends on a duration of the scene. In other words, a plurality of reference models is often generated based on the scene. Briefly, when duration of a scene is longer, the number of reference models representing the scene becomes larger. Accordingly, if a segment does not have a high likelihood for all reference models representing a specific scene, the segment cannot be clustered to the specific scene. Furthermore, by clustering segments to a scene having a long duration (represented by the large number of reference models), information of another scene having a short duration (represented by the small number of reference models) becomes unnoticeable. As a result, detection of another scene having the short duration is often missed.