Non-patent literature 1 describes an example of a technique of sorting, by speaker, a plurality of sound segments formed from the voices of a plurality of speakers. In non-patent literature 1, first, every sound segment is defined as a separate cluster. Then, merging of a pair of similar clusters is successively repeated, thereby clustering the sound segments. Whether to merge two clusters is determined by modeling the two clusters before and after the merge and comparing the BIC (Bayesian Information Criterion) values of the two models. The technique of non-patent literature 1 uses a model assuming that the feature amounts of the samples included in each cluster follow a single Gaussian distribution. As the feature amount, for example, an MFCC (Mel-Frequency Cepstrum Coefficient), which is often used in speech recognition, is used. In this case, the BIC for a given clustering result (c1, c2, . . . , cK) is given by
[Mathematical 1]
$$\mathrm{BIC}(c_1, c_2, \ldots, c_K; K) = -\sum_{k=1}^{K} \log P(X_k \mid \mu_k, \Sigma_k) + \lambda \cdot \frac{K}{2}\left(d + \frac{d(d+1)}{2}\right)\log N \tag{1}$$
where K is the number of clusters, P(Xk|μk, Σk) is the likelihood of the samples Xk included in the kth cluster under a Gaussian distribution with mean μk and covariance Σk, λ is a penalty coefficient that is normally set to 1, d is the number of dimensions of the feature amount, and N is the total number of samples. The first term represents the goodness of fit of the samples to the model. The second term represents the penalty on the complexity of the model: d + d(d+1)/2 is the number of free parameters of one Gaussian distribution (d for the mean and d(d+1)/2 for the symmetric covariance matrix), so the penalty increases as the number of clusters increases. The smaller the value of the BIC is, the more preferable the model is. In general, when a model becomes more complex, the goodness of fit (likelihood) of the samples increases. Since the BIC penalizes the complexity of a model, a model having appropriate complexity can be selected.
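Equation (1) can be evaluated directly once each cluster is fitted with a maximum-likelihood Gaussian. The following is a minimal sketch, not the implementation of non-patent literature 1: the function names `gaussian_log_likelihood` and `bic` are hypothetical, NumPy is assumed to be available, and each cluster is assumed to contain more samples than feature dimensions so that its covariance matrix is non-singular.

```python
import numpy as np


def gaussian_log_likelihood(X):
    """Log-likelihood of samples X (n x d) under a single Gaussian whose
    mean and covariance are the maximum-likelihood estimates from X."""
    n, d = X.shape
    # bias=True gives the maximum-likelihood (1/n) covariance estimate.
    cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("singular covariance: need more samples than dimensions")
    # With ML estimates the quadratic term sums to n * d, giving a closed form.
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)


def bic(clusters, lam=1.0):
    """BIC of a clustering result, following equation (1).

    clusters: list of arrays, one (n_k x d) array of feature vectors per cluster.
    lam: penalty coefficient (lambda), normally 1.
    """
    K = len(clusters)
    d = clusters[0].shape[1]
    N = sum(len(X) for X in clusters)
    fit = -sum(gaussian_log_likelihood(X) for X in clusters)
    penalty = lam * (K / 2.0) * (d + d * (d + 1) / 2.0) * np.log(N)
    return fit + penalty
```

For two well-separated groups of feature vectors, `bic([a, b])` is expected to be smaller than `bic([np.vstack([a, b])])`, because the gain in fit from using two Gaussians outweighs the penalty for the extra cluster.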
Clustering is performed by repeatedly merging two clusters whenever the change amount ΔBIC of the BIC caused by merging them satisfies ΔBIC<0. Let X1 be the set of samples included in a cluster c1, and X2 be the set of samples included in a cluster c2. When the two clusters are merged, the change amount ΔBIC of the BIC is given by
[Mathematical 2]
$$\Delta\mathrm{BIC} = \log\frac{P(X_1 \mid \mu_1, \Sigma_1)\cdot P(X_2 \mid \mu_2, \Sigma_2)}{P(X_1, X_2 \mid \mu, \Sigma)} - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N \tag{2}$$
where P(X1|μ1, Σ1) and P(X2|μ2, Σ2) are the likelihoods of the samples included in the clusters c1 and c2, respectively, and P(X1, X2|μ, Σ) is the likelihood of the samples when the two clusters are merged into one.
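As a consistency check, equation (2) can be obtained from equation (1) by taking ΔBIC to be the BIC after the merge minus the BIC before the merge. This sign convention is an assumption, chosen because it agrees with merging when ΔBIC<0 and with a smaller BIC being preferable. Only the likelihood terms of c1 and c2 and the cluster count (K before the merge, K-1 after it) change:

```latex
\begin{aligned}
\Delta\mathrm{BIC}
&= \mathrm{BIC}_{\text{after}} - \mathrm{BIC}_{\text{before}} \\
&= -\log P(X_1, X_2 \mid \mu, \Sigma)
   + \log P(X_1 \mid \mu_1, \Sigma_1)
   + \log P(X_2 \mid \mu_2, \Sigma_2)
   + \lambda \cdot \frac{(K-1) - K}{2}\left(d + \frac{d(d+1)}{2}\right)\log N \\
&= \log\frac{P(X_1 \mid \mu_1, \Sigma_1)\cdot P(X_2 \mid \mu_2, \Sigma_2)}
            {P(X_1, X_2 \mid \mu, \Sigma)}
   - \frac{\lambda}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
\end{aligned}
```

When the Gaussian parameters are maximum-likelihood estimates, the first term is non-negative, since two separate models fit their samples at least as well as one merged model. A merge is therefore accepted only when this loss of fit is smaller than the penalty saved by removing one cluster.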
Cluster merging is successively repeated in this way. The merging ends when ΔBIC≥0 holds for every pair of clusters. The number of clusters is thus automatically determined.
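The merging procedure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation of non-patent literature 1: the names `gaussian_log_likelihood`, `delta_bic`, and `cluster_segments` are hypothetical, NumPy is assumed to be available, every segment is assumed to hold more feature vectors than feature dimensions, and the choice to merge the pair with the lowest ΔBIC at each step is an assumption (the description above only requires ΔBIC<0 for a merge).

```python
import itertools

import numpy as np


def gaussian_log_likelihood(X):
    """Log-likelihood of X (n x d) under its maximum-likelihood Gaussian."""
    n, d = X.shape
    cov = np.atleast_2d(np.cov(X, rowvar=False, bias=True))
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("singular covariance: need more samples than dimensions")
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)


def delta_bic(X1, X2, n_total, lam=1.0):
    """Change in BIC when clusters X1 and X2 are merged, following equation (2)."""
    d = X1.shape[1]
    merged = np.vstack([X1, X2])
    # Log of the likelihood ratio: separate models versus the merged model.
    log_ratio = (gaussian_log_likelihood(X1) + gaussian_log_likelihood(X2)
                 - gaussian_log_likelihood(merged))
    penalty = 0.5 * lam * (d + d * (d + 1) / 2.0) * np.log(n_total)
    return log_ratio - penalty


def cluster_segments(segments, lam=1.0):
    """Greedy bottom-up clustering; returns lists of segment indices."""
    data = [np.asarray(s, dtype=float) for s in segments]
    clusters = [[i] for i in range(len(data))]  # start: one cluster per segment
    n_total = sum(len(x) for x in data)         # N: total number of samples
    while len(clusters) > 1:
        scores = {
            (i, j): delta_bic(data[i], data[j], n_total, lam)
            for i, j in itertools.combinations(range(len(clusters)), 2)
        }
        (i, j), best = min(scores.items(), key=lambda item: item[1])
        if best >= 0:
            break  # no pair satisfies delta BIC < 0, so merging ends
        # Merge cluster j into cluster i (i < j, so deleting j keeps i valid).
        clusters[i] = clusters[i] + clusters[j]
        data[i] = np.vstack([data[i], data[j]])
        del clusters[j]
        del data[j]
    return clusters
```

Calling `cluster_segments` on a list of per-segment feature arrays returns one list of segment indices per cluster; the number of returned lists is the automatically determined number of clusters.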
Patent literature 1 describes a technique of analyzing an input video, sorting image segments and sound segments, and associating an image segment and a sound segment that include the same object with each other based on the similarity of the segments. In patent literature 1, a feature amount is calculated for each of the image segments and sound segments of the input video. The image segments or sound segments are sorted into groups. The obtained image segment groups and sound segment groups are associated with each other based on their temporal simultaneity. As a result, groups of sound segments and image segments sorted for each object are obtained.
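One way to picture association by temporal simultaneity is to measure how long each sound segment group and each image segment group occur at the same time. The sketch below is purely illustrative and is not taken from patent literature 1: the interval representation, the group names, the functions `overlap_seconds` and `associate`, and the rule of choosing the image group with the largest overlap are all assumptions. Intervals within one group are assumed not to overlap each other.

```python
def overlap_seconds(group_a, group_b):
    """Total time during which an interval of group_a and an interval of
    group_b co-occur. Each group is a list of (start, end) times in seconds."""
    total = 0.0
    for a_start, a_end in group_a:
        for b_start, b_end in group_b:
            total += max(0.0, min(a_end, b_end) - max(a_start, b_start))
    return total


def associate(sound_groups, image_groups):
    """Map each sound group name to the image group name it overlaps most.

    Both arguments are dicts of {group name: list of (start, end) intervals}.
    Sound groups with no overlap at all are left unassociated.
    """
    pairs = {}
    for s_name, s_intervals in sound_groups.items():
        scores = {
            i_name: overlap_seconds(s_intervals, i_intervals)
            for i_name, i_intervals in image_groups.items()
        }
        best = max(scores, key=scores.get)
        if scores[best] > 0:
            pairs[s_name] = best
    return pairs


# Hypothetical example data: times are in seconds from the start of the video.
sound_groups = {"voice_A": [(0, 5), (10, 15)], "voice_B": [(5, 10)]}
image_groups = {"face_1": [(0, 6), (9, 15)], "face_2": [(4, 10)]}
print(associate(sound_groups, image_groups))
```

In this example, `voice_A` overlaps `face_1` for 10 seconds and `face_2` for 1 second, while `voice_B` overlaps `face_2` for 5 seconds and `face_1` for 2 seconds, so each voice is paired with the face that is on screen while it is heard.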