Under various situations, it is necessary to measure the difference between two Probability Distribution Functions (PDF's). For example, in text-independent speaker recognition using Gaussian mixture models (GMM), a given piece of speech can be classified by comparing its GMM model with a set of given GMM models. D. A. Reynolds and R. C. Rose, “Robust Text-Independent Speaker Identification Using Gaussian Mixture Speaker Models,” IEEE Trans. on Speech and Audio Processing, Vol. 3, No. 1, pp. 72–83 (1995). Another scenario is to detect the difference among the observation probabilities, again often characterized by GMM's, of each state of a continuous Hidden Markov Model (HMM), so that similar states can be merged to simplify the overall model in speech recognition tasks. Q. Huang, Z. Liu, A. Rosenberg, D. Gibbon, B. Shahraray, “Automated Generation of News Content Hierarchy by Integrating Audio, Video, and Text Information,” Proc. of IEEE ICASSP 99, Vol. IV, pp. 3025–28 (Phoenix, March, 1999). Although much needed, there is so far no simple way to measure the distance between two mixture PDF's.
There are three well-known properties of a distance measure, namely non-negativity, symmetry, and the triangle inequality. Let G(x), F(x), and H(x) be three PDF's, and denote by D(G,F) the distance between G(x) and F(x). The three properties can then be formally expressed as

D(G,F) ≥ 0, and D(G,F) = 0 iff G = F,  (1)

D(G,F) = D(F,G),  (2)

D(G,H) + D(H,F) ≥ D(G,F).  (3)
There are different approaches to measuring the difference between two PDF's; we summarize them into three categories below. The approaches may or may not satisfy the three distance properties.
The first approach defines the distance in L_r space by

D_Lr(G,F) = ( ∫_{x∈X} |G(x) − F(x)|^r dx )^{1/r},  (4)

where commonly used values of r are 1 and 2. Although it satisfies all three distance properties, D_Lr is usually computed by numerical methods, so its computational complexity can easily go out of control as the dimension increases.
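As a concrete illustration of the numerical evaluation that (4) requires, the sketch below approximates the L_r distance between two one-dimensional densities with a simple Riemann sum. The function names and integration bounds are illustrative choices, not part of the original text; in higher dimensions this same quadrature becomes the source of the complexity blow-up noted above.

```python
import math

def gauss_pdf(x, m, s):
    """Density of a 1-D Gaussian N(m, s) at x (s is the standard deviation)."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def lr_distance(g, f, r=2, lo=-10.0, hi=10.0, steps=20000):
    """Numerical L_r distance, eq. (4), via a midpoint Riemann sum on [lo, hi]."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * dx
        total += abs(g(x) - f(x)) ** r * dx
    return total ** (1.0 / r)

# Identical PDF's give distance 0; distinct ones give a positive distance.
g = lambda x: gauss_pdf(x, 0.0, 1.0)
f = lambda x: gauss_pdf(x, 1.0, 1.0)
print(lr_distance(g, g))  # ~0
print(lr_distance(g, f))  # > 0
```

Note that each extra dimension multiplies the number of grid points by `steps`, which is exactly why the closed-form alternatives discussed below are attractive.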
The second approach is the relative entropy, or Kullback-Leibler distance (KLD). T. M. Cover and J. A. Thomas, Elements of Information Theory (John Wiley & Sons, 1991). It is defined as

D_KL(G,F) = ∫_{x∈X} G(x) log( G(x) / F(x) ) dx.  (5)
It is obvious that the straightforward KLD defined above satisfies only the first property. By extending the original KLD to D_KL(G,F) + D_KL(F,G), one can force it to meet the symmetry property. Although the third property still does not hold, the extended KLD is popular in many applications for lack of better alternatives. To compute the KLD, different approximation schemes are often employed. For example, data sequences T_G and T_F can be generated from models G and F, and the average log-likelihood ratio of the sequences with respect to G(x) and F(x) can then be used to approximate the extended KLD. That is,

D_Seq(G,F) = (1/N) [ log( p(T_G|G) / p(T_G|F) ) + log( p(T_F|F) / p(T_F|G) ) ],  (6)

where N is the length of the data sequences T_G and T_F. The performance of D_Seq depends on both the value of N and the data-generation procedure: the larger N is, the more reliable the approximation, but also the more expensive the estimation.
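A minimal sketch of the sequence-based approximation (6) for two one-dimensional single Gaussians is given below. It samples T_G from G and T_F from F and averages the two log-likelihood ratios; the function name, sample size, and seed are illustrative assumptions.

```python
import math
import random

def gauss_pdf(x, m, s):
    """Density of a 1-D Gaussian N(m, s) at x (s is the standard deviation)."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def d_seq(mg, sg, mf, sf, n=50000, seed=0):
    """Monte Carlo approximation of the extended KLD, eq. (6):
    draw T_G ~ G and T_F ~ F of length n, then average the
    log-likelihood ratios of each sequence under the two models."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        xg = rng.gauss(mg, sg)   # one sample of T_G
        xf = rng.gauss(mf, sf)   # one sample of T_F
        total += math.log(gauss_pdf(xg, mg, sg) / gauss_pdf(xg, mf, sf))
        total += math.log(gauss_pdf(xf, mf, sf) / gauss_pdf(xf, mg, sg))
    return total / n

# For N(0,1) vs N(1,1) the exact extended KLD is (m1-m2)^2 = 1,
# so the estimate should be close to 1.0 for large n.
print(d_seq(0.0, 1.0, 1.0, 1.0))
```

Rerunning with a larger `n` tightens the estimate, at proportionally higher cost, which is the N trade-off described above.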
The third approach is to compute the distance directly from the respective model parameters. Ideally, such a method can achieve at least comparable performance with a precise closed-form solution, which in turn leads to a much more efficient computational procedure. Unfortunately, the existing method in this category can handle only simplified (or degenerate) cases. For example, if G(m1, σ1) and F(m2, σ2) are single Gaussians from two individual PDF's, where m1, m2, σ1, and σ2 are their corresponding means and standard deviations, the extended KLD between G and F in this simplified single-mixture case can be computed directly from the model parameters as

D_P(G,F) = σ1²/σ2² + σ2²/σ1² − 2 + ( (σ1² + σ2²) / (σ1² σ2²) ) (m1 − m2)²,  (7)

ignoring the constant multiple.
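The closed form (7) is a one-line computation; the sketch below transcribes it directly (the function name is an illustrative choice). Note that, with the constant multiple of ½ dropped, it equals twice the exact extended KLD of the two Gaussians.

```python
def d_p(m1, s1, m2, s2):
    """Parameter-based distance between single Gaussians G(m1, s1) and
    F(m2, s2), eq. (7), with the constant multiple (1/2) dropped."""
    v1, v2 = s1 ** 2, s2 ** 2  # variances
    return v1 / v2 + v2 / v1 - 2 + (v1 + v2) / (v1 * v2) * (m1 - m2) ** 2

print(d_p(0, 1, 0, 1))  # 0.0: identical Gaussians
print(d_p(0, 1, 1, 1))  # 2.0: unit mean shift, equal unit variances
```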
Even though the computation of D_P is simple and can be extended to handle Gaussians of higher dimensions, it cannot deal with multiple-mixture PDF's. Even when the models are first simplified (using one Gaussian to approximate multiple Gaussians) so that (7) can be applied, the outcome is often ineffective. This can be illustrated with a simple example. Consider two GMM's G = ⅓*N(−2, 1) + ⅔*N(1, 1) and F = ⅓*N(2, 1) + ⅔*N(−1, 1), where N(m, σ) is a Gaussian distribution with mean m and standard deviation σ. Both G and F have two Gaussian components that are obviously distributed quite differently; hence, the distance between G and F is clearly not zero. To apply (7), both G and F must be simplified into single-mixture Gaussians, denoted G′(mG, σG) and F′(mF, σF), whose means and standard deviations are derived as the weighted averages of the means and standard deviations of their components. This yields the same mean (mG = mF = 0) and the same standard deviation (σG = σF) for both G′ and F′, which leads to D_P(G′, F′) = 0. Evidently, the measure derived this way fails to capture the obvious difference between the two original PDF's.
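The failure mode just described can be reproduced numerically. The sketch below collapses each mixture to a single Gaussian by the weighted averaging described in the text (the helper names are illustrative): both collapse to N(0, 1), so D_P vanishes, even though the two mixture densities visibly differ at, e.g., x = 1.

```python
import math

def gauss_pdf(x, m, s):
    """Density of a 1-D Gaussian N(m, s) at x (s is the standard deviation)."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2 * math.pi))

def gmm_pdf(x, comps):
    """Mixture density; comps is a list of (weight, mean, std) triples."""
    return sum(w * gauss_pdf(x, m, s) for w, m, s in comps)

# The two GMM's from the example above.
G = [(1 / 3, -2.0, 1.0), (2 / 3, 1.0, 1.0)]
F = [(1 / 3, 2.0, 1.0), (2 / 3, -1.0, 1.0)]

def collapse(comps):
    """Simplify a mixture to one Gaussian via weighted averages of the
    component means and standard deviations, as described in the text."""
    m = sum(w * mu for w, mu, _ in comps)
    s = sum(w * sd for w, _, sd in comps)
    return m, s

mG, sG = collapse(G)
mF, sF = collapse(F)
print((mG, sG), (mF, sF))  # both collapse to (0.0, 1.0), so D_P(G', F') = 0

# Yet the original mixtures clearly differ, e.g. at x = 1:
print(gmm_pdf(1.0, G), gmm_pdf(1.0, F))
```

The two printed densities at x = 1 differ by roughly 0.15, confirming that the collapsed-model distance of zero misrepresents the true difference between G and F.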
Therefore, there is a need to develop other alternatives that can effectively measure the difference between mixture PDF's directly from their model parameters.