The following relates to the information processing arts and related arts.
The softmax function is the extension of the sigmoid function for more than two variables. The softmax function has the form:
$$y_n = \frac{e^{x_n}}{\sum_{k=1}^{K} e^{x_k}}. \qquad (1)$$

The denominator is a sum-of-exponentials function of the form:
$$\sum_{k=1}^{K} e^{x_k}, \qquad (2)$$

where $x_k \in x \in \mathbb{R}^K$. The softmax function finds application in neural networks, classifiers, and so forth, while the sum-of-exponentials function finds even more diverse application in these fields as well as in statistical thermodynamics (for example, the partition function), quantum mechanics, information science, classification, and so forth. For some applications a log of the sum-of-exponentials function is a more useful formulation.
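The softmax and the log of the sum-of-exponentials can be illustrated with a minimal sketch. The function names `log_sum_exp` and `softmax` are chosen here for illustration, and subtracting the maximum before exponentiating is a common numerical safeguard assumed by this sketch rather than something the text prescribes.

```python
import math

def log_sum_exp(x):
    """Log of the sum-of-exponentials, log(sum_k exp(x_k))."""
    m = max(x)  # shift by the maximum so that no exponent exceeds zero
    return m + math.log(sum(math.exp(v - m) for v in x))

def softmax(x):
    """Softmax y_n = exp(x_n) / sum_k exp(x_k), computed via log_sum_exp."""
    lse = log_sum_exp(x)
    return [math.exp(v - lse) for v in x]

# Illustrative input with K = 3; the outputs are positive and sum to one.
y = softmax([1.0, 2.0, 3.0])
```

For K = 2 with the second input fixed at zero, the first output of `softmax` coincides with the sigmoid of the first input.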
One application of the softmax function is in the area of inference problems, such as Gaussian process classifiers, Bayesian multiclass logistic regression, and more generally for deterministic approximation of probabilistic models dealing with discrete variables conditioned on continuous variables. Such applications entail computing the expectation of the log-sum-of-exponentials function:
$$\gamma = E_Q\left[\log \sum_{k=1}^{K} e^{\beta_k^T x}\right], \qquad (3)$$

where $E_Q$ denotes the expectation for a distribution $Q(\beta)$ which is the probability density function (pdf) of a given multidimensional distribution in $\mathbb{R}^{d \times K}$ and $x$ is a vector of $\mathbb{R}^d$. The expectation can be computed using Monte Carlo simulations, but this can be computationally expensive. Taylor expansion techniques are also known, but tend to provide skewed results when the variance of the pdf $Q(\beta)$ is large.
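The Monte Carlo approach mentioned above can be sketched as follows. This is only an illustration under stated assumptions: Q is taken to be an isotropic Gaussian, and the names `mc_expected_lse`, `mean`, and `std` are hypothetical. Each sample requires drawing all d×K entries of β, which is the source of the computational cost.

```python
import math
import random

def log_sum_exp(x):
    """Log of the sum-of-exponentials, shifted by the maximum for stability."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def mc_expected_lse(mean, std, x, n_samples=10000, seed=0):
    """Monte Carlo estimate of E_Q[log sum_k exp(beta_k^T x)].

    Assumption of this sketch: Q is an isotropic Gaussian, so entry j of
    beta_k is drawn independently from N(mean[k][j], std**2).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        scores = []
        for mean_k in mean:
            beta_k = [rng.gauss(m, std) for m in mean_k]  # one column of beta
            scores.append(sum(b * xj for b, xj in zip(beta_k, x)))
        total += log_sum_exp(scores)
    return total / n_samples

# Illustrative values: K = 3 classes, d = 2 features.
mean = [[0.5, -1.0], [0.0, 0.3], [-0.2, 0.8]]
x = [1.0, 2.0]
gamma_hat = mc_expected_lse(mean, std=0.5, x=x)
```

Because the log-sum-of-exponentials is convex, the estimate is expected to exceed the value obtained by plugging in the mean of β, and the gap grows with the variance of Q.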
Another known approach for computing the expectation is to use an upper bound. In this approach an upper bound on the log-sum-of-exponentials function is identified, from which an estimate of the expectation is obtained. The chosen upper bound should be tight with respect to the log-sum-of-exponentials function, and should be computationally advantageous for computing the expectation. For the log-sum-of-exponentials function, a known upper bound having a quadratic form is given by:
$$\log \sum_{k=1}^{K} e^{x_k} \le \sum_{k=1}^{K} (x_k - \chi_k)^2 - \frac{1}{K}\left(\sum_{k=1}^{K} (x_k - \chi_k)\right)^2 + \sum_{k=1}^{K} \frac{(x_k - \chi_k)\, e^{\chi_k}}{\sum_{k'=1}^{K} e^{\chi_{k'}}} + \log \sum_{k=1}^{K} e^{\chi_k}. \qquad (4)$$

See, e.g., Krishnapuram et al., Sparse multinomial logistic regression: fast algorithms and generalization bounds. IEEE Trans. Pattern Anal. Mach. Intell., 27(6):957-68, 2005; Böhning, Multinomial logistic regression algorithm. Annals of the Institute of Statistical Mathematics, 44(9):197-200, 1992. These quadratic bounds are generally tight. However, they use the worst curvature over the space, which can result in inefficient integration when using the upper bound.
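A quadratic bound of this kind can be checked numerically with a short sketch. The form used here is an assumption of the sketch: the bound is expanded about a point χ, its linear term uses the softmax of χ, and its quadratic term is Σ(x_k − χ_k)² − (1/K)(Σ(x_k − χ_k))², a fixed curvature that does not depend on χ. The name `quadratic_upper_bound` is hypothetical.

```python
import math
import random

def log_sum_exp(x):
    """Log of the sum-of-exponentials, shifted by the maximum for stability."""
    m = max(x)
    return m + math.log(sum(math.exp(v - m) for v in x))

def quadratic_upper_bound(x, chi):
    """Quadratic upper bound on log_sum_exp(x), expanded about chi.

    Assumed form: quadratic term + linear term + log_sum_exp(chi), where the
    quadratic term has the same curvature for every chi.
    """
    K = len(x)
    d = [xk - ck for xk, ck in zip(x, chi)]
    lse_chi = log_sum_exp(chi)
    grad = [math.exp(ck - lse_chi) for ck in chi]  # softmax of chi
    quad = sum(dk * dk for dk in d) - (sum(d) ** 2) / K
    lin = sum(dk * gk for dk, gk in zip(d, grad))
    return quad + lin + lse_chi

# Compare the bound with the exact value at randomly drawn points.
rng = random.Random(0)
x = [rng.uniform(-3.0, 3.0) for _ in range(4)]
chi = [rng.uniform(-3.0, 3.0) for _ in range(4)]
gap = quadratic_upper_bound(x, chi) - log_sum_exp(x)  # non-negative
```

The gap vanishes at x = χ and grows as x moves away from χ, since the curvature is fixed at its worst-case value rather than adapted to the local shape of the function.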
As a result, the use of the softmax function for inference problems with more than two variables has heretofore been computationally difficult or impossible for many practical inference and classification problems.