Information Theory
The idea of using an entropy function to describe the information content of a system was first introduced by C. E. Shannon in his pioneering work, “A Mathematical Theory of Communication”, Bell System Technical Journal, 27, 379-423; 623-656 (1948). Shannon showed that a definition of entropy similar in form to the corresponding definition in statistical mechanics could be used to measure the information gained from the selection of a specific event among an ensemble of possible events. Shannon's entropy function can be represented as:

$$H(p_1, p_2, \ldots, p_n) = -\sum_{k=1}^{n} p_k \ln p_k$$

where $p_k$ represents the probability of occurrence of the k'th event. This function uniquely satisfies the following three conditions:
    1. $H(p_1, \ldots, p_n)$ is a maximum for $p_k = 1/n$, $k = 1, \ldots, n$. This implies that a uniform probability distribution possesses the maximum entropy. In addition, $H_{max}(1/n, 1/n, \ldots, 1/n) = \ln n$. Therefore, the entropy of a uniform probability distribution scales logarithmically with the number of possible states;
    2. $H(AB) = H(A) + H_A(B)$, where A and B are two finite schemes. $H(AB)$ represents the total entropy of schemes A and B, and $H_A(B)$ is the conditional entropy of scheme B given scheme A. When the two scheme distributions are mutually independent, $H_A(B) = H(B)$;
    3. $H(p_1, p_2, \ldots, p_n, 0) = H(p_1, p_2, \ldots, p_n)$. Any event with zero probability of occurrence in a scheme does not change the entropy function.
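The three conditions can be checked numerically. The following is a minimal sketch, not taken from Shannon's paper: it assumes the standard sign convention H = -Σ p_k ln p_k (so that the uniform maximum equals ln n) and treats 0 ln 0 as 0. The function name `shannon_entropy` and the example distributions are illustrative assumptions.

```python
import math
from itertools import product


def shannon_entropy(probs, tol=1e-9):
    """Return H = -sum(p_k * ln p_k) in nats; terms with p_k = 0 contribute 0."""
    if any(p < 0 for p in probs):
        raise ValueError("probabilities must be non-negative")
    if abs(sum(probs) - 1.0) > tol:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)


# Condition 1: the uniform distribution over n events attains the maximum, ln n.
n = 4
uniform = [1.0 / n] * n
skewed = [0.7, 0.1, 0.1, 0.1]  # hypothetical non-uniform scheme
h_uniform = shannon_entropy(uniform)
h_skewed = shannon_entropy(skewed)

# Condition 2 (independent case): H(AB) = H(A) + H(B).
scheme_a = [0.5, 0.5]
scheme_b = [0.2, 0.3, 0.5]
joint = [pa * pb for pa, pb in product(scheme_a, scheme_b)]
h_joint = shannon_entropy(joint)

# Condition 3: appending a zero-probability event leaves H unchanged.
h_padded = shannon_entropy(skewed + [0.0])

print(h_uniform, math.log(n))
print(h_skewed, h_padded)
print(h_joint, shannon_entropy(scheme_a) + shannon_entropy(scheme_b))
```

Each printed pair should agree to within floating-point rounding, and the skewed scheme should show a lower entropy than the uniform one.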
Shannon's work was directed to describing the information content of one-dimensional electrical signals. In his book Physics from Fisher Information: A Unification, Cambridge University Press, 1998, Roy Frieden describes the “Shannon entropy” as a global information measure across an entire data set. An alternative informational measure, known as “Fisher entropy”, is also described by Frieden as a measure of local information across a data set. For mathematical modeling, Frieden has recently shown that Fisher entropy is particularly well suited to the discovery of physical laws.
More recently, T. Nishi has used the Shannon entropy function to define a normalized “informational entropy” function, which can be applied to any data set. See: Hayashi, T. and Nishi, T., “Morphology and Physical Properties of Polymer Alloys”, Proceedings of the International Conference on ‘Mechanical Behaviour of Materials VI’, Kyoto, 325, 1991. See also: Hayashi, T., Watanabe, A., Tanaka, H., and Nishi, T., “Morphology and Physical Properties of Three-Component Incompatible Polymer Alloys”, Kobunshi Ronbunshu, 49 (4), 373-82, 1992.
Nishi's definition can be summarized as follows: Consider a data set $D = \{d_1, \ldots, d_n\}$ with n data elements. If the sum of all the elements, $d_{tot}$, is defined as

$$d_{tot} = \sum_{i=1}^{n} d_i,$$

then $d_{tot}$ can be used to normalize each of the data elements such that $f_i = d_i / d_{tot}$ for all $i \in \{1, \ldots, n\}$. It is then possible to define an informational entropy function, E:

$$E = \frac{\sum_{i} f_i \ln f_i}{\ln(1/n)}.$$
The entropy function E has the useful property that it is normalized between 0 and 1. A perfectly uniform distribution, where $f_i = 1/n$ for every i, results in an E value of 1. As the distribution becomes less uniform, the value of E drops and asymptotically approaches zero. A significant advantage of the Nishi informational entropy function E is that it characterizes the uniformity of any distribution, regardless of its shape. In contrast, the commonly used “standard deviation” is usually interpreted in standard statistics only for Gaussian distributions.
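The normalized entropy E can be computed directly from its definition. The sketch below is illustrative only and is not drawn from the cited Nishi references: the function name `informational_entropy`, the restriction to non-negative data with n ≥ 2, and the convention that elements with f_i = 0 contribute nothing to the sum are all assumptions made here.

```python
import math


def informational_entropy(data):
    """Return E = (sum_i f_i ln f_i) / ln(1/n), where f_i = d_i / d_tot."""
    n = len(data)
    if n < 2:
        raise ValueError("need at least two elements, since ln(1/n) = 0 for n = 1")
    if any(d < 0 for d in data):
        raise ValueError("data elements must be non-negative")
    d_tot = sum(data)
    if d_tot <= 0:
        raise ValueError("sum of data elements must be positive")
    fractions = [d / d_tot for d in data]
    # Assumed convention: elements with f_i = 0 are skipped (0 ln 0 taken as 0).
    numerator = sum(f * math.log(f) for f in fractions if f > 0)
    return numerator / math.log(1.0 / n)


e_uniform = informational_entropy([5, 5, 5, 5])     # perfectly uniform data
e_skewed = informational_entropy([8, 1, 1])         # hypothetical uneven data
e_single = informational_entropy([1, 0, 0, 0])      # all weight on one element
e_rescaled = informational_entropy([80, 10, 10])    # same shape as [8, 1, 1]

print(e_uniform, e_skewed, e_single, e_rescaled)
```

Because each element is divided by d_tot, multiplying the whole data set by a constant leaves E unchanged, so E reflects only the uniformity of the distribution and not its overall scale.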
Prior art methods, such as neural networks, statistical regression, and decision tree methods, have certain inherent limitations. Although neural networks and other statistical regression methods have been used for categorical modeling, they are much better suited to quantitative modeling, due to the continuous non-linear sigmoid function used within the nodes of the network. Decision trees are best suited for categorical modeling, due to their inability to make accurate quantitative predictions of continuous output values.