Various machine learning algorithms are known, including a class known as "boosting" algorithms. A learning method based on the AdaBoost method, which is a type of boosting algorithm, is outlined below. Hereinafter, unless otherwise described, boosting refers to AdaBoost.
The related documents in this field include the following: Y. Freund and L. Mason, 1999, "The alternating decision tree learning algorithm", In Proc. of 16th ICML, pp. 124-133; R. E. Schapire and Y. Singer, 1999, "Improved boosting algorithms using confidence-rated predictions", Machine Learning, 37 (3): 297-336; R. E. Schapire and Y. Singer, 2000, "Boostexter: A boosting-based system for text categorization", Machine Learning, 39 (2/3): 135-168; and Gerard Escudero, Lluís Màrquez, and German Rigau, 2000, "Boosting applied to word sense disambiguation", In Proc. of 11th ECML, pp. 129-141.
In boosting, a plurality of weak hypotheses (i.e., rules) are generated from training examples with different weights with the use of a given weak learner. The weak hypotheses are repeatedly generated from the training examples while the weights of the training examples are changed, and a final hypothesis, which is a combination of the weak hypotheses, is thereby obtained. A small weight is assigned to an example that can be correctly classified by a previously learned weak hypothesis, while a large weight is assigned to an example that cannot.
This description is based on a boosting algorithm using a rule learner as the weak learner. Hereinafter, such an algorithm is described as a boosting algorithm. The premise of the boosting algorithm will be hereinafter described.
First, the problem addressed by the boosting algorithm will be described. Here, X is assumed to be the set of examples, and the treated label set is assumed to be Y = {−1, +1}. The object of learning is to derive a mapping F: X → Y from learning data S = {(x1, y1), . . . , (xm, ym)}.
Here, |x| denotes the number of kinds of features included in an example x ∈ X. xi ∈ X (1≦i≦m) is assumed to be a feature set comprising |xi| kinds of features. A feature set comprising "k" features is described as a "k-feature set". Further, yi ∈ Y is the class label of the i-th feature set of S.
FT = {f1, f2, . . . , fM} is assumed to be the set of "M" kinds of features handled by the boosting algorithm. Each feature of each example xi satisfies xi,j ∈ FT (1≦j≦|xi|). The boosting algorithm can also handle binary vectors; in the following example, however, each feature is represented by a character string.
A case where a feature set includes another feature set is defined as follows:
Definition 1:
In two feature sets x and x′, when x′ has all the features of x, x is called a partial feature set of x′ and is described as follows: x ⊆ x′
Further, the rule is defined based on the concept of real-valued prediction and abstaining (RVPA) used in "Boostexter: A boosting-based system for text categorization", Machine Learning, 39 (2/3): 135-168, 2000 by R. E. Schapire and Y. Singer. In RVPA, when an input feature set fits the condition of a rule, a confidence value represented by a real number is returned; when the input feature set does not fit the condition, "0" is returned. The weak hypothesis for classification of the feature sets is defined as follows:
Definition 2:
A feature set “f” is a rule, and “x” is the input feature set. When a real number “c” is the confidence value of the rule “f”, the application of the rule is defined as follows:
h_{\langle f, c\rangle}(x) = \begin{cases} c & \text{if } f \subseteq x \\ 0 & \text{otherwise} \end{cases}
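Under Definition 2, a rule either outputs its confidence value or abstains. This behavior can be illustrated with a minimal Python sketch (the function name and the set representation of feature sets are illustrative, not from the original):

```python
# Sketch of the weak hypothesis h_<f,c>(x) of Definition 2.
# A rule is a feature set f with confidence value c: it returns c when
# f is a subset of the input feature set x, and 0 (abstains) otherwise.
def weak_hypothesis(f, c, x):
    return c if set(f) <= set(x) else 0.0

# The rule <{"a"}, 1.28> fires on {"a", "b", "c"} but abstains on {"c", "d"}.
print(weak_hypothesis({"a"}, 1.28, {"a", "b", "c"}))  # 1.28
print(weak_hypothesis({"a"}, 1.28, {"c", "d"}))       # 0.0
```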
In rule learning based on boosting, a combination of "T" rule feature sets and their confidence values (⟨f1, c1⟩, . . . , ⟨fT, cT⟩) is obtained by learning with the weak learner over "T" boosting rounds, and thus "F" is defined as follows:
F(x) = \mathrm{sign}\left(\sum_{t=1}^{T} h_{\langle f_t, c_t\rangle}(x)\right)
wherein sign (x) is the function that returns 1 when "x" is not less than 0, and returns −1 otherwise.
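The final hypothesis can be sketched the same way; here `rules` is a hypothetical list of ⟨f, c⟩ pairs (all names illustrative):

```python
# Sketch of F(x) = sign(sum over t of h_<f_t,c_t>(x)).
def weak_hypothesis(f, c, x):
    return c if set(f) <= set(x) else 0.0

def final_hypothesis(rules, x):
    """rules: list of (feature_set, confidence) pairs <f_t, c_t>."""
    total = sum(weak_hypothesis(f, c, x) for f, c in rules)
    return 1 if total >= 0 else -1   # sign(z): 1 when z >= 0, else -1

rules = [({"a"}, 1.28), ({"d"}, -0.81)]
print(final_hypothesis(rules, {"a", "b"}))  # 1
print(final_hypothesis(rules, {"d", "e"}))  # -1
```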
The weak learner derives a rule "ft" and the confidence value "ct" with the use of the learning data S = {(xi, yi)} (1≦i≦m) and the weights {wt,1, . . . , wt,m} of the training examples at the "t"-th boosting round. wt,i (0 < wt,i) is the weight of the "i"-th (1≦i≦m) example (xi, yi) at the "t"-th (1≦t≦T) boosting round.
The weak learner selects, as the rule, the feature set “f” and the confidence value “c” minimizing the following formula based on the given learning data and the weight of the training example:
\sum_{y \in \{-1,+1\}} W_{t,y}(f)\,\exp(-y\,c) + W_t(\neg f) \qquad (1)

where

W_{t,y}(f) = \sum_{i=1}^{m} w_{t,i}\,[[\,f \subseteq x_i \wedge y_i = y\,]]

W_t(\neg f) = \sum_{i=1}^{m} w_{t,i} - W_{t,+1}(f) - W_{t,-1}(f)
wherein [[π]] is 1 when a proposition π is satisfied, and 0 otherwise.
The formula (1) is used as the criterion for selecting rules because the upper bound of the training error of a boosting-based learning algorithm is associated with the sum of the weights of the examples.
When the formula (1) is minimized by a certain rule “f”, the confidence value “c” at that time is as follows:
c = \frac{1}{2}\ln\left(\frac{W_{t,+1}(f)}{W_{t,-1}(f)}\right) \qquad (2)
The formula (2) is substituted into the formula (1), whereby the following formula is obtained:
\sum_{i=1}^{m} w_{t,i} - \left(\sqrt{W_{t,+1}(f)} - \sqrt{W_{t,-1}(f)}\right)^2 \qquad (3)
Based on the formula (3), minimizing the formula (1) is understood to be equivalent to selecting the feature set "f" that maximizes the score defined as follows:
\mathrm{score}(f) \overset{\mathrm{def}}{=} \left|\sqrt{W_{t,+1}(f)} - \sqrt{W_{t,-1}(f)}\right| \qquad (4)
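The relation between formulas (1), (2), and (3) can be checked numerically. The sketch below (with illustrative names) evaluates formula (1) for a feature with W_{t,+1} = 2, W_{t,−1} = 1, and remaining weight 1, and confirms that at the optimal "c" of formula (2) it equals formula (3):

```python
import math

# Formula (1) at a given c: examples containing f contribute exp(-y*c),
# all other examples contribute their weight unchanged.
def Z(w_pos, w_neg, w_rest, c):
    return w_pos * math.exp(-c) + w_neg * math.exp(c) + w_rest

w_pos, w_neg, w_rest = 2.0, 1.0, 1.0
c_opt = 0.5 * math.log(w_pos / w_neg)   # formula (2)
# Formula (3): total weight minus (sqrt(W+) - sqrt(W-))^2.
z3 = (w_pos + w_neg + w_rest) - (math.sqrt(w_pos) - math.sqrt(w_neg)) ** 2
print(abs(Z(w_pos, w_neg, w_rest, c_opt) - z3) < 1e-12)  # True
print(Z(w_pos, w_neg, w_rest, 0.0) > z3)                 # True: c_opt minimizes (1)
```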
Next, the processing for updating the weight of each example with the use of ⟨ft, ct⟩ will be described. There are two cases: one where the weights are normalized so that the sum of all weights is 1, and one where they are not.
When the weight is normalized, the weight w_{t+1,i} in the "t+1"-th round is defined as follows:
w_{t+1,i} = \frac{w_{t,i}\,\exp(-y_i\,h_{\langle f_t, c_t\rangle}(x_i))}{Z_t}, \qquad Z_t = \sum_{i=1}^{m} w_{t,i}\,\exp(-y_i\,h_{\langle f_t, c_t\rangle}(x_i)) \qquad (5)
When the weight is not normalized, the weight w_{t+1,i} in the "t+1"-th round is defined as follows:

w_{t+1,i} = w_{t,i}\,\exp(-y_i\,h_{\langle f_t, c_t\rangle}(x_i)) \qquad (6)
The initial value w_{1,i} of the normalized weight is 1/m (where m is the number of training examples), and the initial value w_{1,i} of the un-normalized weight is 1.
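As a concrete illustration of the un-normalized update of formula (6), the following sketch (illustrative names; the four training examples anticipate the FIG. 2 walkthrough below) applies one round with the hypothetical rule ⟨{"a"}, 1.28⟩:

```python
import math

def weak_hypothesis(f, c, x):
    return c if f <= x else 0.0

# Formula (6): w_{t+1,i} = w_{t,i} * exp(-y_i * h_<f_t,c_t>(x_i)).
def update_weights(weights, examples, f, c):
    return [w * math.exp(-y * weak_hypothesis(f, c, x))
            for w, (x, y) in zip(weights, examples)]

S = [({"a","b","c"}, +1), ({"c","d"}, -1), ({"a","c"}, +1), ({"a","b"}, +1)]
w = [1.0, 1.0, 1.0, 1.0]
# The three correctly classified examples drop to exp(-1.28), about 0.28
# (0.27 in FIG. 7, where the unrounded confidence value is used);
# the abstained-on second example keeps weight 1.
print(update_weights(w, S, {"a"}, 1.28))
```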
When a feature appears sparsely (that is, in few examples), W_{t,+1}(f) or W_{t,−1}(f) is a very small value or 0. To avoid this, a value ε for smoothing is introduced.
Namely, the formula (2) is transformed as follows:
c = \frac{1}{2}\ln\left(\frac{W_{t,+1}(f) + \varepsilon}{W_{t,-1}(f) + \varepsilon}\right) \qquad (7)
For example, ε=1/m or ε=1 may be used.
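With ε = 1/m, the smoothed confidence of formula (7) can be sketched as follows (illustrative names):

```python
import math

# Formula (7): c = (1/2) ln((W_{t,+1}(f) + eps) / (W_{t,-1}(f) + eps)).
def confidence(w_pos, w_neg, eps):
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

# With W_{t,+1}(f) = 3, W_{t,-1}(f) = 0 and m = 4 examples (eps = 0.25),
# the confidence value is (1/2) ln(3.25 / 0.25), about 1.28.
print(round(confidence(3.0, 0.0, 1.0 / 4), 2))  # 1.28
```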
In the basic boosting described above, when the number of candidates of the rules (that is, the number of features) and the generation frequency of the rules (that is, the number of rounds of repetition processing) are large, the learning time becomes very long, which is a problem.
Therefore, a method in which learning is performed using only a part of the rule candidates has been considered. For example, sets of rule candidates (also called buckets) are generated in advance based on measures such as frequency and entropy, and one rule is selected from one set in each round. Hereinafter, the processing contents of this method will be described using FIGS. 1 to 11.
First, the learning data "S" including "m" examples, which are combinations of a feature set "xi" including one or more features and a label "yi" of −1 or +1: S={(x1, y1), (x2, y2), . . . , (xm, ym)}, an initial value DI(i)=1 (1≦i≦m) of the "m" weights corresponding to the "m" examples, an iteration frequency "N", a variable I=1 for counting the iterations, the number of buckets "M" (sets of rule candidates), and a variable b=1 (1≦b≦M) of a bucket ID are set (at S101). In order to promote understanding, an example of processing the learning data in FIG. 2 will be described. FIG. 2 includes four training examples, each with weight 1: the first includes the features "a", "b", and "c" and the label +1; the second includes the features "c" and "d" and the label −1; the third includes the features "a" and "c" and the label +1; and the fourth includes the features "a" and "b" and the label +1.
Next, the features included in the learning data "S" are extracted as the rule candidates. The weight of each feature is calculated as the sum of the weights of the training examples containing it, and the features are distributed to "M" buckets (B[1], . . . , B[M]) based on their weights (at S103). The feature "a" is included in the first, third, and fourth training examples, so its weight equals 3. Likewise, the feature "b" is included in the first and fourth training examples, so its weight equals 2. The feature "c" is included in the first, second, and third training examples, so its weight equals 3. The feature "d" is included only in the second training example, so its weight equals 1. These results are compiled so that the features and the weights of the features depicted in FIG. 3 are obtained. The features are sorted in descending order based on their weights, and the result depicted in FIG. 4 is obtained: the order "a", "c", "b", "d". If M=2, the features "a", "c", "b", and "d" are alternately distributed to the buckets 1, 2, 1, and 2, respectively. Thus, as depicted in FIG. 5, the bucket 1 includes the features "a" and "b", and the bucket 2 includes the features "c" and "d".
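The bucket construction step (S103) can be sketched as follows. The function name and the tie-breaking rule for features with equal weight (alphabetical here) are assumptions; the text only specifies descending order of weight:

```python
from collections import defaultdict

# S103 sketch: weight each feature by the summed weights of the training
# examples containing it, sort by descending weight, and deal the
# features round-robin into M buckets.
def build_buckets(examples, weights, M):
    feat_weight = defaultdict(float)
    for (x, _), w in zip(examples, weights):
        for f in x:
            feat_weight[f] += w
    # Descending weight; ties broken alphabetically (assumed).
    ordered = sorted(feat_weight, key=lambda f: (-feat_weight[f], f))
    buckets = [[] for _ in range(M)]
    for j, f in enumerate(ordered):
        buckets[j % M].append(f)
    return buckets

# The four training examples of FIG. 2, each with weight 1.
S = [({"a","b","c"}, +1), ({"c","d"}, -1), ({"a","c"}, +1), ({"a","b"}, +1)]
print(build_buckets(S, [1.0] * 4, 2))  # [['a', 'b'], ['c', 'd']] as in FIG. 5
```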
Next, a gain of the rule candidates (that is, the features) included in a bucket B[b] is calculated according to a weight "DI(i)", and the rule candidate with the maximum gain value is selected as a rule "hI" (at S105). The gain is defined as follows with respect to a rule candidate "f":

gain(f) = |sqrt(W(f, +1)) − sqrt(W(f, −1))|
Here, "W(f, LABEL)" is the sum of the weights of the training examples in which the rule candidate "f" appears and whose label is the given "LABEL" (+1 or −1). "sqrt(x)" represents x^{1/2}, and |x| represents the absolute value of "x".
For example, when the rule candidates "a" and "b" included in the bucket 1 are processed, according to FIG. 2, gain(a) = |sqrt(3) − sqrt(0)| = 3^{1/2}. Likewise, gain(b) = |sqrt(2) − sqrt(0)| = 2^{1/2}. These results are compiled as depicted in FIG. 6. Thus, the rule candidate "a", whose gain is larger than that of the rule candidate "b", is selected as the rule "hI".
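The selection step (S105) can be sketched as follows (illustrative names):

```python
import math

# gain(f) = |sqrt(W(f,+1)) - sqrt(W(f,-1))| under the current weights.
def gain(f, examples, weights):
    w_pos = sum(w for (x, y), w in zip(examples, weights) if f in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(examples, weights) if f in x and y == -1)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))

# S105 sketch: the candidate with the maximum gain in the bucket wins.
def select_rule(bucket, examples, weights):
    return max(bucket, key=lambda f: gain(f, examples, weights))

S = [({"a","b","c"}, +1), ({"c","d"}, -1), ({"a","c"}, +1), ({"a","b"}, +1)]
w = [1.0, 1.0, 1.0, 1.0]
print(select_rule(["a", "b"], S, w))  # a  (sqrt(3) > sqrt(2), as in FIG. 6)
```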
Next, a confidence value "αI" of the rule "hI" is calculated using the weight "DI(i)", and the rule "hI" and the confidence value "αI" are registered in a rule data storage unit (at S107). The confidence value "αI" is calculated based on the formula (7) with c = αI. For example, the confidence value of the rule "a" is calculated to be 1.28.
Further, the weight "DI(i)" is updated to a weight "DI+1(i)" based on the rule "hI" and the confidence value "αI" (at S109). The weight for the next round is calculated by the formula (5) or (6), where w_{t,i} = DI(i). When the formula (6) is used, the weights depicted in FIG. 2 are updated to those depicted in FIG. 7: the weights of the first, third, and fourth training examples are updated to 0.27.
Then, "I" is incremented by one (at S111), and "b" is incremented by one (at S113). However, when "b" exceeds "M", "b" is returned to 1.
Thereafter, whether "I" exceeds "N" is judged (at S115). If "I" does not exceed "N", the processing returns to S105; otherwise, the processing is terminated.
In the above example, after shifting to the processing of B[2], when the gains of the rule candidates "c" and "d" included in the bucket 2 are calculated, the values depicted in FIG. 8 are obtained: gain(c) = |sqrt(0.54) − sqrt(1)| = 0.25, and gain(d) = |sqrt(0) − sqrt(1)| = 1. According to this result, the rule candidate "d" is selected as a rule "h2".
Next, the confidence value of the rule “d” is calculated in accordance with the formula (7), whereby −0.81 is obtained. When the weight of the training example at the next stage is calculated in accordance with the formula (6), using the rule “d” and the confidence value of −0.81, the value depicted in FIG. 9 is obtained. Only the weight of the second training example including the feature “d” is updated to 0.44.
Further, after shifting to the processing of B[1], when the gains of the rule candidates "a" and "b" included in the bucket 1 are calculated, the values depicted in FIG. 10 are obtained. Also in this case, the feature "a" has the larger gain, and therefore, the feature "a" is selected as the rule. The confidence value of the rule "a" is calculated in accordance with the formula (7), whereby 0.73 is obtained.
According to the above processing, the pairs of the rule and the confidence value registered in the rule data storage unit are as depicted in FIG. 11.
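The whole loop of S101 to S115 can be sketched end-to-end. The sketch below assumes un-normalized weights (formula (6)), ε = 1/m in formula (7), and M = 2 buckets; all names are illustrative. It reproduces the rule sequence of FIG. 11, with small differences in the printed confidences due to rounding:

```python
import math

def learn(examples, N, M):
    m = len(examples)
    eps = 1.0 / m
    D = [1.0] * m                                   # S101: initial weights

    # S103: distribute features to M buckets by descending feature weight.
    feat_weight = {}
    for (x, _), w in zip(examples, D):
        for f in x:
            feat_weight[f] = feat_weight.get(f, 0.0) + w
    ordered = sorted(feat_weight, key=lambda f: (-feat_weight[f], f))
    buckets = [[] for _ in range(M)]
    for j, f in enumerate(ordered):
        buckets[j % M].append(f)

    def W(f, label):                                # weight sum under current D
        return sum(w for (x, y), w in zip(examples, D) if f in x and y == label)

    rules, b = [], 0
    for _ in range(N):
        # S105: rule with the maximum gain in the current bucket.
        h = max(buckets[b],
                key=lambda f: abs(math.sqrt(W(f, +1)) - math.sqrt(W(f, -1))))
        # S107: confidence value by the smoothed formula (7).
        alpha = 0.5 * math.log((W(h, +1) + eps) / (W(h, -1) + eps))
        rules.append((h, alpha))
        # S109: un-normalized weight update, formula (6).
        D = [w * math.exp(-y * (alpha if h in x else 0.0))
             for (x, y), w in zip(examples, D)]
        b = (b + 1) % M                             # S111/S113: next bucket
    return rules

S = [({"a","b","c"}, +1), ({"c","d"}, -1), ({"a","c"}, +1), ({"a","b"}, +1)]
for f, a in learn(S, N=3, M=2):
    print(f, round(a, 2))   # rules "a", "d", "a", confidences close to FIG. 11
```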
When the learning is finished, classification (that is, the judgment of whether a case is −1 or +1) is performed as follows. When "a b e" is input as an input example, the sum of the confidence values 1.28 + 0.73 = 2.01 is obtained from the first and third records of FIG. 11. Because the sum of the confidence values is positive, "a b e" is classified as +1.
Meanwhile, when "d e" is input as the input example, the sum of the confidence values −0.81 is obtained from the second record of FIG. 11. Because the sum of the confidence values is negative, "d e" is classified as −1.
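The classification just described can be sketched with the rule list of FIG. 11 (pairs as given in the text; function names are illustrative):

```python
# Classification: sum the confidence values of all rules whose feature
# set is contained in the input, then take the sign of the sum.
rules = [({"a"}, 1.28), ({"d"}, -0.81), ({"a"}, 0.73)]

def classify(x, rules):
    total = sum(c for f, c in rules if f <= x)
    return +1 if total >= 0 else -1

print(classify({"a", "b", "e"}, rules))  # 1   (1.28 + 0.73 = 2.01 > 0)
print(classify({"d", "e"}, rules))       # -1  (-0.81 < 0)
```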