Among known machine learning algorithms, there is a family of algorithms referred to as “Boosting.” The learning technique discussed here is based on “AdaBoost,” which is one of the Boosting algorithms. AdaBoost is described in, for example, a paper by Y. Freund and L. Mason (Y. Freund and L. Mason, “The alternating decision tree learning algorithm”, In Proc. of 16th ICML, pages 124-133, 1999), and papers by R. E. Schapire and Y. Singer (R. E. Schapire and Y. Singer, “Improved boosting using confidence-rated predictions”, Machine Learning, 37(3): pages 297-336, 1999) and (R. E. Schapire and Y. Singer, “BoosTexter: A boosting-based system for text categorization”, Machine Learning, 39 (2/3): pages 135-168, 2000). In the following, Boosting refers to AdaBoost unless otherwise specified.
In Boosting, a given weak learner generates a plurality of weak hypotheses (e.g., rules) from training examples having different weights, and a final hypothesis is created as a combination of the generated weak hypotheses. The weak hypotheses are generated from the training examples one after another, and the weights of the examples are changed each time. A small weight is assigned to an example that the already learned weak hypotheses classify correctly, and a large weight is assigned to an example that they classify incorrectly.
The weights of the training examples are updated so as to reduce the upper bound of the training error, which is the number of errors for the training examples. In Boosting, this upper bound is the sum of the weights of the examples, and it is greater than or equal to the actual number of training errors. Lowering the upper bound in turn lowers the number of training errors itself.
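The relation between the sum of the weights and the number of training errors can be written out as follows. This is a sketch under assumptions: the weights start at w_{1,i}=1 and are updated multiplicatively as w_{t+1,i} = w_{t,i} exp(−y_i h_t(x_i)), where h_t(x_i) denotes the real-valued prediction of the t-th weak hypothesis (the confidence-rated form of Schapire and Singer). The symbols h_t, F, and T are introduced here for illustration only.

```latex
\begin{aligned}
\sum_{i=1}^{m} \mathbf{1}\bigl[\, y_i F(x_i) \le 0 \,\bigr]
  &\le \sum_{i=1}^{m} \exp\bigl(-y_i F(x_i)\bigr),
  \qquad F(x) = \sum_{t=1}^{T} h_t(x) \\
  &= \sum_{i=1}^{m} \prod_{t=1}^{T} \exp\bigl(-y_i h_t(x_i)\bigr)
   = \sum_{i=1}^{m} w_{T+1,i}
\end{aligned}
```

The first step holds because exp(−z) is at least 1 whenever z ≤ 0 and is positive otherwise. The last step applies the assumed update T times starting from a weight of 1.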
The Boosting algorithm used in the present description takes a rule learner as the weak learner, and is hereinafter referred to simply as the Boosting algorithm. First, a simple Boosting algorithm will be described with reference to FIG. 1. The following are set (at S101): a training sample S={(x1, y1), (x2, y2), . . . , (xm, ym)} including m examples, each of which is a combination of a feature-set xi that includes one or more features, with a label yi that is either −1 or +1; m initial values w1,i=1 (1≦i≦m) of weights that correspond to the m examples; an iteration frequency N; and a variable t=1 for counting the iteration frequency.
Then, a score (also referred to as gain) of each of the features included in the training sample is calculated according to the weights wt,i of the examples, and the feature whose score is the maximum is extracted as a rule ft (at S103). Here, wt,i is the weight of the i-th example at round t. The calculation of the scores is performed by using, for example, equation (4) in Formula 6 as will be described below. Note that the training sample may include about 100,000 features and about 100,000 examples. Thus, calculating the scores may take considerable time, yet only one feature is selected.
Further, a confidence value ct of the rule ft is calculated by using the weights wt,i of the examples, and then the rule ft and the confidence value ct are stored as the t-th rule and confidence value (at S105). The calculation of the confidence value ct is performed by using, for example, equation (2) in Formula 4 or equation (7) in Formula 9 as will be described below.
Thereafter, new weights wt+1,i (1≦i≦m) are calculated by using the weights wt,i of the examples, the rule ft, and the confidence value ct, and are registered to update the weights (at S107). The calculation of the new weights wt+1,i is performed by using, for example, equation (6) in Formula 8 as will be described below.
Then, the value of the variable t is incremented by one (at S109). When the value of the variable t is smaller than the iteration frequency N (at S111: Yes), the processing returns to S103. On the other hand, when the value of the variable t reaches the iteration frequency N (at S111: No), the processing is ended.
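The loop S101 to S111 can be sketched as follows. This is a minimal illustration, not the described implementation. It assumes a score of |sqrt(W+) − sqrt(W−)|, a smoothed confidence value of 0.5·ln((W+ + ε)/(W− + ε)) with ε = 1/m, and a weight update of w·exp(−y·c) for examples containing the rule, where W+ and W− are the total weights of the positive and negative examples containing the feature. These forms stand in for the equations referred to above and are assumptions; the names `simple_boost`, `toy_sample`, and the helper functions are hypothetical.

```python
import math

def weight_sums(feature, sample, weights):
    """Total weight of the examples that contain `feature`, split by label."""
    w_pos = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == -1)
    return w_pos, w_neg

def score(feature, sample, weights):
    # Assumed gain: |sqrt(W+) - sqrt(W-)|
    w_pos, w_neg = weight_sums(feature, sample, weights)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))

def confidence(feature, sample, weights, eps):
    # Assumed smoothed confidence: 0.5 * ln((W+ + eps) / (W- + eps))
    w_pos, w_neg = weight_sums(feature, sample, weights)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

def simple_boost(sample, n):
    m = len(sample)
    weights = [1.0] * m                                  # S101: w_{1,i} = 1
    features = sorted({f for x, _ in sample for f in x})
    rules = []
    t = 1                                                # S101
    while True:
        # S103: score every feature, keep only the best one
        f_t = max(features, key=lambda f: score(f, sample, weights))
        # S105: confidence value of the selected rule
        c_t = confidence(f_t, sample, weights, eps=1.0 / m)
        rules.append((f_t, c_t))
        # S107: assumed update, applied only to examples containing the rule
        weights = [w * math.exp(-y * c_t) if f_t in x else w
                   for (x, y), w in zip(sample, weights)]
        t += 1                                           # S109
        if not t < n:                                    # S111: stop once t reaches N
            break
    return rules, weights

# Hypothetical four-example sample: (feature-set, label)
toy_sample = [({"x", "y"}, +1), ({"x"}, +1), ({"y", "z"}, -1), ({"z"}, -1)]
rules, weights = simple_boost(toy_sample, n=3)
print(rules)
print(weights)
```

Every pass through the loop re-scores all features but stores a single rule, which is the cost that matters when the numbers of features and examples are large.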
By using the combinations of the rules and the confidence values, which are obtained as a result of the above described processing, it is determined whether the label of a new input is −1 or +1.
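One way to picture this determination is a sign test on the summed confidence values of the rules that occur in the input. The sketch below assumes that form; the function name `predict`, the example rules, and the treatment of a zero sum as +1 are all hypothetical.

```python
def predict(rules, feature_set):
    """Return +1 or -1 from the summed confidence values of the matching rules.

    `rules` is a list of (feature, confidence) pairs as stored during learning.
    """
    total = sum(c for f, c in rules if f in feature_set)
    # Assumption: a sum of exactly zero is mapped to +1.
    return +1 if total >= 0 else -1

# Hypothetical learned rules
example_rules = [("cheap", -0.8), ("meeting", 0.5), ("invoice", 0.4)]

print(predict(example_rules, {"meeting", "invoice"}))  # 0.5 + 0.4 is positive -> 1
print(predict(example_rules, {"cheap", "meeting"}))    # -0.8 + 0.5 is negative -> -1
```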
As described above, only one combination of a rule and a confidence value is generated in one iteration. Thus, there is a problem that the processing time increases enormously as the number of features and the number of training examples increase.
For this reason, a high-speed version of the Boosting algorithm has been considered. This high-speed version, illustrated in FIG. 2, is based on a paper by Fabrizio Sebastiani, Alessandro Sperduti, and Nicola Valdambrini (“An improved boosting algorithm and its application to text categorization”, In Proc. of International Conference on Information and Knowledge Management, pages 78-85, 2000). First, the following are set (at S151): a training sample S={(x1, y1), (x2, y2), . . . , (xm, ym)} including m examples, each of which is a combination of a feature-set xi that includes one or more features, with a label yi that is either −1 or +1; m initial values w1,i=1 (1≦i≦m) of weights that correspond to the m examples; an iteration frequency N; the number ν of rules learned at one time; and a variable t=1 for counting the iteration frequency. To facilitate understanding, the processing is described for the training sample illustrated in FIG. 3, which includes three training examples. The first training example includes a feature-set consisting of the features a, b, and c, and its label is +1. The second training example includes a feature-set consisting of the features a, b, c, and d, and its label is −1. The third training example includes a feature-set consisting of the features a, b, and d, and its label is +1. The weight of each of the three training examples is 1.
Then, a score (also referred to as gain) of each of the features included in the training sample is calculated according to the weights wt,i of the examples, and ν features are extracted as rules f′j (1≦j≦ν) in descending order of the scores (at S153). The calculation of the score is performed by using, for example, equation (4) in Formula 6 as will be described below. When the scores are calculated from the data illustrated in FIG. 3, the result illustrated in FIG. 4 is obtained. That is, the scores of the features a and b are “0.414” while the scores of the features c and d are “0”. Here, with ν=3, it is assumed that the features a, b, and c are selected.
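The values quoted for FIG. 4 can be checked with a short sketch. It assumes the score of a feature is |sqrt(W+) − sqrt(W−)|, where W+ and W− are the total weights of the positive and negative examples containing the feature. This form is an assumption standing in for the equation referred to above, and it gives 0.414 and 0 on this data. Breaking the tie between c and d alphabetically is also an assumption.

```python
import math

# Training sample as described for FIG. 3: (feature-set, label), all weights 1
sample = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
weights = [1.0, 1.0, 1.0]

def score(feature, sample, weights):
    # Assumed gain: |sqrt(W+) - sqrt(W-)|
    w_pos = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == -1)
    return abs(math.sqrt(w_pos) - math.sqrt(w_neg))

features = sorted({f for x, _ in sample for f in x})
scores = {f: score(f, sample, weights) for f in features}

nu = 3
# sorted() is stable, so features with equal scores keep alphabetical order
selected = sorted(features, key=lambda f: scores[f], reverse=True)[:nu]

print({f: round(s, 3) for f, s in scores.items()})
print(selected)
```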
Then, the confidence values c′j corresponding to the ν rules f′j are collectively calculated by using the weights wt,i of the examples (at S155). The calculation of the confidence values c′j is performed by using, for example, equation (2) in Formula 4 or equation (7) in Formula 9 as will be described below. At S155, all ν confidence values c′j are calculated by using the same weights wt,i. In the above described example, as illustrated in FIG. 5, the confidence values of the rules a and b are calculated to be 0.279, while the confidence value of the rule c is calculated to be 0.
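The batch calculation at S155 can be sketched as follows, assuming a smoothed confidence value of 0.5·ln((W+ + ε)/(W− + ε)). Both this form and the smoothing value ε = 1/m are assumptions made for illustration; with them, the result for the rules a and b is approximately 0.2798, in line with the 0.279 quoted for FIG. 5.

```python
import math

# Training sample as described for FIG. 3, with the round-start weights
sample = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
weights = [1.0, 1.0, 1.0]
eps = 1.0 / len(sample)  # assumed smoothing value

def confidence(feature, sample, weights, eps):
    # Assumed smoothed confidence: 0.5 * ln((W+ + eps) / (W- + eps))
    w_pos = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == +1)
    w_neg = sum(w for (x, y), w in zip(sample, weights) if feature in x and y == -1)
    return 0.5 * math.log((w_pos + eps) / (w_neg + eps))

# All nu confidence values come from the same, not yet updated, weights
selected = ["a", "b", "c"]
confidences = {f: confidence(f, sample, weights, eps) for f in selected}
print(confidences)
```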
Here, j is initialized to 1 (at S157). Then, new weights wt+1,i (1≦i≦m) are calculated by using the weights wt,i of the examples, the rule f′j, and the confidence value c′j, and are registered to update the weights (at S159). The calculation of the new weights wt+1,i is performed by using, for example, equation (6) in Formula 8 as will be described below. In the above described example, the weights are first calculated for the rule a. As illustrated in FIG. 6, the weights of the first and third training examples are updated to 0.75, while the weight of the second training example is updated to 1.32. Then, the rule f′j and the confidence value c′j are registered as the t-th rule and confidence value (at S161).
The value of the variable t and the value of the variable j are each incremented by one (at S163), and it is determined whether or not the value of j is equal to or less than the value of ν (at S165). When the value of j is equal to or less than the value of ν, the processing returns to S159.
When j=2 and S159 is performed, the weights are calculated for the rule b in the above described example, and the new weights are registered as illustrated in FIG. 7. That is, the weights of the first and third training examples are updated to 0.56, while the weight of the second training example is updated to 1.74.
Further, when j=3 and S159 is performed, the weights are calculated for the rule c, and the new weights are registered as illustrated in FIG. 8. However, since the confidence value of the rule c is 0, the weights in FIG. 8 are the same as those in FIG. 7.
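The three successive updates for j=1, 2, and 3 can be traced with the sketch below. It assumes the update multiplies the weight of each example containing the rule by exp(−y·c), and it takes the confidence values 0.279, 0.279, and 0 from the text rather than recomputing them. With unrounded arithmetic the second step gives about 0.57 and 1.75 instead of the quoted 0.56 and 1.74; the quoted values match squaring the two-decimal values 0.75 and 1.32, which suggests that the figures carry rounded intermediate values.

```python
import math

# Training sample as described for FIG. 3
sample = [({"a", "b", "c"}, +1), ({"a", "b", "c", "d"}, -1), ({"a", "b", "d"}, +1)]
weights = [1.0, 1.0, 1.0]

# Rules and confidence values as quoted for FIG. 5; the confidence values were
# all computed before any of the updates below is applied
rules = [("a", 0.279), ("b", 0.279), ("c", 0.0)]

history = []
for feature, c in rules:                      # j = 1 .. nu
    # Assumed update (S159): w * exp(-y * c) for examples containing the rule
    weights = [w * math.exp(-y * c) if feature in x else w
               for (x, y), w in zip(sample, weights)]
    history.append(list(weights))

for step, w in zip(("a", "b", "c"), history):
    print(step, [round(v, 4) for v in w])
```

The last two entries of `history` are identical because a rule with a confidence value of 0 multiplies every weight by exp(0) = 1.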
On the other hand, when the value of j exceeds the value of ν at S165, it is determined whether or not t is smaller than the iteration frequency N (at S167). When t<N, the processing returns to S153. The scores are then calculated again at S153 by using the updated weights, and the values illustrated in FIG. 9 are obtained. That is, the scores of the features a and b become “0.26” while the scores of the features c and d become “0.57”.
On the other hand, when t reaches the iteration frequency N (at S167: No), the processing is ended.
By using the combinations of the rules and the confidence values obtained as a result of the above described processing, it is determined whether the label of a new input is −1 or +1.
By performing the processing illustrated in FIG. 2, a plurality of combinations of the rules and the confidence values can be generated in one iteration, and hence it is possible to shorten the processing time.
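The size of the saving can be estimated with a rough count of score calculations. In the sketch below, the figure of 100,000 features is the order of magnitude mentioned earlier, while the number of rules to learn and the value of ν are hypothetical. The function name `scoring_passes` is also hypothetical.

```python
import math

def scoring_passes(num_rules, nu=1):
    """Number of times all features are re-scored to learn `num_rules` rules
    when `nu` rules are taken from each scoring pass (nu=1: simple algorithm)."""
    return math.ceil(num_rules / nu)

num_features = 100_000  # order of magnitude mentioned in the description
num_rules = 1_000       # hypothetical number of rules to learn
nu = 10                 # hypothetical number of rules learned at one time

simple_cost = scoring_passes(num_rules) * num_features
fast_cost = scoring_passes(num_rules, nu) * num_features
print(simple_cost, fast_cost)  # 100000000 10000000
```

The count covers only the scoring step. The weight update is still performed once per rule in both versions.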