The present invention relates to a system for supporting a user's behavior, and particularly relates to a system which supports a user's behavior by generating a behavioral decision function.
Classification learning has been studied as a basic technique of data mining. An object of classification learning is to output the behavior that should be adopted for a certain target in the future, based on information showing the results of behavior adopted for targets in the past (hereinafter referred to as training data). By applying this technique, it is possible to suggest to a user the behavior that is statistically most appropriate in light of past events (e.g., the behavior that minimizes the number of errors), and thereby to support the user's behavior.
The classification learning can be applied to various technical fields as follows:
(1) Diagnosis in the Medical Field
    Target: test result of patient    Behavior: whether or not a certain treatment should be performed
Training data in this example is information showing whether or not a certain treatment was successful when the treatment was performed in the past on a patient having a certain test result. According to the classification learning, it is possible to predict the appropriateness of a treatment on a future patient based on such training data.
(2) Credit Assessment in the Financial Field
    Target: credit history of applicant for loan    Behavior: whether or not a loan is granted
Training data in this example is information showing whether or not the debt was collectible when a loan was made in the past to an applicant having a certain credit history. According to the classification learning, it is possible to judge whether or not to grant a loan to a certain applicant in the future based on such training data.
(3) Topic Classification in a Search Engine
    Target: webpages of news    Behavior: classification into economic, sport, and political fields
Training data in this example is information showing whether or not the classification was appropriate when a certain webpage was classified into a certain field in the past. According to the classification learning, a webpage which will be created in the future can be classified appropriately based on such training data.
In general, an object of such classification learning is to accurately predict the behavior to be adopted for a target. In other words, classification learning aims to minimize the number or probability of behavioral errors.
However, minimizing the number of errors alone may not be sufficient for some problems. For example, in example (1) above, there is a clear difference between the loss (hereinafter referred to as a cost) caused by diagnosing a healthy patient with a disease and performing an unnecessary treatment, and the cost caused by leaving a sick patient untreated, resulting in his/her death. Moreover, the cost may differ according to the social status of the patient. Similarly, in example (2) above, the cost of refusing a loan to an excellent applicant is only the lost interest, whereas the cost of granting a loan to a bad applicant may be the entire amount of the loan. In this case as well, the cost differs according to the amount of each loan and the degree to which the applicant is a bad risk.
Cost-sensitive learning has conventionally been proposed as a technique applicable to such cases, in which costs differ for each target and behavior and are unknown at the time a prediction is made (see: N. Abe and B. Zadrozny. An interactive method for multi-class cost-sensitive learning. In Proceedings of ACM SIGKDD Conference, 2004; J. P. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. E. Brodley. Pruning decision trees with misclassification costs. In Proceedings of the 9th European Conference on Machine Learning (ECML), 1998; P. Domingos. MetaCost: A general method for making classifier cost sensitive. In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining, pages 155-164, 1999; C. Elkan. The foundations of cost-sensitive learning. In Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI), pages 973-978, 2001; W. Fan, S. J. Stolfo, J. Zhang, and P. K. Chan. Ada-Cost: Misclassification cost sensitive boosting. In Proceedings of the 16th International Conference on Machine Learning (ICML), pages 97-105, 1999; P. Geibel, U. Bredford, and F. Wysotzki. Perceptron and SVM learning with generalized cost models. Intelligent Data Analysis, 8(5):439-455, 2004; B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of ACM SIGKDD Conference, 2001; B. Zadrozny, J. Langford, and N. Abe. Cost-sensitive learning by cost-proportionate example weighting. In Proceedings of the 3rd International Conference on Data Mining (ICDM), pages 435-442, 2003; and Suzuki. More Advantageous Learning than Accurate Learning: Classification Learning Considering Misclassification Costs (1) (2). Information Processing, 45(4-5), 2004). An object of cost-sensitive learning is not to minimize the rate of behavioral errors but to minimize the expected value of the cost. It can therefore handle a wider range of problems.
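The difference between minimizing the error rate and minimizing the expected cost can be illustrated with the loan example above. The following is only a minimal sketch: the default probability, loan amount, and interest are hypothetical values chosen for illustration and do not come from any of the cited works.

```python
# Hypothetical numbers, chosen only for illustration (not from the source).
P_DEFAULT = 0.2        # assumed probability that the applicant defaults
LOAN_AMOUNT = 1000.0   # assumed principal lost if a granted loan defaults
INTEREST = 100.0       # assumed interest forgone if a good applicant is refused

# Probability that each behavior turns out to be an error.
error_prob = {
    "grant": P_DEFAULT,         # granting is wrong when the applicant defaults
    "refuse": 1.0 - P_DEFAULT,  # refusing is wrong when the applicant would repay
}

# Expected cost of each behavior.
expected_cost = {
    "grant": P_DEFAULT * LOAN_AMOUNT,
    "refuse": (1.0 - P_DEFAULT) * INTEREST,
}

# The behavior with the fewest expected errors and the behavior with the
# lowest expected cost are selected separately so they can be compared.
min_error_behavior = min(error_prob, key=error_prob.get)
min_cost_behavior = min(expected_cost, key=expected_cost.get)

print("fewest errors:", min_error_behavior)
print("lowest expected cost:", min_cost_behavior)
```

Under these assumed values the two criteria select different behaviors, which is the situation cost-sensitive learning is intended to address.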
Cost-sensitive learning is described in more detail below. First, the problem addressed by cost-sensitive learning is defined by the following items (1) to (3).
(1) Cost Function
A cost is an indicator showing, for example, the loss caused as a result of the behavior adopted for a certain target. Assume that X is a set of targets (for example, X=R^M) and that Y is a set of behaviors which can be adopted for the targets. It should be noted that Y is assumed to be a discrete and finite set. The cost caused as a result of adopting behavior y∈Y for a target x∈X is denoted as c(x, y)∈R.
For example, c(x, y) represents how bad the result is when a certain treatment y is performed on a patient having a test result x. If the treatment is appropriate, c(x, y) is small; if it is inappropriate, c(x, y) is large. If the treatment y is extremely inappropriate for the patient and leads to his/her death, the cost becomes very large. Incidentally, the problem setting of early studies (see J. P. Bradford, C. Kunz, R. Kohavi, C. Brunk, and C. E. Brodley. Pruning decision trees with misclassification costs. In Proceedings of the 9th European Conference on Machine Learning (ECML), 1998) handles a simple case in which the cost does not depend on x directly, classes are introduced as latent variables, the cost depends only on the class and the behavior, and the scale of the cost is known in advance. Here, a more general case is handled, in which the cost differs from target to target and the true cost function c(x, y) is unknown (see B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of ACM SIGKDD Conference, 2001).
(2) Behavior Decision Model
As above, X is a set of targets (for example, X=R^M), and Y is a (discrete and finite) set of behaviors which can be adopted for the targets. A function used for deciding the behavior y∈Y for a target x∈X is assumed to take the form of the following equation 1.
$$h(x, y; \theta) : X \times Y \to \mathbb{R} \qquad \text{(Equation 1)}$$
Here, θ is a parameter of the model. In general, using this function, the behavior y′ to be adopted is selected as a single alternative by the following equation 2. h(x, y; θ) may also be subject to a probabilistic constraint as in the following equation 3.
$$y' = \operatorname*{argmax}_{y \in Y} h(x, y; \theta) \qquad \text{(Equation 2)}$$
$$\sum_{y \in Y} h(x, y; \theta) = 1 \quad \text{s.t.} \quad h(x, y; \theta) \ge 0 \qquad \text{(Equation 3)}$$
In other words, when h(x, y; θ) satisfies equation 3, the behavior for a given target x∈X may be decided probabilistically, with h(x, y; θ) interpreted as the probability of adopting behavior y, instead of being decided by equation 2. Furthermore, the behavior decision may be of a resource-distribution type; that is, rather than adopting only one behavior, a resource may be invested across the behaviors in a diversified manner in proportion to h(x, y; θ). In the embodiment of the present invention, however, the behavior is decided as a single alternative by equation 2.
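The decision rules of equations 2 and 3 can be sketched as follows. This is an illustrative sketch only: the behavior names, the linear form of h, the parameter values, and the use of a softmax to satisfy the constraint of equation 3 are all assumptions made for the example and are not prescribed by the description above.

```python
import math
import random

BEHAVIORS = ["treat", "do_not_treat"]  # hypothetical behavior set Y


def h(x, y, theta):
    """Hypothetical linear score h(x, y; theta): one weight vector per behavior."""
    return sum(w * xi for w, xi in zip(theta[y], x))


def decide_argmax(x, theta):
    """Equation 2: select the single behavior with the largest score."""
    return max(BEHAVIORS, key=lambda y: h(x, y, theta))


def normalized_scores(x, theta):
    """Map scores to non-negative values that sum to 1 (constraint of equation 3).

    The softmax used here is an assumption; any mapping satisfying the
    constraint would serve.
    """
    scores = [h(x, y, theta) for y in BEHAVIORS]
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return {y: e / total for y, e in zip(BEHAVIORS, exps)}


def decide_probabilistic(x, theta, rng):
    """Draw one behavior at random with probabilities h(x, y; theta)."""
    probs = normalized_scores(x, theta)
    return rng.choices(BEHAVIORS, weights=[probs[y] for y in BEHAVIORS])[0]


# Hypothetical parameter values and target.
theta = {"treat": [1.0, -0.5], "do_not_treat": [-1.0, 0.5]}
x = [2.0, 1.0]
rng = random.Random(0)

print(decide_argmax(x, theta))
print(normalized_scores(x, theta))
print(decide_probabilistic(x, theta, rng))
```

The same normalized values could also be read as investment proportions in the resource-distribution case.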
In addition, c(x, h(θ)) is assumed to be the cost caused when the behavior for x is decided by using h(x, y; θ). In the case of an alternative action, in which a single behavior is selected, c(x, h(θ)) is given by the following equation 4.
$$c(x, h(\theta)) = c\left(x, \operatorname*{argmax}_{y \in Y} h(x, y; \theta)\right) \qquad \text{(Equation 4)}$$
In the case of a diversified-investment type action, the definition is not necessarily obvious. Here, as a simple case, it is assumed that the cost produced by each action is proportional to the amount invested in it, so that c(x, h(θ)) is given by equation 5.
$$c(x, h(\theta)) = \sum_{y \in Y} h(x, y; \theta)\, c(x, y) \qquad \text{(Equation 5)}$$
(3) Training Data
A target and its costs are assumed to be generated from one and the same probability distribution D defined on X×R^Y, and a set E of N pieces of data sampled from D is assumed to be given. Here, the i-th training data of E is denoted as e^(i)=(x^(i), {c^(i)(x^(i), y)}_{y∈Y}), where x^(i)∈X is the i-th target of the training data, and the cost c^(i)(x^(i), y) is given for each action y∈Y on the i-th target.
With regard to the above problem, a method whose object is to minimize the expected value of the cost has conventionally been used for classification problems in which costs must be taken into consideration. Specifically, it is desirable to decide θ so as to minimize the expected cost (equation 6) with respect to the distribution D of the data; however, since the distribution D is unknown in practice, the parameter θ is instead decided so as to minimize the empirical expected cost (equation 7) (see N. Abe and B. Zadrozny. An interactive method for multi-class cost-sensitive learning. In Proceedings of ACM SIGKDD Conference, 2004; P. Geibel, U. Bredford, and F. Wysotzki. Perceptron and SVM learning with generalized cost models. Intelligent Data Analysis, 8(5):439-455, 2004; and B. Zadrozny and C. Elkan. Learning and making decisions when costs and probabilities are both unknown. In Proceedings of ACM SIGKDD Conference, 2001).
$$C_D(\theta) = E_D\left[c(x, h(\theta))\right] \qquad \text{(Equation 6)}$$
$$C_E(\theta) = \frac{1}{N} \sum_{i=1}^{N} c\left(x^{(i)}, h(\theta)\right) \qquad \text{(Equation 7)}$$
It should be noted that each pair of a target and its costs is considered to be generated from the probability distribution D defined on X×R^Y independently of the other pairs. The set E of N pieces of data sampled from D is given as the training data, in which the i-th training data is e^(i)=(x^(i), {c^(i)(x^(i), y)}_{y∈Y}), x^(i)∈X is the i-th target, and c^(i)(x^(i), y) is the cost of adopting each behavior y∈Y for that target.
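The cost definitions of equations 4 and 5 and the averaged cost of equation 7 can be sketched as follows. This is a minimal illustration under stated assumptions: the training data, the behavior names, and the decision function `h_example` are hypothetical, and the parameter θ is left implicit inside `h_example`.

```python
def cost_single_choice(costs, h_values):
    """Equation 4: cost of the one behavior with the largest h value."""
    best = max(h_values, key=h_values.get)
    return costs[best]


def cost_diversified(costs, h_values):
    """Equation 5: costs weighted by the shares h (assumed >= 0, summing to 1)."""
    return sum(h_values[y] * costs[y] for y in costs)


def empirical_cost(training_data, h, cost_rule):
    """Equation 7: average of the per-example costs over the N training data."""
    total = 0.0
    for x, costs in training_data:
        total += cost_rule(costs, h(x))
    return total / len(training_data)


# Hypothetical training data E: each entry is (x, {behavior: cost}).
E = [
    (0.9, {"grant": 0.0, "refuse": 10.0}),
    (0.2, {"grant": 100.0, "refuse": 0.0}),
    (0.6, {"grant": 50.0, "refuse": 0.0}),
    (0.8, {"grant": 0.0, "refuse": 20.0}),
]


def h_example(x):
    """Hypothetical decision function: share x for 'grant', 1 - x for 'refuse'."""
    return {"grant": x, "refuse": 1.0 - x}


c_single = empirical_cost(E, h_example, cost_single_choice)
c_diversified = empirical_cost(E, h_example, cost_diversified)
print(c_single, c_diversified)
```

Learning in the conventional sense then amounts to searching for the parameter that makes such an average as small as possible.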
However, from the viewpoint of risk management, an approach that simply minimizes the empirical expected cost C_E(θ) of equation 7 may not be sufficient. Assume that, after training, behavior is adopted for M pieces of data. When M is large, the sum of their costs approaches M·C_D(θ), so there appears to be no problem in using C_E(θ) as the objective function of learning. When M is relatively small, however, this approximation does not hold. In addition, there are cases in which incurring a large cost is critical. For example, in a problem of deciding where to invest funds, several large mistakes occurring in succession is a serious problem that is directly connected to the risk of bankruptcy. When there is a possibility, however small its probability, that an unacceptably large cost will occur, a user will wish to avoid that risk as much as possible.
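The dependence of this approximation on M can be illustrated with a toy cost distribution. The sketch below assumes, purely for illustration, that each decision independently incurs a large cost with a small probability and zero cost otherwise; the probability and the cost value are hypothetical.

```python
P_LARGE = 0.01      # assumed probability of a large cost on one decision
LARGE_COST = 100.0  # assumed size of the large cost (otherwise the cost is 0)

# Expected cost of one decision, playing the role of C_D in this toy setting.
expected_cost = P_LARGE * LARGE_COST


def prob_at_least_one_large(m):
    """Probability that at least one of m independent decisions is costly."""
    return 1.0 - (1.0 - P_LARGE) ** m


def relative_std_of_total(m):
    """Standard deviation of the total cost over m decisions, divided by m * C_D."""
    var_one = P_LARGE * (1.0 - P_LARGE) * LARGE_COST ** 2
    return (m * var_one) ** 0.5 / (m * expected_cost)


for m in (10, 1000, 100000):
    print(
        m,
        m * expected_cost,
        round(prob_at_least_one_large(m), 4),
        round(relative_std_of_total(m), 4),
    )
```

In this toy setting the spread of the total cost relative to M·C_D shrinks in proportion to 1/√M, so the total is close to M·C_D only when M is large; for small M, a single large cost dominates the outcome.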
Furthermore, assume, for example, that there are two decision functions h1 and h2 that are expected to yield the same expected cost. The probability distribution of the cost under h1 has a high peak around the expected value, whereas the probability distribution of the cost under h2 has gentle slopes and a wide tail extending into the high-cost region. In this case, even though the expected costs are the same, h1 is presumably preferable, because the possibility of a high cost occurring is smaller. In such a case, minimizing the empirical expected cost C_E(θ) cannot be said to reflect the object correctly. Therefore, a learning method is desired that takes the distribution of the cost into consideration and avoids risk more actively.
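The comparison of h1 and h2 can be made concrete with two discrete cost distributions. The distributions and the threshold below are hypothetical values constructed so that both have the same expected cost; they are given only as a sketch of the situation described and do not represent any particular learning method.

```python
# Hypothetical cost distributions {cost: probability} for two decision functions.
cost_dist_h1 = {9.0: 0.25, 10.0: 0.5, 11.0: 0.25}  # sharply peaked around 10
cost_dist_h2 = {0.0: 0.6, 10.0: 0.2, 40.0: 0.2}    # wide, with a high-cost tail


def mean(dist):
    """Expected cost of a discrete distribution."""
    return sum(c * p for c, p in dist.items())


def variance(dist):
    """Variance of the cost around its expected value."""
    mu = mean(dist)
    return sum(p * (c - mu) ** 2 for c, p in dist.items())


def tail_probability(dist, threshold):
    """Probability that the cost is at least the given threshold."""
    return sum(p for c, p in dist.items() if c >= threshold)


HIGH_COST = 30.0  # hypothetical level regarded as unacceptable

for name, dist in (("h1", cost_dist_h1), ("h2", cost_dist_h2)):
    print(name, mean(dist), variance(dist), tail_probability(dist, HIGH_COST))
```

Both distributions have the same mean, so an objective based only on the expected cost cannot distinguish them, whereas the variance and the tail probability differ.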