1. Field of the Invention
The present invention relates to an importance degree calculation program, an importance degree calculation method, and an importance degree calculation apparatus for calculating an importance degree used in Memory-based Reasoning (MBR).
Definitions of the terms used in this specification are as follows.
“Variable” means a type of information such as age or gender.
“Category value” means a value represented by a character string such as “man” or “woman”. There is no order relationship between category values.
“Category value variable” means a variable whose value is the category value.
“Numeric variable” means a variable whose value is a numeric value such as age. There is an order relationship between values of the numeric variable.
“Objective variable” means the category value variable serving as a criterion for calculation of an importance degree (to be described later).
“Objective variable value” means a value of the objective variable.
“Objective variable distribution” means the relative frequency distribution of the objective variable. The frequencies of all objective variable values sum to 1.
“Explanatory variable” means a variable other than the objective variable, which serves as a calculation target in calculation of an importance degree (to be described later).
“Explanatory variable value” means a value of the explanatory variable.
“Instance” means a set of a plurality of explanatory variable values and one objective variable value.
“Instance set” means a set including a plurality of instances.
“Section” means a given range obtained by dividing the explanatory variable. In the case where “age” is used as an explanatory variable, the section indicates, e.g., a range from 20 to 29 years old.
“Importance degree” means importance of a given section of the explanatory variable in the instance set.
The importance degree is calculated with the objective variable as a criterion. For example, an instance set having two instances, each including the three explanatory variables “gender”, “age”, and “annual income” and the one objective variable “buying history”, is represented as follows.
Gender    Age    Annual income    Buying history
Man       30     3 million        Presence
Woman     20     4 million        Absence
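The definitions above can be made concrete with a minimal Python sketch of the two-instance example. The class name `Instance`, the field names, and the function `objective_distribution` are hypothetical names chosen for illustration, and storing “3 million” as the integer 3000000 is an assumption, since the specification gives no unit.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    gender: str          # explanatory variable (category value variable)
    age: int             # explanatory variable (numeric variable)
    annual_income: int   # explanatory variable (numeric variable, unit assumed)
    buying_history: str  # objective variable (category value variable)


# The instance set from the example: two instances.
instance_set = [
    Instance("Man", 30, 3_000_000, "Presence"),
    Instance("Woman", 20, 4_000_000, "Absence"),
]


def objective_distribution(instances):
    """Return the relative frequency of each objective variable value."""
    counts = Counter(inst.buying_history for inst in instances)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


distribution = objective_distribution(instance_set)
print(distribution)
```

Each objective variable value occurs once in this instance set, so the objective variable distribution assigns 0.5 to each value and sums to 1.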
2. Description of the Related Art
In recent years, the development of networks including the Internet, increases in storage density, and improvements in the performance and reductions in the price of computer components have made it easy to store a tremendous amount of information. Accordingly, in a POS (Point Of Sale) system used in the distribution industry, it has become possible to collect the sales records of branch shops around the nation in a computer system at the central office, and data on the relationship between time and goods sold are stored every second.
In other fields as well, a tremendous amount of information is stored and utilized, such as data indicating the relationship between the condition of various manufacturing equipment and the yield of commodities in the manufacturing industry, data on customers' credit card usage in the finance industry, and personal data and contract status of policyholders in the insurance industry. Further, there is an increasing demand to use the stored data to improve business efficiency.
Data analysis often requires determining which value (section) of which variable is important among a large number of variables. The value indicating this importance is called the importance degree.
The importance degree is used in MBR as disclosed in, e.g., Patent Document 1: Jpn. Pat. Appln. Laid-Open Publication No. 2005-302054. From an instance set whose objective variable is known, MBR extracts a plurality of instances that are close in terms of distance to an instance whose objective variable is unknown (unknown instance), and estimates the objective variable of the unknown instance by a majority vote among the extracted instances. At this time, the importance degree is used for calculation of the distance between instances, and emphasis is placed on the explanatory variable having a higher importance degree to increase accuracy in the estimation. Therefore, in order to make a highly accurate estimation in MBR, it is important to accurately calculate the importance degree (refer to Patent Document 1).
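The MBR procedure described above can be illustrated with a short Python sketch. It is not the method of Patent Document 1: the weighted sum of per-variable differences, the use of one weight per explanatory variable (rather than per section), and the names `weighted_distance` and `mbr_estimate` are all assumptions made for illustration.

```python
from collections import Counter


def weighted_distance(a, b, weights):
    """Weighted sum of per-variable differences (an assumed distance)."""
    total = 0.0
    for x, y, w in zip(a, b, weights):
        if isinstance(x, str) or isinstance(y, str):
            diff = 0.0 if x == y else 1.0   # category values: match or mismatch
        else:
            diff = abs(x - y)               # numeric values: absolute difference
        total += w * diff
    return total


def mbr_estimate(known, unknown, weights, k=3):
    """Estimate the objective variable by majority vote among the k
    known instances closest to the unknown instance.

    known: list of (explanatory_values, objective_value) pairs.
    """
    neighbours = sorted(
        known, key=lambda inst: weighted_distance(inst[0], unknown, weights)
    )[:k]
    votes = Counter(objective for _, objective in neighbours)
    return votes.most_common(1)[0][0]


# Explanatory variables: (gender, age). Objective variable: buying history.
known = [
    (("Man", 30), "Presence"),
    (("Man", 50), "Presence"),
    (("Woman", 31), "Absence"),
    (("Woman", 29), "Absence"),
    (("Man", 60), "Presence"),
]
unknown = ("Man", 30)

by_age = mbr_estimate(known, unknown, weights=(0.0, 1.0))
by_gender = mbr_estimate(known, unknown, weights=(1.0, 0.0))
print(by_age, by_gender)
```

With all weight on age, the three nearest instances are aged 30, 31, and 29, and the vote yields “Absence”. With all weight on gender, the three nearest instances are the three men, and the vote yields “Presence”. The estimate therefore depends directly on the weights, which is why an accurate importance degree matters.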
As a method for calculating the importance degree of a given value (explanatory variable value or section indicating a given range of e.g., age in the case of a numeric explanatory variable) of a given attribute (explanatory variable), there is a method of calculating the importance degree from the frequency distribution (objective variable distribution) of another given category value variable (objective variable) in an explanatory variable value. For example, as a weight calculation method disclosed in Patent Document 1, a method of calculating the importance degree based on a difference between an objective variable distribution in an explanatory variable value and the entire objective variable distribution using the following equation (1) is known.

$$q_v(c) = \frac{p(c \mid v)}{p(c)}, \qquad W_j(v) = \frac{\displaystyle\sum_{c} \left| \frac{q_v(c)}{\sum_{d} q_v(d)} - \frac{1}{N_c} \right|}{2 - \dfrac{2}{N_c}} \tag{1}$$
In the above equation, Nc is the number of types of objective variable values in the instance set, p(c|v) is the relative frequency of the objective variable value c in the explanatory variable value v (the j-th section vj of the explanatory variable), and p(c) is the relative frequency of the objective variable value c in the entire instance set. Incidentally, Σ denotes summation over all c or summation over all d.
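Equation (1) can be evaluated directly, as in the following minimal Python sketch. The function name `importance_degree` and the dictionary representation of the distributions are assumptions for illustration, and the sketch assumes that every objective variable value has a nonzero frequency p(c) in the entire instance set.

```python
def importance_degree(p_c_given_v, p_c):
    """Evaluate equation (1) for one section v.

    p_c_given_v: objective variable distribution within the section.
    p_c:         objective variable distribution in the entire instance set
                 (assumed nonzero for every objective variable value).
    """
    n_c = len(p_c)
    if n_c < 2:
        raise ValueError("equation (1) needs at least two objective values")
    # q_v(c) = p(c|v) / p(c)
    q = {c: p_c_given_v.get(c, 0.0) / p_c[c] for c in p_c}
    q_total = sum(q.values())
    if q_total == 0:
        raise ValueError("the section contains no instances")
    # Sum over c of |q_v(c) / sum_d q_v(d) - 1/Nc|, divided by (2 - 2/Nc)
    deviation = sum(abs(q[c] / q_total - 1.0 / n_c) for c in p_c)
    return deviation / (2.0 - 2.0 / n_c)


overall = {"Presence": 0.5, "Absence": 0.5}

same_as_overall = importance_degree({"Presence": 0.5, "Absence": 0.5}, overall)
skewed = importance_degree({"Presence": 0.75, "Absence": 0.25}, overall)
pure = importance_degree({"Presence": 1.0, "Absence": 0.0}, overall)
print(same_as_overall, skewed, pure)
```

A section whose objective variable distribution equals the overall distribution gets an importance degree of 0, and a section containing only one objective variable value gets 1. The denominator 2 − 2/Nc is the largest value the numerator can take, so the result lies between 0 and 1.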
At this time, it is necessary to accurately calculate the objective variable distribution in a given explanatory variable value in order to accurately calculate the importance degree.
To obtain the importance degree of an explanatory variable value, a conventional method comprises the following steps: dividing an explanatory variable into a plurality of sections in advance; calculating the objective variable distribution in each section; and using the calculated distribution of each section, without change, as the objective variable distribution for the explanatory variable values in that section.
In this method, however, when the explanatory variable is finely divided, the frequency in each section becomes lower, which decreases the reliability of the calculated objective variable distribution, with the result that an error is likely to occur. Conversely, when the explanatory variable is coarsely divided, it becomes impossible to follow a change in the real objective variable distribution, causing a difference between the calculated distribution and the real distribution. FIG. 20 shows an example in which an explanatory variable is coarsely divided so as not to cause an error. As can be seen from FIG. 20, a large difference is observed between the calculated objective variable distribution and the real distribution at the central portion.
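The trade-off between fine and coarse division can be reproduced with a small Python sketch of the fixed-section approach. The function name `section_distributions`, the equal-width sections, and the sample data are assumptions made for illustration and are not taken from FIG. 20.

```python
from collections import Counter, defaultdict


def section_distributions(values, objectives, width):
    """Divide a numeric explanatory variable into equal-width sections and
    return, for each section's lower bound, the pair
    (frequency, objective variable distribution)."""
    counts = defaultdict(Counter)
    for x, c in zip(values, objectives):
        lower = (x // width) * width   # lower bound of the section holding x
        counts[lower][c] += 1
    result = {}
    for lower, counter in sorted(counts.items()):
        n = sum(counter.values())
        result[lower] = (n, {c: k / n for c, k in counter.items()})
    return result


# Hypothetical sample data: age and buying history of eight instances.
ages = [21, 23, 25, 27, 32, 38, 41, 44]
buying = ["Presence", "Presence", "Absence", "Presence",
          "Absence", "Absence", "Presence", "Absence"]

coarse = section_distributions(ages, buying, width=10)
fine = section_distributions(ages, buying, width=2)
print(coarse)
print(fine)
```

With a width of 10, the section from 20 to 29 holds four instances and yields a distribution of 0.75 and 0.25, but any change within that range is averaged away. With a width of 2, every section holds a single instance, so each calculated distribution rests on a frequency of 1 and is unreliable.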
If the frequency used in calculation of the objective variable distribution is made higher, it is possible to increase reliability of the objective variable distribution and to reduce an error. Accordingly, a method of calculating the objective variable distribution by using a moving average can be considered.
However, the average width stays constant in a conventional method, so that a problem occurs when the frequency of the explanatory variable changes drastically. More specifically, in a low density part, the frequency within the average width becomes low, which decreases the reliability of the calculated objective variable distribution, so that an error is likely to occur in the importance degree obtained by using the calculated objective variable distribution. On the other hand, in a high density part, the objective variable distribution is calculated from more instances than the required frequency (i.e., including an unnecessary part), so that a difference is caused between the calculated objective variable distribution and the real distribution.
FIG. 21 shows an example in which the average width is set wide so that objective variable distribution having higher reliability can be obtained even in a low frequency part. As can be seen from FIG. 21, a large difference is observed between the obtained objective variable distribution and a real distribution in the central part.
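The effect of a constant average width can be seen in a minimal Python sketch. The function name `moving_average_distribution`, the centered window, and the sample data are assumptions made for illustration and are not taken from FIG. 21.

```python
from collections import Counter


def moving_average_distribution(values, objectives, center, width):
    """Objective variable distribution among the instances whose explanatory
    variable value lies within a window of constant width around center.
    Returns (frequency in the window, distribution)."""
    half = width / 2
    window = [c for x, c in zip(values, objectives) if abs(x - center) <= half]
    n = len(window)
    if n == 0:
        return 0, {}
    counts = Counter(window)
    return n, {c: k / n for c, k in counts.items()}


# Hypothetical sample data: dense around age 30, sparse around age 60.
ages = [28, 29, 29, 30, 30, 30, 31, 31, 32, 58, 63]
buying = ["Presence", "Presence", "Presence", "Absence", "Presence", "Absence",
          "Absence", "Absence", "Absence", "Presence", "Absence"]

n_dense, dist_dense = moving_average_distribution(ages, buying, center=30, width=4)
n_sparse, dist_sparse = moving_average_distribution(ages, buying, center=60, width=4)
print(n_dense, dist_dense)
print(n_sparse, dist_sparse)
```

The same width of 4 covers nine instances around age 30 but only one instance around age 60. Widening the window until the sparse part is reliable would make the window in the dense part cover far more instances than needed, which is the difficulty described above.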
As described above, for an explanatory variable whose frequency changes drastically, it has been difficult with a conventional method to obtain an objective variable distribution (and hence an importance degree) that is based on fine sections and is less subject to error.