1. Field of the Invention
The present invention relates to a computer-readable recording medium, an apparatus, and a method for calculating a scale parameter.
2. Description of the Related Art
Recent developments in networks typified by the Internet, increases in the capacity of storage media, and improvements in the performance and cost of computers have made it easy to store enormous amounts of information. As a way of capitalizing on such accumulated data in business, solving prediction problems has received attention. The solution uses a set of cases whose results are already known (hereinafter, “known cases”), corresponding to the accumulated data, to predict the result of a case whose result is unknown (hereinafter, “unknown case”).
Specific examples of prediction problems include: narrowing the addressees of direct mail to increase the response rate; predicting investment-credit risks; detecting illegal users of credit cards; and detecting unauthorized access to a network. One known solution to such a prediction problem is to search the set of known cases, corresponding to the accumulated data, for cases similar to the prediction target (the “unknown case”), and to predict the result of the unknown case from the set of similar cases thus retrieved (prediction based on similar cases).
In this method, each case includes a plurality of explanatory variables, each expressed by a numerical value (in the form of “explanatory variable: numerical value”; e.g., “age: 30” and “annual income: 4 million yen”), and a single objective variable expressed by a character string (in the form of “objective variable: character string”; e.g., “purchase status: purchased” and “purchase status: not purchased”). A known case has a known “objective variable” corresponding to the result, whereas an unknown case has an unknown “objective variable”. The objective variable of the unknown case is predicted from the objective variables of the similar case set retrieved from the known case set. To make the prediction, it is necessary to calculate a distance (inter-case distance) between the unknown case and each known case. However, explanatory variables generally have different distributions; for example, the “ages” and “annual incomes” of the cases are distributed over different ranges of values. Hence, normalization (scaling) is required.
For example, in Japanese Patent No. 3762840 by the present applicant, the Euclidean distance taken for each explanatory variable is divided by a scale parameter of that explanatory variable (e.g., its standard deviation over the known case set). This makes it possible to calculate a distance between a known case and an unknown case (hereinafter, “inter-case distance”) from the distances between each explanatory variable value of the known case and that of the unknown case (hereinafter, “inter-explanatory-variable distances”), while normalizing the values of each explanatory variable to a common distribution range.
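The scaled distance described above can be sketched in Python as follows. This is a minimal illustration and not the implementation of the cited patent: the function names, the choice of the population standard deviation (`statistics.pstdev`) as the scale parameter, and the three sample cases are all assumptions made for the example.

```python
import math
import statistics


def scale_parameters(known_cases):
    """Return one scale parameter per explanatory variable.

    Assumption: the scale parameter is the population standard deviation
    of that variable over the known case set.
    """
    columns = zip(*known_cases)
    return [statistics.pstdev(column) for column in columns]


def inter_case_distance(known, unknown, scales):
    """Combine the scaled inter-explanatory-variable distances.

    Each per-variable difference is divided by its scale parameter, and
    the results are combined as a Euclidean distance.
    """
    return math.sqrt(
        sum(((k - u) / s) ** 2 for k, u, s in zip(known, unknown, scales))
    )


# Hypothetical known cases: (age, annual income in ten thousand yen).
known_cases = [(20, 200), (30, 300), (40, 400)]
scales = scale_parameters(known_cases)

# Distance between the second known case and a hypothetical unknown case.
distance = inter_case_distance(known_cases[1], (40, 400), scales)
```

Because both variables are divided by their own scale parameters, a difference of 10 years and a difference of 1,000,000 yen contribute equally to the distance in this hypothetical data set.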
The above-mentioned conventional art is disadvantageous in that, when the explanatory variable values of the unknown case include a value that deviates significantly from the distribution of the explanatory variable values of the known cases (an outlier), it is difficult to obtain an accurate prediction result, even though the scale parameter is calculated from the explanatory variable values of the known case set.
More specifically, even when a standard deviation obtained from the known cases is used as the scale parameter, the outlier makes it difficult to perform sufficient scaling of the explanatory variable values of the unknown case. Hence, the outlier has a larger influence on the inter-case distance than other explanatory variable values, which deteriorates accuracy of prediction.
An example of making such a prediction will be described based on a known case set consisting of nine known cases with data names “#1” to “#9”, as shown in FIG. 14A. Each known case includes “age (years old)” and “annual income (ten thousand yen)” as the explanatory variables, and “purchase status: purchased or not purchased” as the objective variable. Over the nine known cases, the standard deviation of “age” is 8.2 years, and that of “annual income” is 820,000 yen (82 in units of ten thousand yen). Therefore, 8.2 is the scale parameter of “age”, and 82 is the scale parameter of “annual income”.
An example of making a prediction about the objective variable “purchase status” of an unknown case shown in FIG. 14B, having “age: 50” and “annual income: 450” as the explanatory variables, will be described below. The inter-explanatory-variable distance of “age” between case #1 and the unknown case is a value (hereinafter, “first value”) obtained by dividing the absolute difference between 30 and 50 by 8.2. The inter-explanatory-variable distance of “annual income” between case #1 and the unknown case is a value (hereinafter, “second value”) obtained by dividing the absolute difference between 300 and 450 by 82. The inter-case distance between case #1 and the unknown case is calculated as the square root of the sum of the square of the first value and the square of the second value.
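The arithmetic for case #1 can be checked with a short Python sketch. It uses only the values stated above (30 and 300 for case #1, 50 and 450 for the unknown case, and the scale parameters 8.2 and 82); the variable names are chosen for illustration.

```python
import math

AGE_SCALE = 8.2      # scale parameter of "age" (years)
INCOME_SCALE = 82.0  # scale parameter of "annual income" (ten thousand yen)

# First value: scaled distance for "age" between case #1 and the unknown case.
first_value = abs(30 - 50) / AGE_SCALE
# Second value: scaled distance for "annual income".
second_value = abs(300 - 450) / INCOME_SCALE

# Inter-case distance: square root of the sum of the squared values.
inter_case_distance = math.sqrt(first_value**2 + second_value**2)

print(round(first_value, 2), round(second_value, 2), round(inter_case_distance, 2))
# prints: 2.44 1.83 3.05
```

Here the two explanatory variables contribute on a comparable scale, which is the intended effect of dividing by the scale parameters.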
The table shown in FIG. 14B contains the inter-case distances between the unknown case and each of the nine known cases, arranged in increasing order of distance. When the top three known cases (#6, #9, and #5) are retrieved as cases similar to the unknown case, their objective variables “purchase status” are all “purchased”. Hence, the unknown case having “age: 50” and “annual income: 450” as the explanatory variables can be accurately predicted to have “purchase status: purchased” as the objective variable.
Meanwhile, another example of making a prediction about the objective variable “purchase status” of an unknown case having “age: 50” and “annual income: 800” as the explanatory variables will be described. Referring to the table shown in FIG. 14C, the inter-case distances between the unknown case and each of the nine known cases are calculated using the same scale parameters and arranged in increasing order of distance. The top three known cases (#9, #8, and #7) retrieved as cases similar to the unknown case include case #7, which has “purchase status: not purchased” as the objective variable. Case #7 is undesirably retrieved because the explanatory variable “annual income: 800” of the unknown case deviates significantly from the distribution of the “annual income” values of the known cases; that is, it is an outlier.
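The effect of the outlier can be illustrated with the case #1 values given earlier (age 30, annual income 300) and the same scale parameters. The sketch below compares how much of the squared inter-case distance comes from “annual income” for the two unknown cases. The "share" figure is an illustrative diagnostic introduced here, not a quantity defined in the cited art, and the function name is hypothetical.

```python
AGE_SCALE = 8.2      # scale parameter of "age" (years)
INCOME_SCALE = 82.0  # scale parameter of "annual income" (ten thousand yen)
CASE_1 = (30, 300)   # (age, annual income) of known case #1


def income_share(unknown):
    """Fraction of the squared inter-case distance due to annual income."""
    age_term = ((CASE_1[0] - unknown[0]) / AGE_SCALE) ** 2
    income_term = ((CASE_1[1] - unknown[1]) / INCOME_SCALE) ** 2
    return income_term / (age_term + income_term)


share_typical = income_share((50, 450))  # unknown case without an outlier
share_outlier = income_share((50, 800))  # unknown case with the outlier

print(round(share_typical, 2), round(share_outlier, 2))
# prints: 0.36 0.86
```

With “annual income: 800”, the income term accounts for most of the squared distance to case #1, so the ranking of similar cases is driven largely by income, and “age” has little influence on which known cases are retrieved.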