Data mining is a technique for finding useful, previously unknown knowledge in a large amount of information. To conduct data mining efficiently, a process that generates a new feature by processing the features used for the data mining is important.
One known method for generating a new feature represents each feature as a two-valued feature and combines the two-valued features with AND/OR operators, which generates a logical formula as a new feature.
For example, the day of the week can be represented as seven two-valued features (IS_Sunday, IS_Monday, IS_Tuesday, IS_Wednesday, IS_Thursday, IS_Friday, and IS_Saturday). Similarly, whether a time is ante meridiem or post meridiem can be represented as two two-valued features (IS_a.m. and IS_p.m.).
Based on the two-valued features, a new feature “weekend post meridiem” can be generated. Specifically, a logical formula that is a combination of the two-valued features with AND/OR operators “(IS_Saturday AND IS_p.m.) OR (IS_Sunday AND IS_p.m.)” represents a feature “weekend post meridiem”.
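As a minimal sketch of this example, the following Python snippet evaluates the logical formula for "weekend post meridiem" on a single record. The record layout (a dictionary of boolean values keyed by feature name), the function name `is_weekend_pm`, and the sample records are assumptions made for illustration only.

```python
def is_weekend_pm(record):
    """Evaluate (IS_Saturday AND IS_p.m.) OR (IS_Sunday AND IS_p.m.)."""
    return bool(
        (record["IS_Saturday"] and record["IS_p.m."])
        or (record["IS_Sunday"] and record["IS_p.m."])
    )


# Hypothetical records; only the features used by the formula are listed.
saturday_evening = {"IS_Saturday": True, "IS_Sunday": False, "IS_p.m.": True}
sunday_morning = {"IS_Saturday": False, "IS_Sunday": True, "IS_p.m.": False}
monday_evening = {"IS_Saturday": False, "IS_Sunday": False, "IS_p.m.": True}

for name, record in [
    ("saturday_evening", saturday_evening),
    ("sunday_morning", sunday_morning),
    ("monday_evening", monday_evening),
]:
    print(name, is_weekend_pm(record))
```

Only the Saturday evening record satisfies the formula; the other two fail either the weekend condition or the post meridiem condition.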
In order to solve an actual problem, it is often necessary to generate a new feature by appropriately combining features as described above. It is, however, not easy to find an appropriate way of combining features. For example, when the original data includes 100 features and five of the 100 features are combined with AND/OR operators, the number of possible logical formulae is on the order of 100^5 × 2^4 (in other words, 160 billion). Thus, simply combining the two-valued features consumes a large amount of memory and an immense amount of calculation time.
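The estimate of roughly 160 billion formulae can be reproduced with a short calculation. The sketch below assumes one way of reading that estimate: each of the five positions in a formula is filled independently from the 100 features, and each of the four gaps between them is filled with either AND or OR. Parenthesization and duplicate formulae are ignored, and the variable names are illustrative.

```python
num_features = 100          # features in the original data
features_per_formula = 5    # features combined in one formula
operators = ("AND", "OR")   # operators allowed between features

# Assumption: each position is filled independently from all features.
feature_choices = num_features ** features_per_formula

# Assumption: one operator is chosen for each gap between adjacent features.
operator_choices = len(operators) ** (features_per_formula - 1)

total_formulae = feature_choices * operator_choices
print(f"{total_formulae:,}")  # 160,000,000,000
```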
NPL 1 and NPL 2 describe methods for enumerating features. In the methods described in NPL 1 and NPL 2, features that are combinations of features with AND operators are enumerated, and then the enumerated features are combined with OR operators, which generates a new feature in disjunctive normal form (DNF).
NPL 3 describes a method for extracting frequently used patterns in DNF. NPL 4 describes an exemplary method for assessing features.
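The two-stage idea of enumerating AND combinations first and then joining them with OR can be sketched as follows. This is a naive brute-force illustration and not the algorithm of NPL 1 or NPL 2; the function names, the size limits, and the tuple-based representation of formulae are all assumptions.

```python
from itertools import combinations


def enumerate_conjunctions(names, max_size):
    """List every AND combination of up to max_size distinct features."""
    conjunctions = []
    for size in range(1, max_size + 1):
        conjunctions.extend(combinations(names, size))
    return conjunctions


def enumerate_dnf(conjunctions, max_terms):
    """List every OR combination of up to max_terms distinct conjunctions."""
    formulas = []
    for count in range(1, max_terms + 1):
        formulas.extend(combinations(conjunctions, count))
    return formulas


def evaluate_dnf(formula, record):
    """A DNF formula holds if all features of at least one term are true."""
    return any(all(record[name] for name in term) for term in formula)


names = ["IS_Saturday", "IS_Sunday", "IS_p.m."]
conjunctions = enumerate_conjunctions(names, max_size=2)
formulas = enumerate_dnf(conjunctions, max_terms=2)
print(len(conjunctions), len(formulas))

# The "weekend post meridiem" feature appears among the enumerated formulas.
weekend_pm = (("IS_Saturday", "IS_p.m."), ("IS_Sunday", "IS_p.m."))
print(weekend_pm in formulas)
```

Even with three features and small limits the enumeration already yields 21 formulae, which illustrates why exhaustive enumeration becomes costly as the number of features grows.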