This application is based upon and claims the benefit of priority from the prior Japanese Patent Application No. 2000-098977, filed Mar. 31, 2000, the entire contents of which are incorporated herein by reference.
The present invention relates to a method of performing data mining tasks for generating a decision tree from a large scale database and an apparatus therefor.
Data mining has received a great deal of attention as a technique for extracting knowledge from a large scale database. A variety of data mining techniques have been proposed, such as decision trees, neural networks, association rule finding, and clustering. These techniques extract features hidden in a database and are expected to be applied to a variety of fields such as marketing.
A database serving as a mining target is generally not a database operating in a mission critical system but a separate database (data warehouse) constructed from periodic snapshots. Database updates are not reflected in real time; in practice, data are inserted into the database in a batch at predetermined intervals. For this reason, to grasp the tendency of the database as a whole, data mining must be performed on the entire database every time data are periodically inserted. A database subjected to data mining often holds a large amount of data, so performing data mining on the entire database at every insertion takes a long time.
A decision tree is a typical example of a data mining technique. A tree is created in which each node holds a condition for classifying records in a database. A new record is classified by applying it to the tree from the root. For example, in retail business, a decision tree is used to limit direct mail destinations to appropriate customers on the basis of the customers' purchase histories and attributes.
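The classification of a new record from the root can be illustrated by the following minimal Python sketch. The `Node` structure, the `classify` function, and the attribute names `spend` and `age` are hypothetical names chosen for illustration, and the sketch assumes binary tests on continuous attributes only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    # An internal node holds the test "record[attribute] <= threshold";
    # a leaf node holds only a class label.
    attribute: Optional[str] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None    # followed when the test is true
    right: Optional["Node"] = None   # followed when the test is false
    label: Optional[str] = None      # set only on leaf nodes


def classify(node: Node, record: dict) -> str:
    """Apply a record from the root downward until a leaf node is reached."""
    while node.label is None:
        node = node.left if record[node.attribute] <= node.threshold else node.right
    return node.label


# Hypothetical tree: direct mail goes to customers who spent more than 100
# and are 40 years old or younger.
tree = Node(
    attribute="spend", threshold=100.0,
    left=Node(label="skip"),
    right=Node(attribute="age", threshold=40.0,
               left=Node(label="mail"), right=Node(label="skip")),
)

print(classify(tree, {"spend": 250.0, "age": 30}))
```

Each record visits one node per level, so classifying a record costs time proportional to the depth of the tree and requires no access to the training set.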
In the decision tree technique, a tree structure is created on the basis of data in a table format (called a training set). A plurality of attributes and one class are assigned to each record of the data in the table format. The attributes are used for classifying each record into one of the classes. Each attribute may take a category value (categorical value) or a continuous value.
In the method of creating a decision tree, a node is generated at the root of the tree so as to optimally divide the training set, and the training set is divided accordingly. Nodes are then generated repeatedly to further optimally divide each of the divided training sets.
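The repeated division described above can be sketched in Python as follows. The text does not name the measure by which a division is judged optimal; this sketch assumes the Gini index and binary divisions on continuous attributes, and the names `gini`, `best_split`, `build`, and the sample records are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a single class)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, attributes):
    """Return (score, attribute, threshold) with the lowest weighted impurity, or None."""
    best = None
    for attr in attributes:
        # The largest value is skipped because it would leave the right side empty.
        for t in sorted({r[attr] for r in rows})[:-1]:
            left = [r["cls"] for r in rows if r[attr] <= t]
            right = [r["cls"] for r in rows if r[attr] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, t)
    return best


def build(rows, attributes):
    """Divide the training set recursively; a leaf is {"label": ..., "rows": ...}."""
    labels = [r["cls"] for r in rows]
    split = best_split(rows, attributes)
    if gini(labels) == 0.0 or split is None:
        return {"label": Counter(labels).most_common(1)[0][0], "rows": rows}
    _, attr, t = split
    return {
        "attr": attr, "threshold": t,
        "left": build([r for r in rows if r[attr] <= t], attributes),
        "right": build([r for r in rows if r[attr] > t], attributes),
    }


# Hypothetical training set: two attributes and one class per record.
training_set = [
    {"age": 25, "spend": 300, "cls": "mail"},
    {"age": 30, "spend": 250, "cls": "mail"},
    {"age": 55, "spend": 280, "cls": "skip"},
    {"age": 60, "spend": 50, "cls": "skip"},
]
tree = build(training_set, ["age", "spend"])
```

Note that `best_split` scans every record of the set it is given at every level, which is the repeated access that makes generation from a large database slow.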
In generating the decision tree as described above, the entire training set must be accessed in order to select an optimal division. Access to the database is required every time a division is repeated. Therefore, it takes much time to generate a decision tree from a large scale database.
The conventional decision tree technique requires recreating a decision tree every time data is inserted into the database or deleted.
It is an object of the present invention to provide a method of performing data mining tasks for efficiently generating a decision tree which reflects the latest contents of the database by applying a decision tree already created to only an inserted or deleted portion, and an apparatus for performing data mining tasks.
It is another object of the present invention to provide a method of performing data mining tasks to divide a leaf node corresponding to insertion data from a decision tree created by a data set into which the insertion data are not inserted.
It is still another object of the present invention to provide a method of performing data mining tasks such that, when deletion data is input, a given node connected to a leaf node, corresponding to the deletion data, of a decision tree created by a data set from which the deletion data is not deleted is merged with another leaf node connected to the given node.
It is still another object of the present invention to provide a method of performing data mining tasks, in which when insertion or deletion data is input, an evaluation value by the division about a passing node is recalculated for a decision tree created by a data set into which the insertion data is not inserted or from which the deletion data is not deleted, and when the recalculated division evaluation value satisfies a specific condition, a partial tree below the passing node is reconstructed.
According to the present invention, there is provided a data mining apparatus using a decision tree, comprising an application section which, when insertion data is input, applies the insertion data to a decision tree created by a data set into which the insertion data is not inserted, to generate an application result; and a modification section which divides a leaf node corresponding to the insertion data in accordance with the application result to modify the decision tree.
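The application and modification steps for insertion data may be illustrated by the following Python sketch. It is not the claimed implementation: it assumes that each leaf node keeps the records that reached it, that a division is judged by the Gini index, and that a leaf is divided once when it comes to hold more than one class. The names `impurity`, `make_leaf`, `insert`, and the sample records are hypothetical.

```python
from collections import Counter


def impurity(rows):
    """Gini impurity of the class labels in rows (0.0 means a single class)."""
    n = len(rows)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(r["cls"] for r in rows).values())


def make_leaf(rows):
    """A leaf keeps its records so it can be divided later without the full data set."""
    return {"label": Counter(r["cls"] for r in rows).most_common(1)[0][0], "rows": rows}


def insert(tree, record, attributes):
    # Application step: follow the existing tests down to the leaf the record reaches.
    leaf = tree
    while "label" not in leaf:
        leaf = leaf["left"] if record[leaf["attr"]] <= leaf["threshold"] else leaf["right"]
    leaf["rows"].append(record)
    rows = leaf["rows"]
    if impurity(rows) == 0.0:
        return  # the leaf still holds a single class; no division is needed
    # Modification step: search for the best division using only this leaf's records.
    best = None
    for attr in attributes:
        for t in sorted({r[attr] for r in rows})[:-1]:
            left = [r for r in rows if r[attr] <= t]
            right = [r for r in rows if r[attr] > t]
            score = (len(left) * impurity(left) + len(right) * impurity(right)) / len(rows)
            if best is None or score < best[0]:
                best = (score, attr, t, left, right)
    if best is None:
        leaf["label"] = make_leaf(rows)["label"]  # the records cannot be separated
        return
    _, attr, t, left, right = best
    leaf.clear()  # turn the leaf into an internal node in place
    leaf.update(attr=attr, threshold=t, left=make_leaf(left), right=make_leaf(right))


# Hypothetical tree created before the insertion data arrived.
tree = {
    "attr": "spend", "threshold": 100,
    "left": make_leaf([{"age": 60, "spend": 50, "cls": "skip"}]),
    "right": make_leaf([{"age": 25, "spend": 300, "cls": "mail"},
                        {"age": 30, "spend": 250, "cls": "mail"}]),
}
insert(tree, {"age": 55, "spend": 280, "cls": "skip"}, ["age", "spend"])
```

Only the records held by the affected leaf are examined, so the cost of the modification does not depend on the size of the whole database.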
According to the present invention, there is also provided a data mining apparatus using a decision tree, comprising an application section which, when deletion data is input, applies the deletion data to a decision tree created by a data set from which the deletion data is not deleted, and generates an application result; and a modification section which merges a given node connected to a leaf node of the decision tree created by the data set, the leaf node corresponding to the deletion data, with another node connected to the given node to modify the decision tree.
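The handling of deletion data may be sketched in Python as follows. The condition under which nodes are merged is not specified at this point in the text; the sketch assumes that the given node is replaced by a single leaf when the leaf reached by the deletion data becomes empty, or carries the same class as its sibling leaf. The function name `delete` and the sample tree are hypothetical.

```python
def delete(tree, record):
    """Remove a record from the leaf it reaches; return True if a merge took place."""
    # Application step: walk down to the leaf, remembering the node it hangs from.
    parent, node = None, tree
    while "label" not in node:
        parent = node
        node = node["left"] if record[node["attr"]] <= node["threshold"] else node["right"]
    node["rows"].remove(record)  # raises ValueError if the record was never in the leaf
    if parent is None:
        return False  # the tree is a single leaf; there is nothing to merge
    sibling = parent["right"] if node is parent["left"] else parent["left"]
    if "label" not in sibling:
        return False  # the sibling is a subtree; this sketch leaves it untouched
    if node["rows"] and node["label"] != sibling["label"]:
        return False  # the two leaves still separate different classes
    # Modification step: replace the given node (the parent) by one merged leaf.
    merged_rows = node["rows"] + sibling["rows"]
    merged_label = sibling["label"]
    parent.clear()
    parent.update(label=merged_label, rows=merged_rows)
    return True


# Hypothetical tree created before the deletion data arrived.
old_record = {"age": 55, "spend": 280, "cls": "skip"}
tree = {
    "attr": "spend", "threshold": 100,
    "left": {"label": "skip", "rows": [{"age": 60, "spend": 50, "cls": "skip"}]},
    "right": {
        "attr": "age", "threshold": 30,
        "left": {"label": "mail", "rows": [{"age": 25, "spend": 300, "cls": "mail"},
                                            {"age": 30, "spend": 250, "cls": "mail"}]},
        "right": {"label": "skip", "rows": [dict(old_record)]},
    },
}
merged = delete(tree, old_record)
```

As with insertion, only the leaf reached by the deletion data and its sibling are touched.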
According to the present invention, there is also provided a data mining apparatus using a decision tree, comprising an application section which, when insertion data or deletion data is input, applies the insertion or deletion data to a decision tree created by a data set into or from which the data is not inserted or deleted, recalculates an evaluation value by division about a passing node through which the data passes, and generates the recalculated evaluation value; and a modification section which reconstructs a partial tree below the passing node when the evaluation value satisfies a specific condition.
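The recalculation of the evaluation value at each passing node may be sketched in Python as follows. The evaluation measure and the specific condition are not defined at this point in the text; the sketch assumes that each internal node keeps per-class record counts for its two branches, that the evaluation value is the weighted Gini index of the division, and that the condition is the value exceeding a limit. The names `evaluation`, `apply_change`, and `limit` are hypothetical, and the reconstruction of the flagged partial tree itself is not shown.

```python
from collections import Counter


def evaluation(left: Counter, right: Counter) -> float:
    """Weighted Gini index of a division, computed from per-class counts only."""
    def g(counts):
        n = sum(counts.values())
        return 0.0 if n == 0 else 1.0 - sum((v / n) ** 2 for v in counts.values())
    n_left, n_right = sum(left.values()), sum(right.values())
    total = n_left + n_right
    return 0.0 if total == 0 else (n_left * g(left) + n_right * g(right)) / total


def apply_change(tree, record, delta, limit):
    """Apply insertion (delta=+1) or deletion (delta=-1) data to the tree.

    Returns the passing nodes, root first, whose recalculated evaluation value
    exceeds limit; the partial tree below such a node is a candidate for
    reconstruction.
    """
    flagged, node = [], tree
    while "label" not in node:
        side = "left" if record[node["attr"]] <= node["threshold"] else "right"
        node["counts"][side][record["cls"]] += delta
        if evaluation(node["counts"]["left"], node["counts"]["right"]) > limit:
            flagged.append(node)
        node = node[side]
    return flagged


# Hypothetical one-level tree with the class counts recorded when it was created.
tree = {
    "attr": "age", "threshold": 30,
    "counts": {"left": Counter(mail=2), "right": Counter(skip=2)},
    "left": {"label": "mail"},
    "right": {"label": "skip"},
}
record = {"age": 28, "cls": "skip"}
after_insert = apply_change(tree, record, +1, limit=0.2)
after_delete = apply_change(tree, record, -1, limit=0.2)
```

Because the evaluation value is recomputed from stored counts, the update costs one counter change per passing node and needs no access to the original data set.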
According to the present invention, since the entire large scale database need not be accessed again, the data mining operation performed every time data is inserted or deleted can be performed at higher speed.