1. Field of the Invention
The present invention relates to a data mining apparatus for discovering an unknown rule hidden in data by a mathematical method such as clustering or classification and to a storage medium in which a data mining processing program has been stored. More particularly, the invention relates to a data mining apparatus for displaying an unknown rule discovered by data mining so that the user can easily understand it and for enabling the unknown rule to be externally utilized and to a storage medium in which a data mining processing program has been stored.
2. Description of the Related Arts
In recent years, attention has been paid to data mining, which uses a mathematical method to automatically discover an unknown rule in a large amount of data, on the order of gigabytes or terabytes, accumulated over a long period. Data mining includes a “discovery-type approach,” which classifies and refines information on the basis of a certain hidden rule and thereby automatically finds information that cannot be found manually, and a “verification-type approach,” which analyzes uncertain known information and adds certainty to it.
Conventionally, data mining is performed by calling an engine through an application interface and reporting the result. There are various methods of reporting the result, but a highly visible display format suited to each analysis algorithm has not yet been established. Accordingly, although data mining engines offer advanced functions and performance, data mining is not often introduced into general systems.
Data mining includes clustering, which classifies data having similar characteristics into clusters (classes) and extracts an unknown rule, and classification, which targets a group of data having a plurality of analysis items and extracts an unknown rule by expressing the characteristics of a specific analysis item as a function or a profile that uses the other analysis items as condition values. Clustering automatically collects similar data into the same group by using a conventional algorithm such as the Ward method. In this case, the data can be divided into any number of groups designated by the user. In JP-A-11-15897, the results obtained by clustering data at a designated division number are plotted on the axes of a plurality of analysis items of a parallel coordinate graph, and a polygonal line for each record is overlaid on the graph and displayed. Although clustering divides the data into groups on the basis of the designated division number, the optimum division number cannot be found readily even when the clustering result is expressed on the parallel coordinate graph. To obtain the optimum division number, the user must examine the axes of the plurality of analysis items, analyze the tendency of the data, and judge which division number is best before finally arriving at the proper division number. When the division number is large or the range of division numbers is wide, deciding the proper division number becomes extremely troublesome.
On the other hand, classification generally uses a decision tree or a regression tree. In many cases, a rule extracted by the decision tree or regression tree algorithm is visualized as a tree diagram that branches on the basis of automatically generated condition values.
However, the tree diagram expressing the result of the classification tends to become a complicated multilayer structure that starts at a root, branches at nodes over multiple stages, and finally reaches a leaf along each branch. It is difficult to grasp a significant rule from such a tree diagram. Moreover, the information expressed in the tree diagram is formed merely as drawing information, and its only use is for the user to discover a significant rule by inspecting it.
According to the invention, there is provided a data mining apparatus for improving a display of a rule discovered by data mining, thereby enabling the user to easily understand it and easily discover a rule having significance.
According to the invention, there is provided a data mining apparatus in which a rule discovered by data mining can be used by an external application.
According to the invention, there is provided a data mining apparatus for discovering an unknown rule included in a data group, comprising a clustering processing unit and a classification processing unit which function as a data mining engine.
According to the invention, first, the clustering process has the following features.
(Simultaneous Display of the Classification Result and the Division Number)
The data mining apparatus of the invention comprises: a division number designating unit for designating a division range from 2-division to an arbitrary division number N; a clustering processing unit for classifying, with respect to a group of data having a plurality of analysis items as targets, data having similar characteristics into a plurality of clusters (classes) for each division number within the range from 2-division to the designated division number N; and a display processing unit for simultaneously displaying the plurality of processing results obtained by the clustering processing unit.
In particular, the display processing unit displays a parallel coordinate graph in which the classification result at the designated division number N is plotted on the axis of each analysis item as a polygonal line, and arranges beside it the dividing axes from 2-division to the designated division number N (for example, N=5), thereby simultaneously displaying the transition of the division and the connection between the classification results by polygonal lines. Because the transition of the division, shown by the dividing axes from 2-division to the designated division number (for example, 5-division), and the clustering result at the designated division number are arranged and displayed together, it is unnecessary to analyze again, from another viewpoint, why the data has been classified into a specific group among the divided groups, and the proper division number can be determined easily. In other words, by comparing a plurality of analysis items simultaneously, the user can tell which grouping is best when customer information or the like is grouped, so that the clustering can be put to use in a specific business field.
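The idea of producing one clustering result for every division number from 2 to N in a single pass can be sketched as follows. This is only an illustrative sketch, not the apparatus's actual implementation: it assumes an agglomerative procedure with a Ward-style merge cost, and the names `ward_cost` and `cluster_all_divisions` are hypothetical.

```python
from itertools import combinations


def ward_cost(a, b):
    """Ward-style cost of merging clusters a and b (lists of numeric tuples)."""
    ca = [sum(col) / len(a) for col in zip(*a)]
    cb = [sum(col) / len(b) for col in zip(*b)]
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def cluster_all_divisions(records, max_divisions):
    """Return {k: clusters} for every division number k from 2 to max_divisions.

    Assumption: clusters are built bottom-up, so the result for k-division
    is obtained from the result for (k+1)-division by one merge.
    """
    clusters = [[r] for r in records]
    results = {}
    while True:
        k = len(clusters)
        if 2 <= k <= max_divisions:
            results[k] = [list(c) for c in clusters]
        if k <= 2:
            break
        # Merge the pair of clusters with the smallest Ward-style cost.
        i, j = min(combinations(range(k), 2),
                   key=lambda p: ward_cost(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return results
```

Because every level is kept, a display routine could draw the dividing axes for 2-division through N-division from the single `results` mapping without re-running the clustering.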
(Annual Ring Display of the Classification Results and the Division Numbers)
The display processing unit converts the classification result for each division number from 2-division to the designated division number N into an annual ring diagram and displays it. The annual ring diagram arranges the division numbers in increasing order from the innermost ring toward the outermost ring and expresses the data distance between the divided clusters as the width (thickness) of each ring in the radial direction, so that the division number of the ring having the largest width can be recognized as the proper division number. Clustering divides a large amount of data into groups having similar tendencies by a unique algorithm; the user designates the division number at the time of division and also judges whether the designated division number is proper. According to the annual ring diagram of the present invention, the proper division number can be presented to the user by displaying the significance of the division for each division number. Consequently, grouping based on a plurality of analysis items, such as customer information, can be performed in a meaningful way.
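As a rough illustration of how the ring widths lead to a proper division number, the sketch below takes one width per division number and picks the widest ring. How the inter-cluster data distance is computed is not specified here, so the widths are treated as given inputs; the function names and the text rendering are hypothetical.

```python
def proper_division(ring_widths):
    """Return the division number whose ring is widest.

    ring_widths maps a division number to the data distance between
    clusters at that division (assumed to be computed elsewhere).
    """
    return max(ring_widths, key=ring_widths.get)


def render_rings(ring_widths, scale=20):
    """Draw one text bar per ring, innermost (smallest division number) first."""
    widest = max(ring_widths.values())
    lines = []
    for k in sorted(ring_widths):
        bar = "#" * max(1, round(ring_widths[k] / widest * scale))
        lines.append(f"{k}-division {bar}")
    return lines
```

For example, with hypothetical widths `{2: 3.0, 3: 9.0, 4: 1.5, 5: 0.5}`, the 3-division ring is the widest and would be presented as the proper division number.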
The invention has the following characteristics as a classification.
(Folding of the Node)
The data mining apparatus of the invention comprises: a classification processing unit for forming, with respect to a data group having a plurality of analysis items as targets, the characteristics of a specific analysis item among the plurality of analysis items by predicting an unknown rule that uses the other analysis items as condition values; and a display processing unit for, when the formation result of the classification processing unit is expressed and displayed as a tree diagram, converting it into a tree diagram in which nodes having no significance are not displayed and displaying that tree diagram. The plurality of analysis items of each data group are called attributes or segments of the data. In terms of attributes, classification can be described as a method of forming a function or a profile for predicting a specific attribute from the values of the other attributes. In a tree diagram formed as a classification by the decision tree algorithm, the significance of the nodes and leaves is reflected by pruning that is performed mechanically on the basis of a confidence degree. However, when information of various analysis items is classified by the decision tree, the number of nodes and leaves becomes enormous, so that it is impossible to discover important information by eye. According to the invention, unnecessary branching conditions in the tree diagram are not displayed, and the relation between the nodes and the leaves is displayed simply. Consequently, for an item whose characteristics the user wishes to know, it is easy to grasp by what kind of rule, using the other analysis items as conditions, the information has been classified. This supports operations such as grasping customer characteristics in customer information.
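The folding of insignificant nodes can be sketched as below. The criterion for "no significance" is an assumption made for illustration (an internal node whose confidence falls below a threshold), and `Node`, `fold`, and `fold_tree` are hypothetical names; leaves are always kept and are promoted to the nearest displayed ancestor.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    condition: str
    confidence: float
    children: list = field(default_factory=list)


def fold(node, min_confidence):
    """Return the list of nodes to draw in place of `node`."""
    shown_children = [n for child in node.children
                      for n in fold(child, min_confidence)]
    is_leaf = not node.children
    if is_leaf or node.confidence >= min_confidence:
        return [Node(node.condition, node.confidence, shown_children)]
    # Assumed rule: hide a low-confidence internal node, promote its descendants.
    return shown_children


def fold_tree(root, min_confidence):
    """Fold every node below the root; the root itself is always displayed."""
    children = [n for child in root.children
                for n in fold(child, min_confidence)]
    return Node(root.condition, root.confidence, children)
```

A fuller version would probably keep the hidden conditions attached to the promoted nodes so that the complete rule can still be read when a folded branch is expanded.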
(Narrowing Conditions)
There is provided a narrowing condition designating unit for narrowing down, by user designation, the range of data processed in the classification processing unit of the invention. The narrowing condition designating unit narrows down the range of the number of layers in the classification, the range of the number of records, the range of each item value, and the like in accordance with the user designation. The data subjected to data mining may amount to gigabytes or terabytes. When all of the data is used, it takes a very long time to analyze the data and display the result. According to the invention, since the range of data handled in the mining can be designated, a large amount of data is narrowed down and the mining analysis can be performed in a short time. Since only the data matching the necessary conditions can be extracted from the mining result, a significant rule can be extracted easily. The user designation of narrowing conditions also applies to the clustering. In the clustering, the narrowing condition designating unit narrows down the range of the number of records, the range of the item values, and the like on the basis of the user designation.
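A minimal sketch of narrowing by item-value ranges and by number of records might look like this. The record layout (a dictionary per record) and the `narrow` function with its parameters are assumptions for illustration only.

```python
def narrow(records, item_ranges=None, max_records=None):
    """Keep records whose item values fall inside the designated ranges.

    item_ranges maps an item name to an inclusive (low, high) pair;
    max_records caps how many matching records are passed to the mining step.
    """
    item_ranges = item_ranges or {}
    selected = [
        r for r in records
        if all(low <= r[item] <= high for item, (low, high) in item_ranges.items())
    ]
    return selected if max_records is None else selected[:max_records]
```

Applying such a filter before the engine is called is what shortens the analysis; applying it to the mining result instead would serve the second use described above, extracting only the data that matches the necessary conditions.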
(Improvement of the Tree Diagram)
When the formation result of the unknown rule extracted from the data by the classification processing unit is expressed and displayed as a tree diagram, the display processing unit varies the shapes, colors, and/or sizes of the nodes and leaves on the basis of a plurality of attributes. For example, the display processing unit changes the shapes, colors, and/or sizes expressing the nodes and leaves of the tree diagram by using the number of records and the confidence degree as attributes. In many cases, the rule formed as a decision tree and numerical information such as the number of records and the confidence degree are basically shown as character information in the tree diagram. According to the invention, by expressing the numerical information through the shapes, colors, and the like of the nodes and leaves, the tendency of the data can be grasped more intuitively.
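One way to map the numerical attributes onto a node's appearance is sketched below. The particular mapping (record count to diameter, confidence to a red-to-green color) and the function name `node_style` are illustrative assumptions, not the display rules of the apparatus.

```python
def node_style(records, confidence, max_records):
    """Derive drawing attributes for a node or leaf.

    Size grows with the number of records; color runs from red
    (confidence 0.0) to green (confidence 1.0). Both scales are assumed.
    """
    size = 10 + 30 * records / max_records  # diameter in pixels
    red = round(255 * (1 - confidence))
    green = round(255 * confidence)
    return {"size": size, "color": f"#{red:02x}{green:02x}00"}
```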
(Sorting of the Tree Diagram)
When the formation result of the unknown rule extracted from the data by the classification processing unit is expressed and displayed as a tree diagram, the display processing unit evaluates the significance of the nodes and leaves and sorts the tree diagram on the basis of that significance. The significance of each node and leaf is evaluated on the basis of the number of records or the confidence degree, and the tree diagram is sorted in ascending or descending order of significance so that it is easy to understand and a hidden rule can be discovered easily. Consequently, the data belonging to similar classifications can be sorted, narrowed down, and so on without verifying the conditional sentences presented as character information, the χ² test value, or the like.
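Sorting a tree by significance can be sketched as a recursive reordering of each node's children. The significance score used here (number of records multiplied by confidence) is an assumption chosen for illustration, as are the dictionary layout and the name `sort_tree`.

```python
def sort_tree(node, descending=True):
    """Return a copy of the tree with children ordered by significance.

    Assumed score: records * confidence. Each node is a dict with
    "condition", "records", "confidence", and an optional "children" list.
    """
    children = [sort_tree(child, descending) for child in node.get("children", [])]
    children.sort(key=lambda n: n["records"] * n["confidence"], reverse=descending)
    return {**node, "children": children}
```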
(Use of the Data Mining Result)
According to the invention, the data mining apparatus further has an output processing unit for converting the processing result of the classification processing unit into a format that can be used outside and outputting it.
(Inquiry of the Database)
The output processing unit converts a specific rule extracted from the result obtained by the classification processing unit into a conditional expression and outputs it to the outside. In this case, the output processing unit forms the extraction rule in an “IF ~ THEN ~” format, converts it into a data extraction language used by the database, and outputs it. The output processing unit converts the extraction rule into an inquiry conditional expression for an application that handles an SQL sentence, an LODQL sentence, an MDB command, or the like used by the database, and outputs it. Consequently, the rule formed by the decision tree or regression tree of the data mining can be supplied as a data extracting conditional sentence to a relational database, a multidimensional database, or a multimedia database, thereby enabling the data to be extracted. In the rule formation of the data mining, an unknown classifying condition discovered by the classification algorithm is displayed as a tree diagram. By presenting it to the database as a data extracting condition, data can be extracted from the database from the viewpoint of that unknown condition. Consequently, the extracted data can be used for ranking customers, selecting customers as marketing targets, and the like by using a rule based on new analysis item conditions that could not be discovered previously.
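The conversion of the IF part of a rule into an SQL inquiry can be sketched as follows. The rule representation (a list of column, operator, value triples), the function name `rule_to_sql`, and the sample `customers` table are all hypothetical; the standard-library `sqlite3` module is used only to show the generated statement running.

```python
import re
import sqlite3

ALLOWED_OPS = {"=", "<>", "<", "<=", ">", ">="}
IDENTIFIER = re.compile(r"[A-Za-z_]\w*")


def rule_to_sql(table, conditions):
    """Turn the IF part of a rule into a parameterized SELECT statement.

    conditions is a list of (column, operator, value) triples joined by AND.
    Values become placeholders; identifiers and operators are validated.
    """
    if not IDENTIFIER.fullmatch(table):
        raise ValueError(f"unsupported table name: {table}")
    clauses, params = [], []
    for column, op, value in conditions:
        if op not in ALLOWED_OPS or not IDENTIFIER.fullmatch(column):
            raise ValueError(f"unsupported condition: {column} {op}")
        clauses.append(f"{column} {op} ?")
        params.append(value)
    return f"SELECT * FROM {table} WHERE {' AND '.join(clauses)}", params


# Usage with an in-memory database and made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, age INTEGER, region TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?, ?)",
                 [("Sato", 42, "east"), ("Abe", 25, "east"), ("Ito", 51, "west")])
sql, params = rule_to_sql("customers", [("age", ">=", 30), ("region", "=", "east")])
rows = conn.execute(sql, params).fetchall()
```

Keeping the values as parameters rather than splicing them into the SQL text is a design choice of this sketch; an output unit that must emit a plain conditional sentence for another tool would need its own quoting rules.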
(Macro for the Spreadsheet)
The output processing unit converts the extraction rule into a macro module (macro) of a spreadsheet and outputs it. Accordingly, a macro is formed that functions as a filter and allows the conditional sentence in the “IF ~ THEN ~” format formed by the data mining to be used in a widely used spreadsheet product such as Microsoft Excel. The partial rule extracted from the data mining result is fed back to the macro module of the spreadsheet, so that the result of the data mining can be used as one of the tools for analyzing the database. By forming the conditional sentence that indicates a feature of the data, obtained by the classification algorithm, as a macro module for extracting data from a spreadsheet such as Excel, data can be extracted easily on a personal computer from a previously unknown viewpoint. Because the macro module can be redistributed, the unknown analysis item condition discovered by the data mining can be used as a viewpoint for information analysis when customers are selected from customer information.
(Making of the Text From the Tree Diagram)
The output processing unit converts the tree diagram obtained by the classification processing unit into drawing information that can be drawn by an external application and outputs it. By converting the drawing information of the tree diagram obtained as a result of the classification into text in this way, the condition of each branch node of the information inherently expressed as a tree diagram, the ratio of the records covered by the branching condition, the confidence degree of the condition, and the like are output as information to a file. The tree diagram can then be displayed and used in another application. The decision tree is the most general of the classification algorithms, and its result is displayed as a tree diagram. According to the invention, the tree diagram obtained as an analysis result using the decision tree algorithm is converted into drawing information that can be used by the user, so that the tree diagram can be drawn with a product of an independent software vendor (ISV) or in a form peculiar to the user. Consequently, other products incorporating a decision tree mining engine can be developed and used, which broadens the range of uses of the decision tree.
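Writing the tree out as text can be sketched as below, with one line per node carrying the branch condition, the ratio of records, and the confidence degree. The tab-separated, indentation-based layout and the name `tree_to_text` are assumptions; the text does not fix a file format.

```python
def tree_to_text(node, total=None, depth=0):
    """Flatten a tree into text lines that another application could parse.

    Each node is a dict with "condition", "records", "confidence", and an
    optional "children" list. The ratio is relative to the root's records.
    """
    total = node["records"] if total is None else total
    line = (f"{'  ' * depth}{node['condition']}"
            f"\tratio={node['records'] / total:.2f}"
            f"\tconfidence={node['confidence']:.2f}")
    lines = [line]
    for child in node.get("children", []):
        lines.extend(tree_to_text(child, total, depth + 1))
    return lines
```

The lines could then be written to a file with `"\n".join(tree_to_text(tree))`.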
(Customization of the Extraction Rule)
The output processing unit converts the rule in the “IF ~ THEN ~” format extracted from the result of the classification processing unit into a format designated by the user and outputs it. Thus, an interface function is provided that can customize even the rule in the “IF ~ THEN ~” format into the format desired by the user and display it. Since the rule in the “IF ~ THEN ~” format discovered by the data mining can be customized to the format desired by the user, the result of the data mining can be fed back and used in data management or the like in an actual business.
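Customizing the output format of a rule can be as simple as filling a user-supplied template, as in the sketch below. The template placeholders `{conditions}` and `{conclusion}`, the `joiner` parameter, and the name `format_rule` are hypothetical conventions introduced for this example.

```python
def format_rule(conditions, conclusion,
                template="IF {conditions} THEN {conclusion}", joiner=" AND "):
    """Render a rule in the default IF ~ THEN ~ form or a user-designated one."""
    return template.format(conditions=joiner.join(conditions),
                           conclusion=conclusion)
```

For instance, passing `template="{conclusion} <- {conditions}"` and `joiner=", "` would render the same rule in a logic-programming style instead of the default form.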
According to the invention, there is provided a computer-readable storage medium which stores a data mining processing program for discovering an unknown rule contained in a data group. In this case, the data mining processing program has processing steps having the same functions as those in the case of the apparatus construction.