The present invention relates to a data analyzing method and system for analyzing a collection of data expressed in terms of numeric values or symbols which are stored in an information storage unit such as a data base. More particularly, the present invention relates to a data analyzing method and system for analyzing a collection of data in a data base and processing and converting the analyzed data to obtain an expression or rule useful to users.
With the advancement of computer technology, the volume of data accumulated in computers has been increasing year by year. This increase in data volume is becoming more and more remarkable, mainly in on-line systems, as networking advances. At present, a data base of one million records, which corresponds to gigabytes (giga = 10^9) of data, is by no means rare.
Data stored in a computer are a mere collection of numerical values or symbols. In view of this point, there have been proposed techniques for converting such a collection of data into information useful to users to thereby attain an effective utilization of the data. The most widely known method is a statistical method involving correlation analysis and multiple regression analysis.
Further, as a relatively new method, there is known a method involving conversion into a rule form easily understandable to users, such as IF-THEN rules (if . . . , then . . . is . . .), that is, a method which uses a knowledge acquisition technique called rule induction. For example, on pages 23 to 31 of the Hitachi Creative Work Station 2050 (trade name) ES/TOOL/W-RI Explanation/Operation Manual there is described a method which expresses a relation present between data in the form of a rule.
The method was originally intended to create, from given data, a rule capable of being input to an expert system. However, such a method can also be applied by a user to find characteristics, such as causality and regularity, which are contained in stored data.
The above described conventional method aims at creating a rule capable of being utilized by a computer. Although it is possible for a human user to interpret the rule, the rule is not formed in a form easily understandable to the user. Thus it has been impossible to create a rule suitable for a user to interpret and thereby understand characteristics of the data used. The above described method will be explained in more detail below using various examples.
First, suppose that data is a collection of individual events. For example, in an application method of analyzing the cause of a semiconductor defect by using a quality control data base in a semiconductor manufacturing process, each individual case is managed in a manufacturing unit called a wafer, and a set of information pieces such as processing parameters in each manufacturing step or various test results can be handled as one case. FIG. 1 shows examples of such data.
In a method of checking the financial commodity purchasing trend of each customer from a customer data base kept by a bank, a set of information pieces for each customer, such as age, deposit balance, occupation, annual income and financial commodity purchasing history, is one case, and the data to be analyzed can be regarded as a collection of such cases. As to this example, a detailed explanation will be given in an embodiment of the invention which will be described later.
Reference will now be made to an example of forming a rule according to the above described conventional method. As an example, suppose that features common to customers who have bought a certain financial commodity ("commodity A" hereinafter) are to be checked. In this case, it is an object to create a rule for classifying, as accurately as possible, cases corresponding to the customers who have bought the commodity A and cases corresponding to the customers who have not bought the same commodity.
According to the foregoing conventional method, from among sets of item values (e.g. "The age is 40 or more and the deposit balance is 10,000,000 yen or more."), there is created a set which classifies given data most accurately. In this case, the term "accurately" is used in the following sense. In a subset of cases having specific values, the higher the proportion of the cases corresponding to the customers who have bought the financial commodity A, the more accurately are classified features of those customers. This set of values can be expressed in the form of a rule such as "IF the age is 40 or more AND the deposit balance is 10,000,000 yen or more, THEN purchase the financial commodity A."
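The sense of "accurately" used above can be illustrated with a minimal Python sketch. The field names (`age`, `deposit`, `bought_A`), the sample cases, and the helper `rule_accuracy` are assumptions introduced purely for illustration; they are not taken from the cited manual.

```python
from typing import Callable, Dict, List

Case = Dict[str, object]

def rule_accuracy(cases: List[Case], condition: Callable[[Case], bool]) -> float:
    """Proportion of buyers of commodity A among the cases satisfying the condition."""
    covered = [c for c in cases if condition(c)]
    if not covered:
        return 0.0  # a condition matching no case explains nothing
    buyers = sum(1 for c in covered if c["bought_A"])
    return buyers / len(covered)

# Hypothetical toy cases (not real customer data).
cases = [
    {"age": 45, "deposit": 12_000_000, "bought_A": True},
    {"age": 52, "deposit": 15_000_000, "bought_A": True},
    {"age": 41, "deposit": 11_000_000, "bought_A": False},
    {"age": 30, "deposit": 2_000_000, "bought_A": False},
]

# "The age is 40 or more and the deposit balance is 10,000,000 yen or more."
condition = lambda c: c["age"] >= 40 and c["deposit"] >= 10_000_000

# Three cases satisfy the condition and two of them bought commodity A.
accuracy = rule_accuracy(cases, condition)
```

The higher this proportion, the more accurately the set of values is said to classify the buyers.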
Next, the cases which are explained by the created rule are removed from the entire set of cases. In the above example, the cases which satisfy the condition of the age being 40 or more and the deposit balance being 10,000,000 yen or more are removed. With respect to the remaining set of cases, there is determined a set of item values which classifies most accurately. By repeating this processing it is possible to obtain a group of rules for distinguishing the customers who have bought the financial commodity A from the customers who have not bought the same commodity.
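The repeated "find the most accurate set, then remove the explained cases" procedure can be sketched as follows. This is a simplified illustration, not the implementation of the cited tool: it assumes the candidate conditions are supplied as a fixed list, whereas the conventional method searches over combinations of item values. The function name `induce_rules`, the field names, and the toy cases are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Case = Dict[str, object]
Candidate = Tuple[str, Callable[[Case], bool]]

def induce_rules(cases: List[Case], candidates: List[Candidate]) -> List[str]:
    """Repeatedly pick the most accurate candidate and remove the cases it covers."""
    remaining = list(cases)
    rules: List[str] = []
    while any(c["bought_A"] for c in remaining):
        best = None
        # The search over candidates is repeated for every generated rule.
        for name, pred in candidates:
            covered = [c for c in remaining if pred(c)]
            if not covered:
                continue
            acc = sum(1 for c in covered if c["bought_A"]) / len(covered)
            score = (acc, len(covered))  # accuracy first, coverage as tie-break
            if best is None or score > best[0]:
                best = (score, name, pred)
        if best is None or best[0][0] == 0.0:
            break  # no candidate explains any remaining buyer
        rules.append(best[1])
        remaining = [c for c in remaining if not best[2](c)]
    return rules

# Hypothetical toy cases.
cases = [
    {"age": 45, "deposit": 12_000_000, "occupation": "employee", "income": 6_000_000, "bought_A": True},
    {"age": 50, "deposit": 15_000_000, "occupation": "employee", "income": 7_000_000, "bought_A": True},
    {"age": 35, "deposit": 3_000_000, "occupation": "self-employed", "income": 9_000_000, "bought_A": True},
    {"age": 30, "deposit": 2_000_000, "occupation": "employee", "income": 4_000_000, "bought_A": False},
    {"age": 28, "deposit": 1_000_000, "occupation": "self-employed", "income": 5_000_000, "bought_A": False},
]
candidates = [
    ("age >= 40 AND deposit >= 10,000,000",
     lambda c: c["age"] >= 40 and c["deposit"] >= 10_000_000),
    ("occupation = self-employed AND income >= 8,000,000",
     lambda c: c["occupation"] == "self-employed" and c["income"] >= 8_000_000),
    ("age >= 30", lambda c: c["age"] >= 30),
]
rules = induce_rules(cases, candidates)
```

With these toy inputs the broad condition `age >= 30` is never selected, because a narrower but more accurate condition is always preferred.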
As will be seen from the above explanation, the rule group obtained by the foregoing conventional method takes the form of IF . . . ELSE IF . . . ELSE IF . . . , for example, "IF the age is 40 or more AND the deposit balance is 10,000,000 yen or more, THEN purchase the financial commodity A, ELSE IF the occupation is self-employed AND the annual income is 8,000,000 yen or more, THEN purchase the financial commodity A, ELSE IF . . . ."
When a computer performs classification by using this rule group, the processing can be executed mechanically merely by checking the rules successively from the first IF. However, the larger the number of rules, the more difficult it is for a user to understand features of the customers who have bought the financial commodity A. A further problem is that as the number of cases increases, the processing time required increases rapidly, because the processing of searching for a rule from the remaining set of cases is repeated at every generation of a rule.
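The mechanical, top-to-bottom evaluation of such a rule group can be sketched in Python as below. The rule texts follow the example conditions in the text; the field names, the sample customer, and the helper `first_matching_rule` are assumptions made for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Case = Dict[str, object]
Rule = Tuple[str, Callable[[Case], bool]]

# An IF ... ELSE IF ... rule group, checked in order.
RULES: List[Rule] = [
    ("age >= 40 AND deposit >= 10,000,000",
     lambda c: c["age"] >= 40 and c["deposit"] >= 10_000_000),
    ("occupation = self-employed AND income >= 8,000,000",
     lambda c: c["occupation"] == "self-employed" and c["income"] >= 8_000_000),
]

def first_matching_rule(case: Case, rules: List[Rule]) -> Optional[int]:
    """Return the index of the first rule whose IF portion holds, or None."""
    for index, (_, predicate) in enumerate(rules):
        if predicate(case):
            return index
    return None

# Hypothetical customer who fails the first rule and matches the second.
customer = {"age": 35, "deposit": 3_000_000,
            "occupation": "self-employed", "income": 9_000_000}
fired = first_matching_rule(customer, RULES)
```

Because evaluation stops at the first match, each ELSE IF rule implicitly carries the negation of every rule before it. A reader must keep all earlier conditions in mind to interpret a later rule, which is one way to see why a long rule group is hard for a user to understand.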
A further serious problem is that real-world data such as those in the above examples must generally be regarded as containing a very large amount of noise. That is, the decision whether or not to purchase the financial commodity A may be influenced by items not contained in the data base, so the formation of a highly accurate classification rule cannot be expected. Likewise, when analyzing the cause of a semiconductor defect as referred to above, the data contain a large amount of noise because the occurrence of a defect is influenced by factors which vary randomly. In such a case, too, it is often difficult to expect the formation of a definite rule.
To cope with the above-mentioned problems, it is effective to adopt an analyzing method which expresses rough features of data. In the foregoing conventional method, however, the values of many items are combined and a search is made for a rule as accurate in classification as possible, so that in general the number of conditions appearing in the IF portion of the rule increases while the number of cases falling under the rule decreases. Consequently, it is difficult to satisfy the purpose of understanding rough features of data.
A wide variety of information is stored in an actual data base. Items obviously having nothing to do with the purpose of the analysis are also included, such as the wafer number and the manufacturing start date (year, month, day) in the foregoing semiconductor quality control data, as well as the name and telephone number in the customer data. On the other hand, there are also information pieces which may be effective in the analysis, such as the product classification code in the semiconductor quality control data and the address in the customer data.
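The trade-off described above, in which more conditions in the IF portion mean fewer cases falling under the rule, can be seen in a small Python sketch. The cases, field names, and the helper `coverage_by_rule_length` are hypothetical and serve only to illustrate the effect.

```python
from typing import Callable, Dict, List

Case = Dict[str, object]

# Hypothetical toy cases.
cases: List[Case] = [
    {"age": 45, "deposit": 12_000_000, "occupation": "employee"},
    {"age": 52, "deposit": 15_000_000, "occupation": "self-employed"},
    {"age": 41, "deposit": 4_000_000, "occupation": "employee"},
    {"age": 38, "deposit": 11_000_000, "occupation": "employee"},
    {"age": 29, "deposit": 1_000_000, "occupation": "self-employed"},
]

# Conditions that are ANDed together one after another.
conditions: List[Callable[[Case], bool]] = [
    lambda c: c["age"] >= 40,
    lambda c: c["deposit"] >= 10_000_000,
    lambda c: c["occupation"] == "self-employed",
]

def coverage_by_rule_length(cases: List[Case],
                            conditions: List[Callable[[Case], bool]]) -> List[int]:
    """Number of cases covered when the first 1, 2, ... conditions are ANDed."""
    counts = []
    for k in range(1, len(conditions) + 1):
        active = conditions[:k]
        counts.append(sum(1 for c in cases if all(p(c) for p in active)))
    return counts

counts = coverage_by_rule_length(cases, conditions)
```

Adding a condition can never increase the number of covered cases, so a search that keeps adding conditions to gain accuracy tends toward rules that describe only a few cases.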
In making such analysis as in the above example by using data comprising many kinds of information, therefore, it is necessary for the user to designate in advance what items of data are to be used to perform classification. This work becomes more and more complicated with an increase in the kinds of information. If all of such items that might have any relation are to be included in the analysis, the number of items used increases inevitably, resulting in an increase of the processing time. Then, if the analysis is to be made efficiently, it is required to carefully select items to be used and hence the quality of analysis results obtained greatly depends on the degree of the user's know-how.
Moreover, for some items the appropriate level at which to consider the item cannot be known in advance. For example, where the commodity purchasing trend differs between districts in the customer data, the question of whether the item "address" should be considered at the prefectural level, at a district level such as Tohoku district/Kanto district, or as a two-way division into East Japan/West Japan basically cannot be answered properly until the analysis is complete. If all such cases are to be covered, it is required to repeat the analysis of the conventional method many times, thus causing the problem that the user's burden increases.
To avoid such inconvenience, there has been proposed a method wherein all of the viewpoints at various levels are added as data items. As to address, for example, the items of classification by prefectures/classification by districts/classification by east and west can be considered as data items to be analyzed. In this method of analysis, however, the processing involves a large waste because no consideration is given to the correlation between items in point of meaning.
For example, while classification by prefectures is being tried, there is inherently no need to consider the values of higher-level items such as classification by districts and classification by east and west, but the method nonetheless makes such a wasteful analysis. Further, depending on the order in which items are processed, a rule which is evidently redundant in its meaning is likely to be created, such as "IF address is Kanto district AND address is Kanagawa prefecture THEN . . . ." Where classification is to be made by a computer, such redundancy does not affect the classification accuracy, but it negatively affects the user's understanding of the features of data.
Moreover, an actual data base often contains unknown data items, called deficit values. When analysis is made by a statistical method or the like, there is no choice but to disregard such deficit values and not treat them as data. Also in the foregoing rule induction method, a data item having a deficit value does not affect the classification accuracy and so it does not appear in the classification rule created.
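The redundancy in a rule such as "IF address is Kanto district AND address is Kanagawa prefecture" can be detected once the relation between levels is made explicit. The following Python sketch assumes a small, hand-written parent table (`PARENT`) covering only a few illustrative values; the table and the helpers `ancestors` and `drop_redundant` are hypothetical and are not part of the conventional method, which has no such knowledge of the hierarchy.

```python
from typing import Dict, List, Set

# Illustrative fragment of an address hierarchy: value -> higher-level value.
PARENT: Dict[str, str] = {
    "Kanagawa prefecture": "Kanto district",
    "Miyagi prefecture": "Tohoku district",
    "Kanto district": "East Japan",
    "Tohoku district": "East Japan",
}

def ancestors(value: str) -> Set[str]:
    """All higher-level values implied by the given value."""
    result: Set[str] = set()
    while value in PARENT:
        value = PARENT[value]
        result.add(value)
    return result

def drop_redundant(values: List[str]) -> List[str]:
    """Remove conditions already implied by a more specific condition."""
    implied: Set[str] = set()
    for v in values:
        implied |= ancestors(v)
    return [v for v in values if v not in implied]

simplified = drop_redundant(["Kanto district", "Kanagawa prefecture"])
```

Here the condition on Kanto district is dropped because it follows from the condition on Kanagawa prefecture, leaving a rule that is equivalent for classification but shorter for the user to read.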
However, there sometimes is a case where the deficiency of a data value is itself significant. For example, when the item of address is a deficit value and it means the presence of an anonymous account in a bank, this fact may affect the purchase of the financial commodity A. In such a case, the rule "IF address is a deficit value THEN . . . " becomes significant. According to the conventional method it is generally impossible to create such a rule. Besides, there has been a problem that for the formation of such a rule it is necessary to perform manual processing for converting the deficit value into a specific value explicitly.
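A rule of the form "IF address is a deficit value THEN . . ." becomes expressible once the deficit value is treated as something a condition can test for. The Python sketch below represents a deficit value as `None`, which is an assumption of this example, as are the made-up cases and the helpers `is_deficit` and `purchase_rate`.

```python
from typing import Dict, List, Optional

Case = Dict[str, Optional[object]]

# Made-up toy cases; None stands for a deficit (missing) value.
cases: List[Case] = [
    {"address": None, "bought_A": True},
    {"address": None, "bought_A": True},
    {"address": "Kanagawa prefecture", "bought_A": False},
    {"address": "Miyagi prefecture", "bought_A": True},
    {"address": "Kanagawa prefecture", "bought_A": False},
]

def is_deficit(case: Case, item: str) -> bool:
    """True when the item is absent or has no recorded value."""
    return case.get(item) is None

def purchase_rate(subset: List[Case]) -> float:
    """Proportion of cases in the subset that bought commodity A."""
    if not subset:
        return 0.0
    return sum(1 for c in subset if c["bought_A"]) / len(subset)

deficit_cases = [c for c in cases if is_deficit(c, "address")]
known_cases = [c for c in cases if not is_deficit(c, "address")]

rate_deficit = purchase_rate(deficit_cases)
rate_known = purchase_rate(known_cases)
```

If cases with a deficit address were simply discarded before analysis, the difference between the two rates would never be observed.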
Moreover, according to the conventional method, since rules are formed with priority given to the classification accuracy, general rules, in the sense of rules explaining as many cases as possible, are not always created first. On the other hand, in the data analyzing method being considered, the user may interrupt the processing if the processing time is long. In such a case, if general and simple rules which are highly useful to the user are created first, it becomes possible to utilize the rules which had been created before the interrupt occurred. Such a type of utilization is not possible with the conventional method.
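The benefit of producing general and simple rules first can be illustrated with a Python generator from which the caller may stop reading at any time. This is only a sketch of the desired ordering under simplifying assumptions: the candidate rules are given in advance, all of them are scored before the first is yielded, and the interrupt is simulated with `itertools.islice`. The names and toy cases are hypothetical.

```python
from itertools import islice
from typing import Callable, Dict, Iterator, List, Tuple

Case = Dict[str, object]
Candidate = Tuple[str, int, Callable[[Case], bool]]  # (text, number of conditions, test)

def rules_most_general_first(cases: List[Case],
                             candidates: List[Candidate]) -> Iterator[Tuple[str, int]]:
    """Yield (rule text, cases covered), widest coverage and fewest conditions first."""
    scored = []
    for text, n_conditions, pred in candidates:
        covered = sum(1 for c in cases if pred(c))
        scored.append((covered, n_conditions, text))
    scored.sort(key=lambda s: (-s[0], s[1]))
    for covered, _, text in scored:
        yield text, covered

# Hypothetical toy cases.
cases = [
    {"age": 45, "deposit": 12_000_000},
    {"age": 52, "deposit": 15_000_000},
    {"age": 41, "deposit": 4_000_000},
    {"age": 38, "deposit": 9_000_000},
    {"age": 29, "deposit": 1_000_000},
]
candidates = [
    ("age >= 40 AND deposit >= 10,000,000", 2,
     lambda c: c["age"] >= 40 and c["deposit"] >= 10_000_000),
    ("deposit >= 10,000,000", 1, lambda c: c["deposit"] >= 10_000_000),
    ("age >= 40", 1, lambda c: c["age"] >= 40),
]

# Simulate a user interrupt after two rules have been produced.
available = list(islice(rules_most_general_first(cases, candidates), 2))
```

Whatever the point of interruption, the rules obtained so far are the ones covering the most cases with the fewest conditions, which is the property the conventional accuracy-first ordering lacks.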