The present invention relates to a system and method for revealing the set of necessary and sufficient conditions of the values in one of the data set fields, selected by the user as the dependent variable. With the exception of very small data sets, the necessary and sufficient conditions cannot be revealed manually, nor can known software tools reveal them. The present invention fills this gap.
Data mining, also known as knowledge discovery in databases, has become a new area for database research. A massive volume of data is currently stored in electronic format, a development that has accompanied the increased use of electronic data gathering devices. The computing power and data storage resources available today have encouraged this data storage phenomenon.
Alongside this accumulation of data, there arose a complementary need to focus on how to utilize this valuable resource. Businesses soon recognized that valuable insights could be gleaned from this data about financial risk, quality control and customer behavior. Data mining generally includes the analysis of data and the use of software techniques for finding patterns and regularities in sets of data.
The current data mining techniques are limited to the discovery of if-then rules that relate the values of the dependent variable to the values of the other fields in the database. These if-then rules present sufficient conditions (the if-condition is a sufficient condition for the then-result), but they fail to reveal necessary and sufficient conditions (which are presented by if-and-only-if rules). Revealing the set of necessary and sufficient conditions is important for a causal analysis of the data (the cause is the necessary and sufficient condition for the effect).
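The difference between a sufficient condition and a necessary and sufficient condition can be illustrated with a minimal sketch. The records, the field names (`smoker`, `over_60`, `ill`) and the helper functions below are hypothetical and serve only to illustrate the distinction; they are not taken from any known tool.

```python
# Hypothetical toy records; the field names are illustrative only.
records = [
    {"smoker": True,  "over_60": False, "ill": True},
    {"smoker": False, "over_60": True,  "ill": True},
    {"smoker": False, "over_60": False, "ill": False},
    {"smoker": True,  "over_60": True,  "ill": True},
]

def is_sufficient(cond, effect, rows):
    """If-then: every record where cond holds also shows the effect."""
    return all(effect(r) for r in rows if cond(r))

def is_necessary(cond, effect, rows):
    """Only-if: every record showing the effect also satisfies cond."""
    return all(cond(r) for r in rows if effect(r))

ill = lambda r: r["ill"]
smoker = lambda r: r["smoker"]
smoker_or_over_60 = lambda r: r["smoker"] or r["over_60"]

# "smoker" alone is sufficient but not necessary for "ill".
print(is_sufficient(smoker, ill, records), is_necessary(smoker, ill, records))
# The disjunction "smoker or over_60" is both sufficient and necessary.
print(is_sufficient(smoker_or_over_60, ill, records),
      is_necessary(smoker_or_over_60, ill, records))
```

In this toy data the if-then rule "if smoker then ill" holds, yet it misses the second record; only the disjunction of the two conditions accounts for every record in which the dependent variable's value holds.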
There is no known program for revealing necessary and sufficient conditions. There are programs for revealing if-then rules; however, as mentioned, if-then rules present sufficient conditions only, and not necessary and sufficient conditions. There are several patents for revealing if-then rules and there are several patents for machine learning tools. For example, U.S. Pat. No. 5,946,675: Apparatus for machine learning (incorporated by reference for all purposes as if fully set forth herein), and U.S. Pat. No. 5,841,947: Computer implemented machine learning method and system (incorporated by reference for all purposes as if fully set forth herein).
There are several products for revealing if-then rules, for example Wizrule and Wizwhy from Wizsoft. Both products execute highly accurate data mining using association rule methodologies.
There are several methods for revealing patterns in data and issuing predictions for new cases. These methods use various algorithms such as neural nets, decision trees and association rules. The main problems in these methods are:
(1) They do not present the patterns in the data in an easy to understand way. Neural nets do not present the patterns (the analysis output is a black box), while decision trees and association rules present so many rules that reading all of them is impractical.
(2) None of the methods for issuing predictions is 100% accurate, and therefore there is a place for additional methods that can be used as alternative or complementary approaches. The main problem of the known algorithms is called "overfitting": the algorithm reveals accidental patterns that hold in the data under analysis but fail to hold in new cases.
There is thus a widely recognized need for, and it would be highly advantageous to have, a system and method that can reveal the necessary and sufficient conditions. These conditions are useful for data analysis and machine learning tasks, because:
(1) The set of necessary and sufficient conditions summarizes the main patterns in the data and presents the causes for the dependent variable's values.
(2) The set of necessary and sufficient conditions can be used for issuing predictions for new cases. (These predictions will be more accurate than the predictions issued by other machine learning techniques.)
The present invention solves the two previously mentioned problems:
(1) the list of the necessary and sufficient conditions is easy to understand, and
(2) since the necessary and sufficient conditions cannot be accidental, the approach of the present invention does not suffer from the problem of overfitting, and as a result its predictions are more accurate.
The present invention is based on a new algorithm that cannot be derived from the known approaches.
According to the present invention there is provided a system and method for revealing the set of necessary and sufficient conditions of the values in one of the data set fields, selected by the user as the dependent variable. The process is comprised of the following steps:
(1) Revealing the if-then rules that relate the dependent variable to the other fields by one of the known association rule systems.
(2) Building a 2-dimensional table where one dimension contains the rule-conditions and the other dimension refers to the records of the data. Each cell signifies whether or not the rule-condition holds, and whether or not the dependent variable's value holds in the record.
(3) Finding in the table sets of rule-conditions that satisfy the following requirement: if at least one of the rule-conditions holds, the probability that the dependent variable's value holds is P1, and if none of the rule-conditions holds, the probability that the dependent variable's value does not hold is P2, where P1 and P2 are above pre-determined thresholds.
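Steps (2) and (3) can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the rule-conditions are assumed to have already been produced by step (1) and are represented as hypothetical predicates; the search over candidate sets is a plain exhaustive enumeration up to `max_size`, which the text does not prescribe; thresholds are treated as met when P1 and P2 are at or above them; and a set with no matching records on one side is scored 0 for that side.

```python
from itertools import combinations

def build_table(records, conditions, target):
    """Step (2): one row per record, holding the truth value of each
    rule-condition and the truth value of the dependent variable's value."""
    return [(tuple(c(r) for c in conditions), target(r)) for r in records]

def score(table, subset):
    """Return (P1, P2) for the rule-condition columns listed in `subset`.
    P1: share of records with the target among those where at least one
        condition holds.
    P2: share of records without the target among those where no
        condition holds."""
    fired = [t for flags, t in table if any(flags[i] for i in subset)]
    quiet = [t for flags, t in table if not any(flags[i] for i in subset)]
    p1 = sum(fired) / len(fired) if fired else 0.0          # assumption: empty side scores 0
    p2 = sum(not t for t in quiet) / len(quiet) if quiet else 0.0
    return p1, p2

def find_condition_sets(table, n_conditions, p1_min, p2_min, max_size=3):
    """Step (3): exhaustive search (an assumption) over small sets of
    rule-conditions whose P1 and P2 reach the given thresholds."""
    found = []
    for size in range(1, max_size + 1):
        for subset in combinations(range(n_conditions), size):
            p1, p2 = score(table, subset)
            if p1 >= p1_min and p2 >= p2_min:
                found.append((subset, p1, p2))
    return found

# Hypothetical toy data and rule-conditions, for illustration only.
records = [
    {"smoker": True,  "over_60": False, "ill": True},
    {"smoker": False, "over_60": True,  "ill": True},
    {"smoker": False, "over_60": False, "ill": False},
    {"smoker": True,  "over_60": True,  "ill": True},
]
conditions = [lambda r: r["smoker"], lambda r: r["over_60"]]
table = build_table(records, conditions, lambda r: r["ill"])
result = find_condition_sets(table, len(conditions), p1_min=0.9, p2_min=0.9)
print(result)
```

On this toy data each rule-condition alone reaches P1 = 1.0 but only P2 = 0.5, so neither passes on its own; only the set containing both conditions reaches both thresholds. Exhaustive enumeration grows quickly with the number of rule-conditions, so a practical system would likely need a more selective search.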
The present invention aims to improve data mining results, and improve prediction ability for data mining activities.