This invention relates generally to the field of computer databases, and more particularly relates to a method and apparatus for knowledge discovery from databases.
Throughout the 1980s and early 1990s, many major corporations adopted so-called “business intelligence” tools such as spreadsheets, report writers, and on-line analytical processing (“OLAP”) servers to gain a competitive advantage through better business decision-making. However, the exponential increase in information resulting from the electronic capture of data and its storage in vast data warehouses has dramatically reduced the perceived benefits of such tools. Such tools are valuable for monitoring and planning, but are unable to cope with the large volumes of data or the sophisticated analysis that is required for strategic decision-making if organizations are to achieve or maintain a competitive status.
For many types of businesses, strategic value may be derived from understanding customer behavior and being able to model customers' responses to evaluate alternative actions. The knowledge required to anticipate behavior cannot be discovered by computer users running a large number of traditional queries against data warehouses. Moreover, answering complex questions through traditional database queries is impractical, since users may not have sufficient time to complete such analyses.
“Knowledge discovery from databases” (referred to herein as “KDD”) is perceived by some to be a powerful method of enabling an organization to better understand the dynamics at work in a particular context, for example, in the context of a consumer market for a particular product or service, by automatically searching through large amounts of data for otherwise hidden patterns and relationships of events, and presenting these to the user in a readily understandable format. (An early instance of the use of the term “knowledge discovery” may be found in “Advances in Knowledge Discovery and Data Mining,” Fayyad et al., eds., MIT Press, 1996.) KDD systems may be fully automated, freeing up skilled human resources and finding answers to important questions that users might not otherwise know to ask.
Because KDD involves searching for hidden information that is commercially valuable, it is often confused with “data mining.” However, data mining is only one aspect of the KDD process. A KDD process may be broken into several phases, and may be cyclical and iterative, with the results of one phase driving requirements for further phases. Each stage is essential to ensure that knowledge is successfully extracted from data. The identified knowledge can be used to achieve a wide range of objectives, such as making predictions about new data, identifying and explaining hidden patterns and trends in existing data, and summarizing the contents of large databases to facilitate understanding.
A simple example application of knowledge discovery is to predict whether a loan should or should not be granted to a particular applicant. Such a decision can be based on the history of previous applicants who subsequently did, or did not, repay the loan extended to them. Data from these previous loans would be used to determine any statistically significant characteristics of applicants who did or did not eventually repay the loan. An algorithm may then be trained to assess these characteristics in future applicants and give an indication of how likely each applicant is to repay the loan.
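By way of illustration only, the loan example above might be sketched as follows. The source does not name a particular algorithm; this sketch substitutes a small naive Bayes estimator with Laplace smoothing, and the attribute names (`employed`, `owns_home`), the `HISTORY` records, and the function names are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical history of past loans: (applicant attributes, repaid?).
HISTORY = [
    ({"employed": "yes", "owns_home": "yes"}, True),
    ({"employed": "yes", "owns_home": "no"}, True),
    ({"employed": "yes", "owns_home": "no"}, True),
    ({"employed": "no", "owns_home": "yes"}, False),
    ({"employed": "no", "owns_home": "no"}, False),
    ({"employed": "yes", "owns_home": "no"}, False),
]


def train(history):
    """Count how often each attribute value occurs among repaid and unrepaid loans."""
    class_counts = Counter()
    feature_counts = defaultdict(Counter)  # (label, attribute) -> value counts
    for attrs, repaid in history:
        class_counts[repaid] += 1
        for name, value in attrs.items():
            feature_counts[(repaid, name)][value] += 1
    return class_counts, feature_counts


def repay_probability(model, applicant):
    """Estimate P(repaid | attributes) under the naive independence assumption."""
    class_counts, feature_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = class_counts[label] / total
        for name, value in applicant.items():
            counts = feature_counts[(label, name)]
            # Laplace smoothing; assumes each attribute has two possible values.
            score *= (counts[value] + 1) / (class_counts[label] + 2)
        scores[label] = score
    return scores[True] / (scores[True] + scores[False])


model = train(HISTORY)
likely = repay_probability(model, {"employed": "yes", "owns_home": "yes"})
unlikely = repay_probability(model, {"employed": "no", "owns_home": "no"})
```

With the six hypothetical records above, the first applicant scores higher than the second. A practical system would train on far more data and would test the significance of each characteristic before relying on it.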
An initial phase of a KDD process focuses on understanding the objectives of the process from a particular perspective. These objectives may be converted into a KDD problem definition so that a preliminary KDD plan can be designed to achieve them.
Starting with an initial plan, the user of a KDD system must identify what data is required, where it may be found, what format it is in, and what external sources of missing data are available. This stage of a KDD process provides the first insights into the data and must identify and find solutions for any data quality issues that may exist.
In a data preparation phase, data is “cleaned” and transformed to ready it for a data modeling phase. In a data modeling phase, various known techniques may be applied. Some techniques require certain forms of data, so that a reiteration of the data preparation phase may be required. The modeling phase is the process often referred to as data mining.
After the modeling phase has been completed, the results must be reviewed to confirm that the model used solves the original problem. If not, it must be determined what has been missed. At the end of this evaluation phase, a decision should be reached as to how to use the results to accomplish the identified objectives. To this end, the new knowledge gained must be deployed, i.e., organized and presented in such a way that it may be used effectively.
KDD and data warehousing are complementary concepts addressing a demand for better use of information. An existing data warehouse may provide a rich source of data for KDD, but may still need to be augmented by data sourced from operational systems and external sources.
The need for data warehousing was driven by the requirement in many organizations to better understand the data already existing in different processing systems, and to enable organizations to make better use of such existing data. The need to integrate the data held in different processing systems makes it desirable to centralize data. Data warehouse volumes for many commercial organizations now commonly exceed 100 gigabytes of information, and the number of systems over one terabyte (1,000 gigabytes) is growing rapidly. OLAP systems commonly handle ten to twenty gigabytes of information, with some handling up to 100 gigabytes.
As the volume of available data increases, the number of possible permutations of data relationships grows exponentially. The volume can become too great for users to explore and analyze, increasing the risk that important patterns and relationships may be overlooked. This is the reason that data mining techniques are being increasingly adopted for KDD.
The following table contrasts the sorts of questions that a data warehouse or OLAP tool can answer against those that KDD systems are well-suited to solve:
As the examples in Table 1 illustrate, OLAP and KDD techniques may be advantageously applied in a variety of commercial contexts, including retail organizations, banks, and many others, and in application areas such as marketing, insurance, sales, personnel, medical, fraud detection, and customer care.
KDD is perceived by many as the next step in the natural evolution of the reporting and OLAP systems deployed over the last ten or more years. KDD tools and techniques can analyze the same operational data or data warehouse data that populates an OLAP system, although KDD processes may require data preparation specific to the form of algorithm to be applied. Such data preparation may be needed on both operational and data warehouse sources.
On very large data warehouses, KDD techniques may be employed to select the information required for further OLAP analysis, as it may not be feasible to load all of the original data into an OLAP system, or even to know which information would be appropriate to achieve given objectives.
There are a number of perceived deficiencies or limitations of current OLAP, data mining, and KDD products, including: limited scalability (i.e., inability to operate on data stores over a certain size); infeasible computer processing requirements; the need for expensive data hygiene; the need for user expertise in the operation of certain systems; and minimal integration with existing data warehouse architectures.
In view of the perceived limitations or deficiencies in existing OLAP, data mining, and KDD products, the present invention relates to an improved method and apparatus for efficient extraction of meaningful knowledge from potentially very large databases.
In accordance with one aspect of the invention, a knowledge discovery (“KD”) process is defined in terms of a process plan created by a user seeking to extract knowledge from a database. The process plan comprises a plurality of separate components, with each component representing a stage in the overall KD process. Components in a process plan are connected by data links, such that the output of one component may be applied to the input of another.
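A process plan of this kind can be thought of as a directed graph of components joined by data links. The following is a minimal sketch under that assumption, not the implementation described herein; the `ProcessPlan` class, its method names, and the three example components are hypothetical. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to run each component only after the components feeding it.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class ProcessPlan:
    """Hypothetical process plan: named components joined by data links."""

    def __init__(self):
        self.components = {}  # component name -> callable
        self.links = {}       # component name -> list of upstream component names

    def add_component(self, name, func):
        self.components[name] = func
        self.links.setdefault(name, [])

    def add_link(self, source, target):
        """Feed the output of `source` into the input of `target`."""
        self.links[target].append(source)

    def run(self):
        results = {}
        # static_order() yields every component after all of its upstream components.
        for name in TopologicalSorter(self.links).static_order():
            inputs = [results[src] for src in self.links[name]]
            results[name] = self.components[name](*inputs)
        return results


plan = ProcessPlan()
plan.add_component("load", lambda: [3, None, 5, 8])
plan.add_component("clean", lambda rows: [r for r in rows if r is not None])
plan.add_component("summarize", lambda rows: sum(rows) / len(rows))
plan.add_link("load", "clean")
plan.add_link("clean", "summarize")
results = plan.run()
```

Because each component sees only the outputs of its upstream links, a component could in principle be replaced or moved to another machine without changing the rest of the plan.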
The modularization of the overall process plan and the breaking down of the plan into constituent components facilitates the creation of the process plan using a conventional graphical user interface. Further, the modularization of the process plan enables a KD process to be executed on a distributed platform possibly comprising multiple computers interconnected, for example, by a computer network. Such distribution enables systems in accordance with the present invention to take full advantage of available processing resources. Still further, the modularization of the process plan enhances the versatility of the system, since individual components for performing different data transformation and manipulation functions can be introduced into the process plan as “plug-ins.”
In accordance with another aspect of the invention, a database from which knowledge is to be extracted is first subjected to a data compression process which substantially and advantageously reduces the overall size of the database to be processed. In accordance with another aspect of the invention, the compression process preferably is of a type in which full decompression of the data is not necessary in order for individual components in the process plan to perform their designated functions.
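The compression scheme is not specified in this passage. As one illustrative possibility only, dictionary encoding has the stated property: a component can answer a query by comparing small integer codes, without expanding the column back to its original values. The function names and sample data below are hypothetical.

```python
def dictionary_encode(values):
    """Replace each distinct value with a small integer code."""
    dictionary = {}  # original value -> integer code
    codes = []
    for value in values:
        # A new value receives the next unused code; a repeated value reuses its code.
        code = dictionary.setdefault(value, len(dictionary))
        codes.append(code)
    return dictionary, codes


def count_matching(dictionary, codes, wanted):
    """Count rows equal to `wanted` by comparing codes, never decoding the column."""
    code = dictionary.get(wanted)
    if code is None:
        return 0  # the value never occurs, so no row can match
    return sum(1 for c in codes if c == code)


states = ["NY", "CA", "NY", "TX", "NY"]
dictionary, codes = dictionary_encode(states)
ny_rows = count_matching(dictionary, codes, "NY")
wa_rows = count_matching(dictionary, codes, "WA")
```

Only a single dictionary lookup touches the original values; the scan itself operates entirely on the compressed representation.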
In accordance with another aspect of the invention, data caches associated with designated links connecting components of the process plan may be provided. Such data caches enhance the overall efficiency of the system, since their availability can reduce the need to restart a process plan from the beginning in the event it is necessary or desirable to repeat or reiterate execution of one or more of the components comprising the process plan.
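The benefit of a cache on a link can be shown with a small sketch. The `CachedLink` class and the `expensive_prepare` component below are hypothetical; the sketch assumes the upstream component is deterministic, so that its stored output may safely be reused when a downstream component is repeated.

```python
class CachedLink:
    """Hypothetical data link that remembers the output of its upstream component."""

    def __init__(self, producer):
        self._producer = producer  # upstream component, called with no arguments
        self._has_value = False
        self._value = None

    def read(self):
        # Run the upstream component only on the first read; later reads reuse the cache.
        if not self._has_value:
            self._value = self._producer()
            self._has_value = True
        return self._value

    def invalidate(self):
        """Discard the cached output, for example after the source data changes."""
        self._has_value = False
        self._value = None


calls = {"prepare": 0}


def expensive_prepare():
    calls["prepare"] += 1  # track how many times the upstream stage actually ran
    return [4, 1, 7]


link = CachedLink(expensive_prepare)
first_result = max(link.read())   # first downstream run triggers preparation
second_result = min(link.read())  # repeated downstream run reuses the cached data
```

Here the downstream stage is executed twice while the preparation stage runs once, which is the saving described above for iterative plans.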
In accordance with still another aspect of the invention, a data compression process is also employed in the creation and maintenance of data caches associated with links in the process plan. This advantageously reduces the size of the caches, thereby increasing the feasibility of maintaining such caches at multiple points within the overall process plan. The overall speed and efficiency of the KD process are enhanced as additional caches become available, particularly with highly iterative and repetitive process plans.
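The size reduction can be sketched as follows. This is an illustration under stated assumptions rather than the compression process of the invention: it uses the standard-library `zlib` and `pickle` modules as stand-ins, and unlike the process described above, `zlib` requires the cached data to be fully decompressed before use. The `CompressedCache` class and the sample rows are hypothetical.

```python
import pickle
import zlib


class CompressedCache:
    """Hypothetical link cache that holds its contents in compressed form."""

    def __init__(self):
        self._blob = None

    def store(self, value):
        # Serialize the component output, then compress the resulting bytes.
        self._blob = zlib.compress(pickle.dumps(value))

    def load(self):
        if self._blob is None:
            raise KeyError("cache is empty")
        return pickle.loads(zlib.decompress(self._blob))

    def size_in_bytes(self):
        return 0 if self._blob is None else len(self._blob)


# Repetitive sample data, of the kind that tends to compress well.
rows = [("NY", i % 5) for i in range(10000)]
raw_size = len(pickle.dumps(rows))

cache = CompressedCache()
cache.store(rows)
restored = cache.load()
```

For repetitive data such as these rows, the stored blob is considerably smaller than the uncompressed serialization, which is what makes keeping caches at many links more practical.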