1. Field of the Invention
The field of the invention is data processing, that is, methods and systems for financial, business practice, business management, or cost/price determinations.
2. Description of the Related Art
A data mining tool is computer software that analyzes data and discovers relationships, patterns, knowledge, or information from the data. Data mining is also referred to as knowledge discovery. Data mining tools attempt to solve the problem of users being overwhelmed by the volume of data collected by computers operating business applications generally, and particularly those for e-commerce. Data mining tools attempt to shield users from the unwieldy body of data by analyzing it, summarizing it, or drawing conclusions from the data that the user can understand. For example, one known computer software data mining product is IBM's "Intelligent Miner," which is operable in several computing environments including AIX, AS/400, OS/390, Windows NT, Windows 2000, and Solaris. The IBM Intelligent Miner is an enterprise data mining tool, designed for client/server configurations and optimized to mine very large data sets, such as gigabyte data sets. The IBM Intelligent Miner includes a plurality of data mining techniques or tools used to analyze large databases, and provides visualization tools used to view and interpret the different mining results.
An analytic application is a software application that inputs historical data collected from a production system over time, analyzes this historical data, or samples of the historical data, and outputs the findings back to the production system to help improve its operation. For example, an e-commerce server that manages an internet shopping site is a production system, and an analytic application might use historical data collected from the e-commerce server to report on what types of users are visiting the site and how many of these are actually buying products. The term "analytic application" is used throughout this specification to mean "analytic software application," referring to a category of software typically understood to be used directly by end users to solve practical problems in their work.
Data mining is an important technology to be integrated into analytic applications. Data mining is data processing technology, combinations of hardware and software, that dynamically discovers patterns in historical data records and applies properties associated with these records (e.g., likely to buy) to production data records that exhibit similar patterns. Use of data mining typically involves steps such as identifying a business problem to be solved, selecting a mining algorithm useful to solve the business problem, defining data schema to be used as inputs and outputs to and from the mining algorithm, defining data mining models based upon the defined data schema, populating input data schema with historical data, training the data mining model based upon the historical data, and scoring historical data or production data by use of the model.
Analytic applications typically function in a general cycle in which historical data is collected from a production system over time, historical data, or samples of historical data, are analyzed, and findings are output back to the production system to help improve its operation. The quantities of data to be analyzed are large, and the computational demand is intense. The whole cycle is often executed at regular intervals, for example, once daily at night so that reports showing the analytic findings are available for review the next morning. There is an increasing demand, however, to do the analysis faster and more frequently so that the results on business performance are reported back within as little as a few hours, in some cases, as little as two or three hours, or even less. In fact, it appears that there is a trend in this area of technology to press for near real-time analytic reporting.
In prior art, however, with available data mining tools, the end user of an analytic application must be sufficiently skilled in data mining to accomplish all the tasks of data mining, some of which require substantial expertise in data mining. For applications such as e-commerce, which are being widely adopted by businesses of all sizes and in all commerce areas, it is difficult and expensive for every business using data mining to acquire substantial data mining expertise. It would be desirable and useful, therefore, for analytic applications to automate data mining so as to reduce the need for end users to have special expertise in data mining as such.
Until recently, it was impossible to automate the data mining cycle because the steps of identifying a business problem to be solved, selecting a mining algorithm useful to solve the business problem, defining data schema to be used as inputs for mining algorithms, and defining data mining models based upon the defined data schema required substantial expertise and individual human judgment brought to bear at an end user's location on an ad hoc, case-by-case basis. Recently, however, predefined data mining models have become available founded on previously identified business questions and associated data schema.
For a discussion of predefined data mining models, see the U.S. patent application Ser. No. 09/826,662 filed on Apr. 5, 2001, which is incorporated entirely by reference into this specification.
In analytic applications operating predefined data mining models, a set of business questions that are useful to end users are predefined and the data schema needed to answer these business questions are also predefined. The predefined data mining models for use in this technology are tested and shipped with a product, an analytic application, which is then production trained and applied automatically by end users without needing specialized data mining expertise.
A data mining model is usually defined to address a given business question based on a given data schema. Data mining tools such as IBM's "Intelligent Miner" are generic applications that are operated independently with respect to specific applications. Because such data mining tools in prior art did not include set business questions, predefined data schema, or predefined data mining models, end users would themselves need to analyze business questions, define data schema useful with respect to the questions, and define their own data mining models based upon the data schema. Developers of analytic applications incorporating data mining tools did not in prior art supply predefined data mining models. Without predefined data mining models, the data mining analytic cycle could not be automated.
Accordingly, in analytic applications using data mining tools, there is significant benefit in predefining data mining models whenever possible, as this will enable developers of analytic applications to develop analytic applications capable of automating data mining cycles so that end users may train and apply predefined data mining models with no need for specialized data mining expertise and with no need for end user intervention in data mining processes as such.
It is also true that in prior art, the often cyclic steps of populating data mining schema with historical data, training a data mining model by use of historical data, and scoring historical data or production data by use of the trained data mining model were steps requiring manual intervention. As a practical matter, manual intervention risks delays and missed schedules. There is a need in the art, therefore, for improved methods of data mining.
A principal aspect of the present invention is a method of automated data mining using a domain-specific analytic application for solving predefined business problems. Embodiments typically include populating input data schema, wherein said populating comprises reading input data from a data store and writing the input data to input data schema, the input data schema having a format appropriate to solution of a predefined business problem. Embodiments typically include production training a predefined data mining model to produce a trained data mining model, the predefined data mining model comprising a predefined data mining model definition.
Production training typically has as an input the input data stored in the input data schema, and an output comprising a knowledge base. Production training typically includes executing a preselected data mining algorithm in production training mode. Executing the data mining algorithm in production training mode typically includes executing a software process within the analytic application. The trained data mining model generally includes the predefined data mining model definition and the knowledge base.
Typical embodiments include production scoring input data from the input data schema. Production scoring in typical embodiments includes applying the trained data mining model by executing the data mining algorithm in production scoring mode, wherein the data mining algorithm executed in production scoring mode comprises a software process within the analytic application. Executing the data mining algorithm typically has an output comprising production scored data.
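The populating, production training, and production scoring steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: the names `populate_input_schema`, `production_train`, and `production_score`, the `InputRecord` fields, and the simple per-segment purchase-rate "algorithm" are all hypothetical. The sketch is intended only to show how a predefined model definition and a knowledge base together form a trained model, and how the same algorithm runs in a training mode and in a scoring mode.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class InputRecord:
    """One row of the (hypothetical) input data schema."""
    segment: str      # e.g. a customer segment
    purchased: bool   # historical outcome


@dataclass(frozen=True)
class ModelDefinition:
    """Predefined model definition shipped with the application (assumed shape)."""
    name: str
    key_field: str
    target_field: str


@dataclass
class TrainedModel:
    """Trained model = predefined definition + knowledge base."""
    definition: ModelDefinition
    knowledge_base: dict


def populate_input_schema(data_store):
    """Read raw rows from a data store and write them into the input schema."""
    return [InputRecord(row["segment"], bool(row["purchased"])) for row in data_store]


def production_train(definition, records):
    """Run the algorithm in training mode; the output is a knowledge base."""
    counts = defaultdict(lambda: [0, 0])  # key -> [positives, total]
    for rec in records:
        key = getattr(rec, definition.key_field)
        counts[key][0] += int(getattr(rec, definition.target_field))
        counts[key][1] += 1
    knowledge_base = {k: pos / total for k, (pos, total) in counts.items()}
    return TrainedModel(definition, knowledge_base)


def production_score(model, records, default=0.0):
    """Run the algorithm in scoring mode; the output is production scored data."""
    key_field = model.definition.key_field
    return [
        (rec, model.knowledge_base.get(getattr(rec, key_field), default))
        for rec in records
    ]


# Hypothetical historical data standing in for the data store.
data_store = [
    {"segment": "new", "purchased": 0},
    {"segment": "new", "purchased": 1},
    {"segment": "returning", "purchased": 1},
    {"segment": "returning", "purchased": 1},
]
definition = ModelDefinition("likely_to_buy", "segment", "purchased")
records = populate_input_schema(data_store)
model = production_train(definition, records)
scored = production_score(model, records)
```

In this sketch the end user supplies nothing but the data store; the definition, the algorithm, and the schema are fixed in advance, which is the property the specification relies on for automation.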
Embodiments of the present invention generally include scheduling the steps of populating input data schema, production training, and production scoring, scheduling further comprising storing in computer memory a schedule. Embodiments typically include executing the steps of populating input data schema, production training, and production scoring, said executing further comprising operating a scheduler in dependence upon the schedule.
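One way to picture a schedule stored in computer memory and a scheduler operated in dependence upon it is the sketch below, which uses the Python standard library `sched` module. The step names, the start offsets, and the `SimulatedClock` class are assumptions made for illustration; the simulated clock exists only so that the example runs instantly and deterministically, and a real deployment would use wall-clock time.

```python
import sched


class SimulatedClock:
    """Deterministic stand-in for wall-clock time (demo assumption)."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        # Advance simulated time instead of actually waiting.
        self.now += seconds


log = []

# Hypothetical step implementations; each only records that it ran.
STEPS = {
    "populate": lambda: log.append("populate"),
    "train": lambda: log.append("train"),
    "score": lambda: log.append("score"),
}

# The schedule held in memory: (start time in seconds, step name).
SCHEDULE = [(0, "populate"), (60, "train"), (120, "score")]


def run_schedule(schedule, steps, clock):
    """Operate a scheduler in dependence upon the stored schedule."""
    scheduler = sched.scheduler(clock.time, clock.sleep)
    for start, name in schedule:
        scheduler.enterabs(start, 1, steps[name])
    scheduler.run()


clock = SimulatedClock()
run_schedule(SCHEDULE, STEPS, clock)
```

Because the ordering lives in the schedule rather than in the steps themselves, changing when training or scoring runs requires editing only the stored schedule.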
Analytic applications typically include the predefined business problems to be solved, wherein the predefined business problems typically have referents defined in a specific computational domain. Analytic applications typically include predefined data mining algorithms capable of using input data read from predefined input data schema for solving the predefined business problems. Analytic applications typically include predefined data schema appropriate for solution of the predefined business problems, the predefined data schema further comprising the input data schema and output data schema. Analytic applications typically include at least one predefined data mining model definition, the predefined data mining model definition being dependent upon the predefined data schema.
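The relationship among these predefined elements can be illustrated with a small data structure sketch. All class names, field names, and the example business problem below are hypothetical; the point shown is only the dependency stated above, namely that a model definition is tied to a business problem, an algorithm, and the data schema, and is rejected if the schema cannot supply the inputs it needs.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DataSchema:
    name: str
    fields: tuple


@dataclass(frozen=True)
class ModelDefinition:
    name: str
    business_problem: str
    algorithm: str
    input_schema: DataSchema
    output_schema: DataSchema
    required_inputs: tuple

    def missing_inputs(self):
        """Inputs the definition needs that the input schema does not provide."""
        return [f for f in self.required_inputs if f not in self.input_schema.fields]


@dataclass
class AnalyticApplication:
    business_problems: tuple
    algorithms: tuple
    model_definitions: list = field(default_factory=list)

    def register(self, definition):
        """Accept a model definition only if it fits the predefined elements."""
        if definition.business_problem not in self.business_problems:
            raise ValueError(f"unknown business problem: {definition.business_problem}")
        if definition.algorithm not in self.algorithms:
            raise ValueError(f"unknown algorithm: {definition.algorithm}")
        missing = definition.missing_inputs()
        if missing:
            raise ValueError(f"input schema lacks fields: {missing}")
        self.model_definitions.append(definition)


# Hypothetical predefined content for an e-commerce analytic application.
visits = DataSchema("visit_history", ("customer_id", "pages_viewed", "purchased"))
scores = DataSchema("purchase_scores", ("customer_id", "score"))
app = AnalyticApplication(
    business_problems=("Which visitors are likely to buy?",),
    algorithms=("classification",),
)
app.register(ModelDefinition(
    name="likely_to_buy",
    business_problem="Which visitors are likely to buy?",
    algorithm="classification",
    input_schema=visits,
    output_schema=scores,
    required_inputs=("pages_viewed", "purchased"),
))
```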
Aspects of the present invention include methods, systems, and products in which important elements of data mining are automated within an analytic application. In an analytic application embodying the present invention, elements of data mining requiring specialized expertise in data mining, such as identifying a business problem to be solved, selecting a mining algorithm useful to solve the business problem, defining data schema to be used as inputs and outputs to and from the mining algorithm, and predefining data mining models, are performed by an analytic application developer. In typical embodiments of the invention, the analytic application developer identifies a set of important business problems capable of definition sufficient to support data mining solutions. The analytic application developer then selects data mining algorithms useful for solving the identified problems and defines data schema useful as inputs to the selected mining algorithms. The analytic application developer also predefines data mining models based upon the defined data schema. Because the business problems, the data schema, the data mining algorithms, and the data mining models are selected and defined prior to involvement by any end user, the business problems, data schema, mining algorithms, and data mining models are referred to in this specification as being 'preselected' and 'predefined.'
In typical embodiments of the present invention, the data mining steps of populating data schema with historical data, training a data mining model based upon the historical data, and scoring historical data or production data by use of the data mining model are carried out under automation. It is possible to carry out these steps under automation because the steps requiring intervention of human developers with special expertise, defining business problems, preselecting mining algorithms, predefining data schema, and predefining data mining models, are performed by an analytic application developer before the end user acquires the analytic application. The end user need only perform straightforward steps to install and start such an analytic application guided by such routine graphical user interface elements as mouse-clickable buttons, pull down menus, and wizards. The overall effect of the inventive method is to substantially eliminate any need for data mining expertise on the part of the end user and greatly reduce the risk of delays or missed schedules in analytic applications operations.
There are several advantages to the present inventive method. When predefined data mining models are available to end users, end users make use of their regular information technology staff to train and apply these data mining models merely by creating automated schedules, such as Unix cron table entries, with no need to train staff in mining technology and mining tools. A more specific example: the end user's systems operations staff need not even know the names and locations of data stores operating as inputs and outputs to and from the data mining tools. These reductions in the demands placed upon end users' operations staff result in significant cost savings to end users.
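As an illustration of how little the operations staff would need to write, the sketch below renders cron table entries for a nightly cycle. The command paths and the run times are hypothetical; only the crontab field order (minute, hour, day of month, month, day of week, command) is standard. Note that the commands take no arguments naming data stores, consistent with the point made above.

```python
def cron_entry(minute, hour, command):
    """Render one crontab line that runs `command` every day at hour:minute."""
    if not (0 <= minute <= 59 and 0 <= hour <= 23):
        raise ValueError("minute must be 0-59 and hour must be 0-23")
    # Fields: minute hour day-of-month month day-of-week command
    return f"{minute} {hour} * * * {command}"


# Hypothetical commands installed by the analytic application.
NIGHTLY_STEPS = [
    (0, 1, "/opt/analytics/bin/populate_schema"),
    (30, 1, "/opt/analytics/bin/train_models"),
    (0, 3, "/opt/analytics/bin/score_data"),
]

crontab = "\n".join(cron_entry(m, h, c) for m, h, c in NIGHTLY_STEPS)
```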
An additional benefit of the present invention is that a product vendor, by use of the method of the present invention, builds an e-commerce analytic application in the vendor's development shop, including capabilities of full automation for the steps that must be performed at the end user's installation. As a result, the vendor ships several data mining models ready to be used by end users straight out of the box, requiring no expertise in data mining on the part of the end user's staff. This adds significant value to the vendor's product, partly because it adds functionality to the vendor's product, but also because it reduces end users' costs.
A still further benefit of the present invention is that third-party vendors use the method of the invention to add additional data mining models to an already available analytic product. Use of the present invention increases the demand for such third-party products because adding a new model will cause no corresponding increase in end users' staff work. In some embodiments, for example, adding new data mining models is accomplished entirely on-line, through networked downloads for example, in a fashion that is completely transparent to the end user. In addition, consultants will use the inventive method to define and add new data mining models at an end user site or to the analytic product itself at a development site.
The foregoing and other objects, features and advantages of the invention will be apparent from the following more particular description of an example embodiment of the invention, as illustrated in the accompanying drawings wherein like reference numbers represent like parts of the invention.