1. Field of the Invention
The present invention relates generally to data mining of normalized databases, and more particularly to a system, method and computer program product for transforming normalized data records with multidimensional attributes into row-entity-integrity tables for use by a data-mining tool.
2. Related Art
Traditional data warehousing uses only partially normalized data models. Typical data warehousing models include star, snowflake or constellation type schema. Such schema commonly utilize a large central xe2x80x9cfact tablexe2x80x9d, and a series of look-up tables for multidimensional attributes. Detailed data for the multidimensional attributes can be aggregated at various levels, depending on the particular need. However, there is still only one level of relation in typical data warehousing models. By storing the warehoused data in this form, traditional data warehousing reduces required storage space, while minimizing difficulties associated with data mining in a normalized database containing multidimensional attributes.
The aforementioned schema work well for traditional data warehousing applications, such as those deployed in the finance, marketing and retail sectors. In these applications, the main facts (tracked attributes) rarely change. Moreover, the tracked attributes almost never change in dimension. For example, attributes such as name, address, phone, age, sex, account number, item purchased, etc. are constants within the data model. This constant nature of the data allows optimization of traditional data warehousing schema for query and analysis. The query and analysis tools are also optimized for the particular application.
The manufacturing industry, on the other hand, has not implemented general data warehousing and analysis methods because the attributes measured in a discrete or process oriented manufacturing system can, and do, vary over time and between product families. For example, manufactures of disk drives track thousands of attributes ranging from time and place of manufacture to product family to head resistance. Many of these tracked attributes are multidimensional, and many change their dimensionality over time, or even cease to exist as a tracked attribute. As new technologies are introduced, the tracked attributes change. This is known as having a xe2x80x9cslowly moving dimension.xe2x80x9d
When a tracked attribute changes, index lines into the database tables, which help in querying the database, must be reconstructed. As the quantity of data being stored in modern data warehousing systems continues to grow, this re-indexing becomes problematic. For example, manufactures of disk drives commonly manufacture many tens of thousands of disk drives each day. For each of these drives, manufacturers typically record and track thousands of attributes, many of which are multidimensional. Additionally, the technology that goes into each disk drive commonly changes significantly every nine to twelve months, and the test programs used for the drives commonly change on a weekly basis, both of which require attribute changes and thus re-indexing. Over a number of years, the amount of data and the number of attributes involved make the re-indexing task exceptionally time consuming and expensive. Thus, modem data warehousing is moving toward using more normalized data models, such as third normal form, to further reduce the use of storage space, to minimize re-indexing, and to reduce modification anomalies caused by ongoing, changes to the tracked attributes.
Data-mining tools are software-based data-analysis methods for finding interesting patterns in large volumes of data. Data mining commonly uses predefined algorithms to look for these patterns in the data. Typical predefined data-mining algorithms include clustering algorithms, correlation analysis, association, decision tree, or neural networks. Traditional data-mining tools require that data be input as a flat table (or flat file), i.e. a single two-dimensional table. Moreover, traditional data-mining tools assume that the flat table. input is a row-entity-integrity (REI) table. An REI table is one in which each row is guaranteed unique, and each row represents a distinct and separate data item. Data-mining tools make this assumption about their input in order to enable the use of simple probability analyses in their data-mining algorithms.
Some current data-mining tools provide Open DataBase Connectivity (ODBC) through which data from a database can be imported and analyzed. However, because the data tracked in many modem data warehouses exist at different levels of aggregation, the data imported by these ODBC tools cannot be provided directly to the data-mining algorithms for analysis without causing problems.
Using traditional data-mining tools with traditional data warehousing systems is difficult and time consuming if the data has multidimensional attributes. Because traditional data-mining tools cannot accept a normalized database with multi-dimensional attributes as input, a person commonly serves as the intermediary between the data warehouse and the data mining. Typically, the person generates queries to select portions of the data and output it as a flat table. This flat table is then given as input to the data-mining tool. However, because the data-mining tool makes assumptions about the input that are not true, its output must always be carefully reviewed for non-sequiturs, i.e. observations about the data which are not relevant or logical.
FIG. 1 is a block diagram illustrating an exemplary normalized database and a prior art database view for use by a data-mining tool. Referring now to FIG. 1, a normalized database 110 comprises multiple tables and a normalized data record 120. Each table in the normalized database 110 has attributes associated with it. For example, the parent table has attributes PK, A1, A2, A3, A4, and A5. Attribute PK is labeled as such because it serves as a primary key for the parent table. The primary key PK uniquely identifies each row in the parent table. Moreover, because each row in the parent table represents a distinct and separate data item, the parent table, by itself, is an REI table and is thus in a form suitable for input to standard data-mining tools.
Child table R, on the other hand, contains multidimensional attributes. The primary key for child table R is compound (or composite) because it includes both a foreign key attribute FK and a local key attribute LK. For example, if FK is a serial number of a particular disk drive, and LK is a head for the particular disk drive, then each attribute D1 through D4 is a data measure for the disk drive, aggregated at the head level. If one wished to perform data mining on child table R at the drive level (which is to say that each drive is considered the distinct and separate data item), then child table R is not an REI table and is not in a form suitable for input to standard data-mining tools.
This is further illustrated by a database view 140, which can be a standard database query, a specialized database script, or the like. A database view is generally a database script, which creates either on the fly, or permanent tables that exist in a denormalized form. These tables can contain actual values, or simply indices into the normalized database 110. When the normalized data record 120 is collected from the normalized database 110, an output data record 150 is created, but this output data record 150 is not an REI data record for the level of aggregation found in the parent table. Data values x1 through x9 (or indices to these values) are repeated unnecessarily, but more importantly, each row does not represent a distinct and separate data item. The overrepresentation of some attributes can cause the data-mining tool to find significant patterns where there are none.
Moreover, when separate but related attributes are aggregated at different levels, a database view will vary the amount of representation of a data item depending on the manner in which the database view is generated. Thus some data will be treated as more significant by the data-mining tool, even though it is not. These overrepresentation issues cause significant problems with the results of a traditional data-mining analysis.
Due to these problems, traditional data-mining analyses of data with multidimensional attributes must always be carefully reviewed. A person must determine when the data-mining tool has found a false pattern caused by the way the query to the data warehouse was generated. The person must iteratively generate new queries in an attempt to eliminate the overrepresentation problem. Typically, eighty percent of the time invested in data-mining activities is spent getting the data into the correct format for the data-mining tool. This bottleneck between data warehousing and data mining is a significant problem for those with data warehouses storing large quantities of data, and a need to find patterns contained within that data.
In addition, modern manufactures have a strong need for data mining using stratified sampling. Stratified sampling is where a probability of taking a data element from a favored population in a data set is greater than a probability of taking a data element from a non-favored m population. Stratified sampling is very useful to manufacturers, which need to model the causes of xe2x80x9cquality bubbles.xe2x80x9d For example, if a manufacturer of disk drives produces one thousand bad drives out of one hundred thousand disk drives produced in a day, this one percent error rate becomes a major problem if all one thousand bad drives are shipped to the same customer.
Some modern data-mining tools provide for stratified sampling, but in practice they do not work when the data contains multi-dimensional variables. With such data, a failed disk drive can have a row in a data table that does not show the failure on the drive. Thus, by chance alone, a failed drive can be misidentified as a non-failed drive in modern data-mining tools, even when the data-mining tool uses stratified sampling.
Therefore, what is needed is a method for transforming a normalized database into an REI table representation for use by a data-mining tool, and an integrated data-mining system, enabling linking of various types of data-mining analysis methods and stratified sampling, for direct use with normalized databases with multi-dimensional attributes and for use with standard database querying tools.
This invention can be regarded as a method for preparing a data record with multidimensional attributes for processing by a data-mining tool. The method includes the steps of identifying a normalized data record, which includes an existing attribute comprising a plurality of dimensions, and generating a plurality of new attributes corresponding to the plurality of dimensions of the existing attribute. The method further includes creating an REI data record comprising the plurality of new attributes. The method further includes associating a plurality of value instances corresponding to the dimensions of the existing attribute with the plurality of new attributes.
This invention can also be regarded as a method for transforming a subset of a normalized database with multidimensional attributes into an REI table for processing by a data-mining tool. The method includes the steps of selecting a data-item key and identifying a composite key comprising the data-item key and a local key attribute. The method further includes the steps of determining a dimensionality of the local key attribute and generating new non-key attributes based upon the dimensionality of the local key attributes and non-composite-key attributes. The method further includes creating an REI table defined by the data-item key and the new non-key attributes. The method further includes associating value instances with the REI table.
This invention can also be regarded as a data-mining system for use with a database stored in a normalized form and for use with a database-querying tool. The data-mining system includes a data-mining core, an attribute pivoter and a system manager coupled with the data-mining core and the attribute pivoter. In a preferred embodiment, the data-mining system further includes an input/output interface providing Open DataBase Connectivity. Also in a preferred embodiment, the data-mining core is designed in a modular fashion such that new data-mining tools can be plugged in without extensive modifications of the data-mining system.
Finally, in one embodiment, the system manager is designed to integrate the attribute pivoter with the data-mining core such that custom. sampling of data, for a first step of data mining, occurs before the attribute pivoter pivots the data. The system manager is further designed to provide integrated and stratified sampling such that favored data is separated and sampled prior to pivoting.
This invention can also be regarded as a data-mining computer program product for causing a processor to execute a process that pivots multidimensional attributes thereby enabling a data-mining system to analyze data stored in a normalized database having multidimensional attributes. The data-mining computer program product includes instructions for identifying a composite key comprising a data-item key and a local key attribute. The data-mining computer program product further includes instructions for generating new non-key attributes based upon a dimensionality of the local key attribute and non-composite-key attributes. The data-mining computer program product further includes instructions for creating an REI table defined by the data-item key and the new non-key attributes. Finally, the data-mining computer program product further includes instructions for associating value instances with the REI table.