Many organizations have very large databases that may contain information that is useful but not easily retrievable. Data mining is a technique that seeks to extract meaningful information from a large data set that is generally not built to accommodate easy retrieval of the desired information.
Data mining generally involves identifying data patterns and correlations between data in a given data set. Example information that data mining applications may seek include analysis of market basket purchasing patterns, seasonal purchasing patterns by customers of a large retailer, demographic relationships to airline travel patterns, and credit histories. The patterns discovered through data mining may be used to improve an organization's effectiveness or profitability. Thus, there is a growing demand for tools that provide data mining capabilities.
Data mining often involves first preparing the data for analysis. The necessary preparation depends on the sources of data to be analyzed and on the tool used for analysis. Oftentimes the data to be analyzed may be under management of different database management systems provided by different vendors. This data must be coalesced into a data set compatible with the data mining tool. While each data preparation problem has its own peculiarities, a common need exists for horizontalizing one or more vertical database tables.
The term vertical database table is often used to refer to a table that has at least three columns. The data in one of the columns includes the identifiers of the objects being described, for example, a customer or process identifier. The data in a second column identifies properties or attributes of an object, for example, marital status, age, and income. A third column contains the values of the properties or attributes. An example triplet having the three data items is, <34972225, age, 33>.
Flexibility is an advantage of a vertical database table. There is no need to know in advance the kind of properties that will be associated with an object. When a new property is defined and an associated value is known for a certain object, the triplet, <object identifier, property identifier, value>, may be stored in the vertical database table. Vertical database tables are also useful in describing, in a single structure, the properties of objects of different types, such as where different objects have different properties (not just different property values). Vertical tables generally support fast writing of data, such as may be required in logging program activities.
Horizontalization of a vertical database table involves loading data from a vertical database table into a horizontal database table. A horizontal database table has one row for each object identifier instead of the many rows per object identifier that may be found in a vertical table. A horizontal table has a column for the object identifiers, and a column for each property of an object. The data values within a property column indicate the property values for the objects represented by the rows. Horizontal tables are beneficial to data mining tools, which are generally retrieval oriented. Indeed, most data mining tools expect data to be prepared in such a horizontal format for ease of analysis of all the properties of an object. The horizontal table also supports analysis of the data to determine correlations between properties. In order to arrive at statistically significant results, the source data sets are generally very large. That is, there may be a very large quantity of data that is stored in vertical tables.