The advent of global communications networks such as the Internet has facilitated the exchange of enormous amounts of information. Additionally, the costs to store and maintain such information have declined, resulting in massive data storage structures that then need to be accessed. Enormous amounts of data can be stored in a data warehouse, which is a database that typically represents the business history of an organization. The historical data is used for analysis that supports business decisions at many levels, from strategic planning to performance evaluation of a discrete organizational unit. Data warehousing can also involve taking the data stored in a relational database and processing the data to make it a more effective tool for query and analysis. In order to manage data warehousing more efficiently at a smaller scale, the concept of a data mart is employed, in which only a targeted subset of the data is managed.
Many languages used for data definition and manipulation, such as SQL (Structured Query Language), are designed to retrieve data in two dimensions. Multidimensional data, on the other hand, can be represented by structures with more than two dimensions. These multidimensional structures are called cubes. A cube is a multidimensional database that represents data in a manner similar to a 3-D spreadsheet rather than a relational database. The cube allows different views of the data to be displayed quickly by employing concepts of dimensions and measures. Dimensions define the structure of the cube (e.g., geographical location or product type), while measures provide the quantitative values of interest to the end user (e.g., sales dollars, inventory amount, and total expenses). Cell positions in the cube are defined by the intersection of dimension members, and the measure values are aggregated to provide the values in the cells.
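By way of illustration only, the relationship between dimensions, measures, and cells can be sketched in Python as follows. The fact rows, the two dimensions (region and product), the single measure (sales dollars), and the function names build_cube and slice_total are all hypothetical and are not drawn from any particular OLAP product. The sketch merely assumes that each cell is addressed by one member from each dimension and that measure values falling in the same cell are summed.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, sales_dollars).
# "region" and "product" play the role of dimensions; "sales_dollars" is the measure.
FACTS = [
    ("East", "Bikes", 100.0),
    ("East", "Bikes", 50.0),
    ("East", "Helmets", 20.0),
    ("West", "Bikes", 70.0),
]


def build_cube(facts):
    """Aggregate the measure into cells keyed by (region, product)."""
    cube = defaultdict(float)
    for region, product, sales in facts:
        # A cell sits at the intersection of one member from each dimension.
        cube[(region, product)] += sales
    return dict(cube)


def slice_total(cube, region=None, product=None):
    """Sum the cells matching the given members; None means every member."""
    return sum(
        value
        for (r, p), value in cube.items()
        if (region is None or r == region) and (product is None or p == product)
    )


cube = build_cube(FACTS)
east_total = slice_total(cube, region="East")     # one view: all products in East
bikes_total = slice_total(cube, product="Bikes")  # another view: Bikes in all regions
```

The same aggregated cells serve both views, which is the sense in which a cube allows different views of the data to be produced without rescanning the underlying rows.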
The information in a data warehouse or a data mart can be processed using online analytical processing (OLAP). OLAP views data as cubes. OLAP enables data warehouses and data marts to be used effectively for online analysis and provides rapid responses to iterative, complex analysis queries. OLAP systems provide the speed and flexibility to support analysis in real time.
One conventional language that can facilitate OLAP for multidimensional query and analysis is MDX (Multidimensional Expressions). MDX is a syntax that supports the definition and manipulation of multidimensional objects and data, thereby making access to data from multiple dimensions easier and more intuitive. MDX is similar in many ways to the SQL syntax, but is not an extension of the SQL language. As with an SQL query, each MDX query requires a data request (the SELECT clause) and a starting point (the FROM clause), and can include a filter (the WHERE clause). These and other keywords provide the tools used to extract specific portions of data from a cube for analysis. MDX also supplies a robust set of functions for the manipulation of retrieved data, as well as the ability to extend MDX with user-defined functions.
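By way of illustration only, the SELECT, FROM, and WHERE structure of an MDX query can be sketched with a small Python helper that composes the query text. The helper name build_mdx, the cube name Sales, and the member names used below are hypothetical and are chosen solely for the example. The helper produces a string and does not connect to or query any OLAP server.

```python
def build_mdx(columns, rows, cube, slicer):
    """Compose a two-axis MDX SELECT statement as a string.

    columns, rows: lists of MDX set expressions placed on the two axes.
    cube:          name of the cube that the FROM clause starts from.
    slicer:        member expression used as the WHERE filter.
    """
    return (
        f"SELECT {{{', '.join(columns)}}} ON COLUMNS, "  # data request
        f"{{{', '.join(rows)}}} ON ROWS "
        f"FROM [{cube}] "                                # starting point
        f"WHERE ({slicer})"                              # filter
    )


# Hypothetical names: a Sales cube with a Sales Amount measure,
# a Product dimension, and a Date dimension.
query = build_mdx(
    columns=["[Measures].[Sales Amount]"],
    rows=["[Product].[Category].Members"],
    cube="Sales",
    slicer="[Date].[Calendar Year].&[2004]",
)
```

In this sketch the measure is placed on the column axis, the members of one dimension are placed on the row axis, and a single member of another dimension restricts the portion of the cube that is returned.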
Data mining is about finding interesting structures in data (e.g., patterns and rules) that can be interpreted as knowledge about the data or may be used to predict events related to the data. These structures take the form of patterns that are concise descriptions of the data set. Data mining makes the exploration and exploitation of large databases easy, convenient, and practical for those who have data but not years of training in statistics or data analysis. The “knowledge” extracted by a data mining algorithm can have many forms and many uses. It can be in the form of a set of rules, a decision tree, a regression model, or a set of associations, among many other possibilities. It may be used to produce summaries of data or to get insight into previously unknown correlations. It also may be used to predict events related to the data—for example, missing values, records for which some information is not known, and so forth. There are many different data mining techniques, most of them originating from the fields of machine learning, statistics, and database programming.
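By way of illustration only, one of the forms of extracted knowledge mentioned above, a set of associations, can be sketched in Python as follows. The transactions and the function names support and confidence are hypothetical. The sketch assumes the common support and confidence measures for a rule of the form "antecedent implies consequent" and is not intended to represent any particular data mining algorithm.

```python
# Hypothetical market-basket transactions, one set of items per record.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Fraction of transactions containing antecedent that also contain consequent."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0  # the antecedent never occurs, so the rule has no evidence
    both = support(set(antecedent) | set(consequent), transactions)
    return both / base


# Rule under test: a record containing bread also contains milk.
rule_support = support({"bread", "milk"}, TRANSACTIONS)
rule_confidence = confidence({"bread"}, {"milk"}, TRANSACTIONS)
```

A rule with sufficiently high support and confidence is a concise description of the data set and can also be used predictively, for example to suggest a likely value for an item that is missing from a record.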
What is needed is a schema that facilitates interaction of data mining operations across OLAP cubes.