The present invention relates to relational database management systems and, more specifically, to techniques for storing multidimensional data in relational database management systems.
In the context of database systems, a xe2x80x9cdimensionxe2x80x9d is a list of values that provide categories for data. A dimension acts as an index for identifying values of a variable. For example, if sales data has a separate sales figure for each month, then the data has a MONTH dimension. That is, the data is organized by month. A dimension is similar to a key in a relational database. Data that is organized by two or more dimensions is referred to as xe2x80x9cmultidimensional dataxe2x80x9d.
Any item of data within a multidimensional variable can be uniquely and completely selected by specifying one member from each of the variable""s dimensions. For example, if a sales variable is dimensioned by MONTH, PRODUCT, and MARKET, specifying xe2x80x9cJanuaryxe2x80x9d for the MONTH dimension, xe2x80x9cStereosxe2x80x9d for the PRODUCT dimension, and xe2x80x9cEastern Regionxe2x80x9d for the MARKET dimension uniquely specifies a single value of the variable. Thus, dimensions offer a concise and intuitive way of organizing and selecting data for retrieval, updating, and performing calculations.
Multidimensional data may be stored in relational database systems (xe2x80x9cROLAPxe2x80x9d systems) or in specialized, xe2x80x9cmultidimensionalxe2x80x9d database systems (xe2x80x9cMOLAPxe2x80x9d systems). Multidimensional database systems provide structures and access techniques specifically designed for multidimensional data, and therefore provide relatively efficient storage and access to multidimensional data. However, when stored in specialized multidimensional database systems, only applications that are specially built to interact with those multidimensional database systems are able to access and manipulate the data.
On the other hand, when stored in relational database systems, all applications that support interaction with relational databases have access to the data. Such database applications communicate with the relational database system by submitting commands that conform to the database language supported by the relational database system, the most common of which is the Structured Query Language (SQL).
Relational database systems store data in the form of related tables, where each table has one or more columns and zero or more rows. The conventional mechanism for storing multidimensional data in a relational database system is to store the data in tables arranged in what is referred to as a star schema. In relational database systems, a star schema is distinguished by the presence of one or more relatively large tables and several relatively smaller tables. Rather than duplicating the information contained in the smaller tables, the large tables contain references (foreign key values) to rows stored in the smaller tables. The larger tables within a star schema are referred to as xe2x80x9cfact tablesxe2x80x9d, while the smaller tables are referred to as xe2x80x9cdimension tablesxe2x80x9d. FIG. 1 illustrates an exemplary star schema with two dimensions.
Referring to FIG. 1, it illustrates a database 100 that includes tables 102, 104 and 106. Table 102 is named xe2x80x9cstorexe2x80x9d and contains information about each of the stores in which sales may occur. Each row in store table 102 contains a unique store-id and information about the particular store that corresponds to the store-id. Table 104 is named xe2x80x9cproductxe2x80x9d and contains information about each type of product that may be sold in any of the stores. Each row in product table 104 contains a unique product-id and information about the particular product.
Table 106 is named xe2x80x9csalexe2x80x9d and contains information about each sale in each of the stores represented in the store table 102. Each row in sale table 106 includes a dollar amount, a store-id to indicate the store at which the sale was made, a product-id to indicate the product sold in the sale, and the date of the sale. Typically, the number of sales will be vastly greater than both the number of stores at which the sales are made and the number of products carried by the stores. Detailed information about the store and product involved in a sale transaction does not have to be stored in the rows of table 106 because such detailed information is available in tables 102 and 104, respectively. Instead, the rows of table 106 simply contain values (store-ids and product-ids) that reference information stored in the other tables 102 and 104. Therefore, tables 102, 104 and 106 constitute a star schema in which table 106 is the fact table and tables 102 and 104 are dimension tables.
The data stored in fact table 106 only has two dimensions, and therefore fact table 106 only has two columns dedicated to storing foreign key values for those dimensions. In general, a fact table must dedicate one column for storing foreign key values for each of the dimensions associated with the multidimensional data stored in the fact table. Thus, a fact table that stores data associated with twenty dimensions would have to dedicate twenty columns to the storage of foreign key values.
Storing multidimensional data within a relational database has two significant drawbacks. First, the fact table is significantly larger than it would have to be if it only had to store the multidimensional data itself. The massive size of the fact table, relative to the dimension data itself, is largely due to the need to store a foreign key value for each dimension for each multidimensional value. Second, the rows within a conventional fact table have no particular order. Consequently, multidimensional values that are closely related to each other conceptually may be stored relatively randomly throughout the entire fact table. This leads to inefficiencies because multidimensional data that is conceptually related to each other are frequently accessed and manipulated as a group.
An alternative approach to managing multidimensional data in a relational database involves storing the data in relational files but maintaining all multidimensional structure, metadata, administration, and access control using multidimensional database system techniques. Accessing relationally-stored data using multidimensional techniques poses numerous difficulties. For example, when all administration and access to the multidimensional data are controlled exclusively through the multidimensional database system engine, two database management systems must be administered. Further, database applications that access data using conventional relational commands (e.g. SQL commands) are unable to access the multidimensional data.
The approaches described above for storing multidimensional data in relational database systems demonstrate the tradeoffs made by prior approaches, which have either (1) sacrificed the benefits of multidimensional storage to enjoy the benefits of modem relational systems, such as conventional relational access, or (2) sacrificed the benefits of relational storage to attain the efficiency of multidimensional storage.
Techniques are provided which address the problems associated with the conventional approaches for storing multidimensional data in a relational database system. According to one aspect of the invention, the many foreign key values of each row in the fact table are mapped to and replaced by a xe2x80x9creplacementxe2x80x9d value. A mapping function is provided that allows the database server to derive a replacement value from any given combination of foreign key values, and an inverse mapping function is provided to reproduce the combination of foreign key values given the replacement value.
According to another aspect of the invention, a mapping function is selected such that the foreign key value combinations of multidimensional values that are conceptually related to each other map to values that are close to each other. The rows in the fact table are then stored within the fact table in the sorted order, where the replacement value derived from the foreign key values is used as the sort key. Because the sort key value of each row generally reflects the position of the value in that row relative to the multiple dimensions, sorting the fact table based on the derived values causes values that are conceptually related to each other to be stored physically near each other within the fact table.
Various techniques are provided for generating the replacement value from the foreign key values. In general, the process involves subdividing the multidimensional cube that contains all of the multidimensional values into smaller sub-cubes that are referred to as tiles. Each tile has a particular set of coordinates within the larger cube, and each multidimensional value has a particular set of coordinates within the tile to which it belongs. All the tiles that are produced by subdividing the cube in this manner are assigned a single number, where tiles assigned closely related numbers are closely related within the dimensions of the cube. The tile number of the tile in which a particular multidimensional value resides is then stored in the row that contains that multidimensional value in the fact table, replacing the separate foreign key values that were used to derive that tile number.
Variations on the tiling mechanism are provided. According to one approach, the cube is sub-divided into tiles based on the members of a particular level of a hierarchical dimension. For example, one dimension of a multidimensional cube may be xe2x80x9cgeographic locationxe2x80x9d, where geographic location has the following levels: city, state, region, country. In this case, a particular level of the dimension may be selected, such as state, and all values within the multidimensional cube that correspond to a particular state value are considered to belong to the same tile.
According to another tiling approach, the tiles themselves may be subdivided into smaller tiles. This creates a hierarchy of tiles, where the subdividing criteria used for one level of tile may be completely different than the criteria used for subdividing the tiles at a different level.