The invention relates to generally to a method and a system for optimizing data storage in a row-oriented relational database containing data sets with multiple attributes. Specifically, the invention provides a method and a system for optimally using main memory storage space available for storing tables in a data warehouse system.
Data warehouses are database systems geared at electronically storing a business organization's data in such a way as to facilitate reporting and analysis. In order to accomplish this, data from various sources are collected and integrated into the data warehouse. Moreover, business intelligence tools are provided which enable performing comprehensive evaluations of the collected data and extracting information for supporting business decisions.
Generally, a data warehouse contains a host of distributed and differently structured data. In order to make optimal use of the available memory, the stored data is compressed, for example, by using a frequency-partitioned dictionary approach such as the one described in “Constant-Time Query Processing” by V. Raman et al., Data Engineering, ICDE 2008, IEEE 24th International Conference on Data Engineering, p. 60-69. In the course of this compression, metadata such as frequency distributions and dictionaries are evaluated and associated with the columns of the database tables. This metadata is expensive to compute and requires storage space. Thus, while data compression techniques reduce the memory space required for storing the data, they generally generate a significant amount of metadata which require memory, thus reducing the amount of memory available for storing actual data.
The problem of reduced available free memory due to metadata storage requirements is aggravated whenever the database system is distributed over a cluster of computers connected by a network. In order to achieve even load distribution in the cluster, data are generally partitioned across the cluster nodes without regard to individual values. This requires the complete metadata to be available on each cluster node. Therefore, the amount of memory needed to store the system's metadata increases linearly with the number of cluster nodes, thereby eventually limiting the scalability of the system.
In order to accelerate and facilitate the execution of queries against the database, materialized views are generated which contain all information required for executing a given query. Materialized views are calculated by (partial) denormalization of the underlying database scheme which results in multiple additional views of the underlying data, thus increasing the metadata overhead: In order to enable efficient denormalization, database systems generally use a staged approach in which join operations are performed one after the other. Thus, whenever denormalization requires multiple joins, multiple intermediate joined tables (corresponding to the various join levels) result. These intermediate joined tables need to be retained as additional materialized views for various purposes and, as a consequence, require their own metadata.
In order to enable fast query execution, materialized views containing data for executing the queries have to be present in main memory where storage space is very limited. Other views/tables not immediately needed may be stored on disk where storage space is limited, but less scarce than in main memory.