Databases have evolved from simple file systems to massive collections of data serving a variety of users and numerous distinct applications. Online database systems may be configured to receive an ongoing stream of new and updated information, to update their records accordingly, and to allow for relatively small queries of the up-to-date data in real time. A data warehouse is typically a storage facility for large amounts of data that allows for more in-depth analysis of the data, for example by using derived and/or aggregated information based on the data to identify groups of records within the database that share common characteristics.
Some data warehouses are used for basic data that is relatively static over time. In such data warehouses, explicitly calculating additional data values that are derived and/or aggregated from the basic data, and dedicating disk space to the storage of those additional data values, may be justified if the data values are frequently used for responding to queries, and especially if the overall amount of data is not very large. However, when a data warehouse stores data that is frequently updated, and when queries to the data warehouse demand very up-to-date information, it is not always feasible to pre-calculate and store the derived and aggregated data. Instead, virtual tables, known as views, may be used to store formulas and instructions for calculating derived and aggregated data values from the basic data. When a query comes in, the view may be calculated on the fly in volatile memory using up-to-date data, and the view's data values may then be made available for responding to the query.
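By way of illustration only, the following minimal sketch shows a view whose aggregated values are calculated at query time rather than stored. It assumes Python's standard sqlite3 module with an in-memory database; the table name sales, the view name sales_by_region, and the helper functions are hypothetical and are not taken from any particular data warehouse.

```python
import sqlite3


def build_demo():
    """Create a hypothetical base table and a view defined over it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [("east", 100.0), ("east", 50.0), ("west", 70.0)],
    )
    # The view stores only the defining query, not the aggregated rows.
    conn.execute(
        "CREATE VIEW sales_by_region AS "
        "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
    )
    return conn


def totals(conn):
    """Evaluate the view now, against whatever the base table currently holds."""
    rows = conn.execute(
        "SELECT region, total FROM sales_by_region ORDER BY region"
    ).fetchall()
    return dict(rows)


if __name__ == "__main__":
    conn = build_demo()
    print(totals(conn))
    # An update to the basic data is reflected the next time the view is read.
    conn.execute("INSERT INTO sales VALUES ('west', 30.0)")
    print(totals(conn))
```

In this sketch no disk space is dedicated to the aggregated totals, and each read of the view repeats the aggregation over the base table, which is the cost that grows as the underlying data grows.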
Although the use of views and other types of virtual tables reduces the need for disk storage space and allows for frequent updating of the data, when the data warehouse is used for storing massive amounts of data, calculating and using the views, which will also be huge, can be very taxing on the system's memory and processing resources. Furthermore, accessing and manipulating the massive views in memory in response to a query can take a prohibitively long time.
These deficiencies are especially relevant for views that provide a comprehensive offering of basic and derived data from a massive data warehouse, because such views are often generated, at least in part, by the execution of “join” operations between tables, either actual or virtual, that individually provide access to smaller portions of the data. Join operations are very expensive with respect to system memory and processing resources. In addition, the resulting view can be much larger than either of the tables being joined, since it is drawn from the Cartesian product of the original tables and, in the worst case, contains a record for every pairing of their records. Because of this, attempts have been made to design database optimization engines that automatically eliminate join operations when possible. However, improper elimination of “inner join” operations may produce incomplete or incorrect results. For any type of join, when the tables being joined comprise tens of millions of records, each of which may comprise thousands of attributes, managing the temporary storage, access, and manipulation of the data in memory becomes extremely cumbersome. Complex queries submitted to the system may take many hours, or even more than a full day, to complete, tying up system resources for the duration. Furthermore, once such a view is generated, reading the records of the view is itself a significant processing operation.
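By way of illustration only, the following minimal sketch shows how the size of a join result relates to the Cartesian product of the joined tables, and how dropping an inner join can change a query's answer. It assumes Python's standard sqlite3 module; the customers and orders tables, their contents, and the helper functions are hypothetical. One order deliberately refers to a customer that does not exist.

```python
import sqlite3


def build_demo():
    """Create two small hypothetical tables to be joined."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (cust_id INTEGER, segment TEXT)")
    conn.execute("CREATE TABLE orders (order_id INTEGER, cust_id INTEGER)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "retail"), (2, "retail"), (3, "wholesale")],
    )
    # Order 13 refers to customer 99, which has no matching customer record.
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [(10, 1), (11, 1), (12, 2), (13, 99)],
    )
    return conn


def count(conn, sql):
    """Return the single integer produced by a COUNT(*) query."""
    return conn.execute(sql).fetchone()[0]


CROSS = "SELECT COUNT(*) FROM orders CROSS JOIN customers"
INNER = (
    "SELECT COUNT(*) FROM orders o "
    "JOIN customers c ON o.cust_id = c.cust_id"
)
NO_JOIN = "SELECT COUNT(*) FROM orders"  # the same query with the join removed

if __name__ == "__main__":
    conn = build_demo()
    print(count(conn, CROSS))    # 12: every pairing of 4 orders with 3 customers
    print(count(conn, INNER))    # 3: only orders with a matching customer
    print(count(conn, NO_JOIN))  # 4: includes the unmatched order
```

In this sketch the inner join filters out the unmatched order, so removing the join changes the count from 3 to 4. An optimizer could safely eliminate such a join only if every order were guaranteed to match exactly one customer, which is not the case for this hypothetical data.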