1. Field of Invention
This invention relates to OLAP (OnLine Analytical Processing) and Data Warehouse applications, specifically to data structures and algorithms providing better performance and efficiency for these applications by providing a very effective storage and retrieval mechanism for multidimensional data.
2. Description of Prior Art
Data Warehouse and OLAP applications have highlighted the need for a fast way to store and retrieve multidimensional data. To date other attempted solutions have not met this need.
To illustrate the prior art and how it differs from this invention consider sales data dimensioned by zip code, day, and product. Sales data is stored for 50,000 zip codes, 1000 days, and 10,000 products. The sparsity factor is 1% on average so that of the 500,000,000,000 records that would result from all possible combinations of zip code, product, and day only about 5,000,000,000 records actually have associated data and need to be stored in the fact table. In addition to the keys, the record format contains a field for sales so that the record format is represented by FIG. 2A and FIG. 2B below.
Even with a very efficient table and index design, the space required to store only the records with actual data is at least 5,000,000,000.times.(2 bytes+2 bytes+2 bytes+4 bytes+4 bytes) or approximately 65.19 gigabytes.
Couple this with the fact that users of the data may want to query the data by any combination of zip code, product, and/or day. They may want the total sales dollars for a given product and day. Or they may want to know the average sales volume per month in a certain zip code for the past 12 months. Any combination of key fields along with any possible values of those key fields may be chosen. The data truly must be multidimensional with efficient and symmetric query access paths.
There is not a good way to store and retrieve this data with today's database technology. Consider the alternatives:
1) Store the data in a heap table (as is commonly done in relational databases) and use some combination of one or more indexes.
The type of index does not matter. They could be traditional B-trees, more sophisticated star-join indexes, bit maps, and/or hashing routines. No matter how efficient the index is, all table pages that contain data satisfying the query must be accessed at least once in order to produce a query result set. If the data for a zip code, for example, is spread thinly over many table pages as opposed to clustered or grouped together in fewer pages, query performance is degraded by the additional I/O that is necessary to retrieve the extra table pages.
In addition, as many developers and DBAs know from experience, the problem can be much more severe than this. A combination of poor optimizer access plans and the use of dense indexes with low data cardinality often causes more harm than good. The optimizer plan can access a given table page many times during the life of the query in lieu of extracting all rows satisfying the query during a single pass of the table page. This might be true in the business example above if the data in the table were sorted by product and all the data for a group of days for all products were queried. The problem literally can cause a query to run for days. Ironically, via indexes, the query can end up requiring much more time than a single full table scan would have required in the first place.
The use of indexes that are managed as heaps in combination with tables that are managed as heaps makes it difficult to combine indexed access with parallel query processing. Since records that have equal key field values can randomly be stored in the heap, it is difficult to partition the table and index data into component pairs that can be run independently.
As if the other problems were not enough, the solution also wastes too much space. Data must be stored in indexes as well as tables. Albeit, some indexes are more frugal in their use of space than others. Any duplication of data, however, is magnified by the fact that tables of this type can range into several terabytes in size.
While this solution is excellent for OLTP applications, it is clearly less than optimal for OLAP applications.
2) Take advantage of modem hardware technology and store the data in a non-indexed table and allow parallel processors and high capacity I/O subsystems to do the work.
Obviously, this is too inefficient. No matter what degree of parallelism is used, queries in the example above should not have to access 65 gigabytes of data to retrieve a 10 kilobyte result set.
If 10's or 100's of users are running these type queries on one or more tables, database server resources can be quickly depleted.
This solution causes queries to run too long and uses too many resources. The data should be organized in such a way that data accessed together is stored together. Then queries can perform I/O only for the table pages that are needed to satisfy the query.
3) Store the data in a cube-like data structure as depicted in FIG. 1 below. Many niche OLAP products use a similar structure.
If the data were not sparse, this would work pretty well. It would provide efficient and fast access to the data and, the query access would be fairly symmetrical. But, with the sparse sales data in the example above, approximately 6.5 terabytes of data would be required to store a cube large enough to house the data. This would be 100 times larger and 100 times slower in terms of I/O to access than the original data. In addition, a more subtle problem emerges. If the order in which the data is stored in the cube is by zip code then product and finally time, access may not be so symmetric. Access for the outer-most key fields is fine but, restricting the data by choosing a small subset of days from the possible choice of 1000 days may not yield improvement over a full table scan in query performance. If the data is stored in a database with 16 kilobyte table pages, then to access all the sales data for one day, all data in the entire fact table would have to be scanned because an occurrence of that particular day would exist on nearly every table page. This is in contrast to the fact that the data for one day should constitute about 1/1000th of the size of the complete fact table.
This solution is not realistic for sparsely populated OLAP fact tables. Attempts have been made to pack sparse fact tables into smaller cubes but, this further limits the symmetric or multidimensional capabilities of the cube.
4) Use clustering and/or partitioning to "divide and conquer" the data.
This method works pretty well as long as the data is clustered or partitioned in the same way that it is accessed. But, query performance is not symmetric with this solution. If the data is accessed in a way that is different, performance can be as bad or even worse than storing the data randomly or in a heap like structure. This is true since clustering or partitioning the data by one set of keys, distributes or disperses data with the same key values that are not part of the cluster or partition or are not in the prefix of the clustering or partitioning key.
Multiple partitions and clusters can be constructed for the same data. This improves query performance since it provides the clustering or partitioning advantage for a wider range of queries. However, each new cluster or partition requires an additional copy of the complete table. Of course this is very expensive in terms of disk space. And in order to make this approach completely general so that queries involving any combinations of key fields from the composite key of the table can take advantage of the technique, an exponential number of clusters or partitions with respect to the number of fields in the key must be built. To be completely effective, each of these clusters must replicate the data.
For generalized, symmetric OLAP access, this method is less than ideal.
5) Use a special approach just for OLAP or GIS/spatial data that combines the best part of the previously mentioned methods.
The most successful method of this type to date has been the grid file.
The grid file actually works very well in terms of query efficiency since it clusters data based on all key fields in the composite key of an OLAP table. Therefore, no matter which combination of key fields and values within those key fields are used to qualify a query, the grid can be used to narrow the number of table pages that must be accessed in order to retrieve the query set. In addition, grid file indexes or scales are usually designed small enough so that they can be stored entirely in RAM memory to further aid query speed.
Never the less, grid files are not without problems. The way in which data is divided leaves large holes in the storage space thus wasting space and slowing queries down by requiring more I/O.
To compound this, grid files also usually store the actual table pages in a heap like fashion. This makes the combination of grid files with parallel processing more difficult to achieve.
This method also leaves much to be desired but, there is a better way.