The present invention relates generally to methods for loading data into data structures, which include aggregated data at multiple levels, and more particularly to a method for loading data into a cube forest data structure, which includes aggregated data at multiple levels, that enables quick searching for particular aggregates and the compact storage of this data.
Senior corporate and government executives must gather and present data before making decisions about the future of their enterprises. The data at their disposal is too vast to understand in its raw form, so they must consider it in summarized form, e.g., the trend of sales of a particular brand over the last few time periods. "Decision support" software to help them is often optimized for read-only complex queries.
In general, there are four main approaches to decision support:
1. Virtual memory data structures made up of two levels in which the bottom level is a one- or two-dimensional data structure, e.g., a time series array or a spreadsheet-style two-dimensional matrix (time against accounts). This can be generalized so that the bottom level index can have many dimensions. These are the dense dimensions (the ones for which all possible combinations exist). The sparse dimensions are at a higher level in the form of a sparse matrix or a tree. Queries on this structure that specify values for all the sparse dimensions work quite well. Other queries perform less well.
2. Bit map-based approaches in which a selection on any attribute results in a bit vector. Multiple conjunctive selections (e.g., on product, date, and location) result in multiple bit vectors which are bitwise AND'd. The resulting vector is used to select out values to be aggregated. Bit vector ideas can be extended across tables by using join indexes.
3. One company has implemented specially encoded multiway join indexes meant to support star schemas (see, e.g., their website at http://www.redbrick.com/rbs/whitepapers/star\_wp.html). These schemas have a single large table (e.g., the sales table in our running example) joined to many other tables through foreign key joins (to location, time, product type and so on in our example). This technique makes heavy use of such STARindexes to identify the rows of the atomic table (sales table) applicable to a query. Aggregates can then be calculated by retrieving the pages with the found rows and scanning them to produce the aggregates.
4. Another technique is known as materialized views for star schemas. In this technique a framework is used for choosing good aggregate views to materialize. Each aggregate view corresponds to a node in our template trees. The cost model in this technique disregards the possibility of indexes in measuring query cost. Further, this technique applies directly to current relational systems (or at least current relational systems that support materialized views). The algorithm used by this technique is a simple competitive greedy algorithm for optimizing the views to materialize (subject to the cost model), with guaranteed good properties. It proceeds as follows: suppose that S is a set of views that might be queried. Now consider various views to materialize (including views outside S). Materialize the view that gives maximum benefit according to the author's cost model. While the author's cost model does not discuss indexes, recent work by others gives an algorithm that automatically selects the appropriate summary tables and indexes to build. The problem is NP-complete, so heuristics are provided, which approximate the optimal solution extremely closely. Measured by space use or update time, however, this summary table and index approach is relatively inefficient.
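The greedy selection described in approach 4 can be illustrated with a minimal sketch. The cost model used here is an assumption made for illustration only (a query costs as many rows as the smallest materialized view that contains all of its dimensions), and the view sizes, dimension names, and the function names query_cost and greedy_select are hypothetical; the sketch is not the cited authors' implementation.

```python
# Hypothetical view sizes (row counts); a view is the set of dimensions it groups by.
SIZE = {
    frozenset({"product", "date", "location"}): 1000,  # base (atomic) view
    frozenset({"product", "date"}): 600,
    frozenset({"product", "location"}): 500,
    frozenset({"date", "location"}): 400,
    frozenset({"product"}): 50,
    frozenset({"date"}): 20,
    frozenset({"location"}): 10,
}
BASE = frozenset({"product", "date", "location"})


def query_cost(query, materialized, size):
    """Assumed cost model: rows scanned in the smallest materialized view
    whose dimensions include every dimension of the query."""
    return min(size[v] for v in materialized if query <= v)


def greedy_select(queried, candidates, size, base, k):
    """Pick up to k views, each time materializing the candidate that gives
    the largest total cost reduction over the queried set S."""
    materialized = {base}  # the base view is always available
    chosen = []
    for _ in range(k):
        best, best_benefit = None, 0
        for view in candidates:
            if view in materialized:
                continue
            benefit = sum(
                max(0, query_cost(q, materialized, size) - size[view])
                for q in queried
                if q <= view  # only queries this view can answer
            )
            if benefit > best_benefit:
                best, best_benefit = view, benefit
        if best is None:  # no remaining view reduces cost
            break
        materialized.add(best)
        chosen.append(best)
    return chosen


# S contains only single-dimension views; candidates include views outside S.
S = [frozenset({"product"}), frozenset({"date"}), frozenset({"location"})]
CANDIDATES = [v for v in SIZE if v != BASE]
chosen = greedy_select(S, CANDIDATES, SIZE, BASE, k=2)
print([sorted(v) for v in chosen])
```

With these hypothetical sizes, the first view chosen is the (date, location) view, which is not itself in S but lowers the cost of two queried views at once. This shows why views outside S are worth considering.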
Yet another technique is massive parallelism on specialized processors and/or networks. Teradata® and Tandem® use this approach, making use of techniques for parallelizing the various database operators such as selects, joins, and aggregates over horizontally partitioned data. MasPar, by contrast, uses a SIMD model and specialized query processing methods.
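For concreteness, the bit map-based approach (approach 2 above) can be sketched as follows. This is a minimal illustration and not any vendor's implementation: the sales rows, column names, and the helper names build_bitmap_index and aggregate are hypothetical, and each bit vector is represented as a Python integer in which bit i corresponds to row i.

```python
# Hypothetical sales rows; column names and values are illustrative only.
ROWS = [
    {"product": "cola", "date": "1997-01", "location": "NY", "amount": 10},
    {"product": "cola", "date": "1997-01", "location": "LA", "amount": 7},
    {"product": "chips", "date": "1997-01", "location": "NY", "amount": 4},
    {"product": "cola", "date": "1997-02", "location": "NY", "amount": 12},
    {"product": "cola", "date": "1997-01", "location": "NY", "amount": 3},
]


def build_bitmap_index(rows, attribute):
    """Map each value of `attribute` to an int whose bit i is set
    exactly when row i holds that value."""
    index = {}
    for i, row in enumerate(rows):
        value = row[attribute]
        index[value] = index.get(value, 0) | (1 << i)
    return index


def aggregate(rows, indexes, selections, measure):
    """Bitwise-AND the bit vectors of all conjunctive selections, then
    sum `measure` over the rows whose bit survives."""
    result = (1 << len(rows)) - 1  # start with every row selected
    for attribute, value in selections.items():
        result &= indexes[attribute].get(value, 0)  # unknown value selects nothing
    return sum(row[measure] for i, row in enumerate(rows) if (result >> i) & 1)


INDEXES = {a: build_bitmap_index(ROWS, a) for a in ("product", "date", "location")}
total = aggregate(
    ROWS,
    INDEXES,
    {"product": "cola", "date": "1997-01", "location": "NY"},
    "amount",
)
print(total)  # 13: rows 0 and 4 match all three selections
```

Each selection costs one index lookup and one AND, so the rows are touched only in the final aggregation step.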
None of these techniques provides the efficiency and query aggregation speed desired by users as data size continues to grow and search engines become more and more sophisticated.
The present invention is therefore directed to the problem of developing a method for structuring data that enables the data to be stored in a storage medium in a compact form, yet permits rapid queries and aggregation of data.