The present invention relates generally to methods for querying data structures, which include aggregated data at multiple levels, and more particularly to a method for querying a cube forest data structure, which includes aggregated data at multiple levels, that enables quick searching for particular aggregates and the compact storage of this data.
High corporate and government executives must gather and present data before making decisions about the future of their enterprises. The data at their disposal is too vast to understand in its raw form, so they must consider it in summarized form, e.g., the trend of sales of a particular brand over the last few time periods. xe2x80x9cDecision supportxe2x80x9d software to help them is often optimized for read-only complex queries.
In general, there are four main approaches to decision support:
1. Virtual memory data structures made up of two levels in which the bottom level is a one- or two-dimensional data structure, e.g., a time series array or a spreadsheet-style two dimensional matrix (time against accounts). This can be generalized so that the bottom level index can have many dimensions. These are the dense dimensions (the ones for which all possible combinations exist). The sparse dimensions are at a higher level in the form of a sparse matrix or a tree. Queries on this structure that specify values for all the sparse dimensions work quite well. Others work less optimally.
2. Bit map-based approaches in which a selection on any attribute results in a bit vector. Multiple conjunctive selections (e.g., on product, date, and location) result in multiple bit vectors which are bitwise AND""d. The resulting vector is used to select out values to be aggregated. Bit vector ideas can be extended across tables by using join indexes.
3. One company has implemented specially encoded multiway join indexes meant to support star schemas (see, e.g., their website at http://www.redbrick.com/rbs/whitepapers/star _wp.html). These schemas have a single large table (e.g., the sales table in our running example) joined to many other tables through foreign key join (to location, time, product type and so on in our example). This technique makes heavy use of such STAR indexes to identify the rows of the atomic table (sales table) applicable to a query. Aggregates can then be calculated by retrieving the pages with the found rows and scanning them to produce the aggregates.
4. Another technique is known as materialized views for star schemas. In this technique a framework is used for choosing good aggregate views to materialize. Each aggregate view corresponds to a node in our template trees. The cost model in this technique disregards the possibility of indexes in measuring query cost. Further, this technique applies directly to current relational systems (or at least current relational systems that support materialized views). The algorithm used by this technique is a simple competitive greedy algorithm for optimizing the views to materialize (subject to the cost model), with guaranteed good properties. It goes like this: suppose that S is a set of views that might be queried. Now consider various views to materialize (including views outside S). Materialize the view that gives maximum benefit according to the author""s cost model. While the author""s cost model does not discuss indexes, recent work by others gives an algorithm that automatically selects the appropriate summary tables and indexes to build. The problem is NP-Complete, so heuristics are provided, which approximate the optimal solution extremely closely. This approach of a summary table and index approach as measured by space use or update time is relatively inefficient.
Yet another technique is massive parallelism on specialized processors and/or networks. Teradata(copyright) and Tandem(copyright) use this approach, making use of techniques for parallelizing the various database operators such as select,joins, and aggregates over horizontally partitioned data. Masspar, by contrast, uses a SIMD model and specialized query processing methods.
None of these techniques provides the efficiency and query aggregation speed desired by users as data size continues to grow and search engines become more and more sophisticated.
The present invention is therefore directed to the problem of developing a method for structuring data that enables the data to be stored in a storage medium in a compact form, yet permits rapid queries and aggregation of data.
The present invention solves this problem by providing a data structure known as a cube forest for use in a batch-load-then-read-intensively system. As a result of the present invention, the time to execute a bit vector query is significantly improved. Hierarchically split cube forests provide a method for efficiently duplicating information, and can be optimized to reduce update and storage costs. In summary, cube forests are most appropriate for read-intensive, update-rarely-and-in-large-batches multidimensional applications in an off-the-shelf (low cost) hardware environment.
According to the present invention, a method for structuring data with i key attributes (A1, . . . , Ai) and additional value attributes for storage in a memory includes the steps of a) defining a first forest F1 as a single node labeled A1; b) constructing a subsequent forest Fn according to the substeps of (i) creating a node n; (ii) copying a previous forest Fjxe2x88x921; (iii) making each tree in the previous forest Fjxe2x88x921 a subtree of the node n; (iv) creating another copy of the previous forest Fjxe2x88x921 and (v) defining the subsequent forest Fi as a union of the previous forest Fjxe2x88x921 and a tree rooted at the node n; and c) repeating step b) i-1 times until Fi is constructed, wherein the data structure is Fi. According to the present invention, the i key attributes can either be orthogonal attributes or nonorthogonal attributes.
Furthermore, the above method of the present invention is particularly advantageous when the paths in Fi represent keys of identifying data records.
Further, according to the present invention, an index structure for storing and indexing aggregates over at least i key attributes (A1, . . . , Ai) includes a plurality of i well-ordered trees, wherein a first tree includes one template node, and a next tree in the order includes a root template node having branches to duplicates of each of the previous trees, a total number of the template nodes is equal to 2nxe2x88x921, 2nxe2x88x921 of which are leaf nodes, and a collection of trees represents a template for building a set of search structures on a data table, and an index subkey is a catenation of attributes from a template tree root to a node.
Another aspect of the present invention includes a data storage device that includes a data structure that conforms to the rules of a full cube forest over key attributes (A1, . . . , Ai) and a means for storing an aggregation of values at each node of the full cube forest, in which one aggregate value for each subkey is represented by the node and which appears in the data.
Yet another aspect of the present invention provides a method for structuring data comprising the steps of: a) organizing the data as a cube forest by creating a cube forest template for the data; b) creating an index on each tree within the cube forest template; c) for each template, choosing a path from a root of the template to be a spine of the tree, wherein the spine defines a composite index, and the index has a plurality of keys which are attributes of nodes in the spine concatenated together, whereby the spine partitions the h-split template, creating several subtrees; d) establishing a spine for each subtree; and e) continuing steps a) through d) until all template nodes are in some spine.
Another aspect of the present invention provides a method for designing a cube forest data structure for a given cube forest template F, which has a plurality of trees, said method comprising the steps of: a) designing an index on each tree of the plurality of trees within the cube forest template; b) choosing a longest root-to-leaf path in the particular tree path to be a spine of the particular tree given an hierarchically-split tree template; c) defining a composite index to be the spine; d) defining a plurality of keys of the composite index to be a plurality of attributes of a plurality of nodes in the spine concatenated together; e) partitioning the tree template using the spine to create a plurality of subtrees; f) determining a spine for each subtree; and g) repeating the steps a) through f) until all template nodes are in at least one spine.
The above method is particularly advantageous if it includes the steps of: h) defining an i-th subkey, denoted ski, to be a prefix comprised of (a1, a2, . . . , ai) for every key (a1, a2, . . . , an) that is inserted into a tree, given an index that instantiates a spine on attributes (A1, A2, . . . , An); i) associating a set of subtree pointers with subkey ski if a template node corresponding to Ai has children other than Ai+1; j) associating an aggregate value with a particular subkey ski if a node corresponding to Ai is not aggregate pruned; k) defining an effective leaf for each subkey sk=(a1, . . . , ai) to be a place in the index where information associated with said each subkey is stored, wherein said information includes at least a subtree pointer and an aggregate value; 1) building a spine index from a B-tree; and m) placing an effective leaf for a subkey sk at a highest level in the B-tree where the subkey sk is a subkey of a separator in a node, wherein an i-th separator in a B-tree node is a key that indicates which keys can be found in the ixe2x88x921-th subtree as opposed to the i-th subkey.
In addition, the above method can include the step of placing the effective leaf at a predetermined separator position whose prefix is sk if there is more than one such separator. In this case, the predetermined separator position can either be the rightmost separator, or the leftmost separator.
According to another aspect of the present invention, a method for querying a cube forest for a single aggregate includes the steps of: a) specifying the type of query by stating how far down a hierarchy a value of an attribute is specified; and b) generating a query plan by finding a path in the cube forest template that starts at a root, and includes a node for every attribute value specified in the query.
In the above method, it is particularly advantageous if the step b) of generating the query plan by finding a path further includes selecting a shortest path if there are more than one path that starts at a root and includes a node for every attribute value specified in the query. In this case, the path found in step b) represents the query plan.
In the above method, one particularly advantageous embodiment includes the step of transforming the path into a query plan that contains detailed instructions for a generic query plan executer. In this case, this transformation includes the steps of: beginning at a top of the path and walking to an end of the path; and creating, for every node, an entry that specifies what to do at that node, including [dimension, attribute, value, and operation], wherein if the attribute value is specified in the query, the instruction is [dimension, attribute, value, single], otherwise the instruction is [dimension, attribute, *, sum].
In the above method it is particularly advantageous if the step b) of executing the query plan comprises the substeps of: (i) matching a superkey of an index with a maximal segment of the query plan, such that the maximal segment starts with a first key in the index""s superkey; and (ii) summing, for each effective leaf that matches the values specified by the query plan segment, a result and returning said result, wherein if the query plan segment is a suffix of the query plan, then returning a stored aggregate, else making a recursive search.
According to yet another aspect of the present invention, a method for querying a cube forest for a group-by aggregate includes the steps of: a) specifying the type of query by stating how far down a hierarchy a value of an attribute is specified; and b) generating a query plan by finding a path in the cube forest template that starts at a root, and includes a node for every attribute value specified in the query.
It is particularly advantageous if the above method further includes the step of retrieving values for a group-by query, which can be performed by creating a query plan according to the steps for creating a query plan for a single-value retrieval, except distinguishing between xe2x80x9csinglexe2x80x9d and group-by.
It is particularly advantageous if the above method further includes the step of upon reaching a terminal effective leaf, determining which result aggregate the answer should be added to.
According to the above method of the present invention, range queries are performed according to the same steps above, except that high and low values are attached to attribute selections, and the range of the search is extended appropriately.
According to yet another aspect of the present invention, a device for querying a cube forest for a single aggregate includes an input/output device and a processor. The input/output device specifies the type of query by stating how far down a hierarchy a value of an attribute is specified. The processor is coupled to the input/output device and generates a query plan by finding a path in the cube forest template that starts at a root, and includes a node for every attribute value specified in the query. In this case, the processor generates the query plan by finding a shortest path if there are more than one path that starts at a root and includes a node for every attribute value specified in the query, wherein the path found represents the query plan. Furthermore, the processor transforms the path into a query plan that contains detailed instructions for a generic query plan executer by:a) beginning at a top of the path and walking to an end of the path; and b) creating, for every node, an entry that specifies what to do at that node, including [dimension, attribute, value, and operation], wherein if the attribute value is specified in the query, the instruction is [dimension, attribute, value, single], otherwise the instruction is [dimension, attribute, *, sum]. In addition, the processor executes the query plan by: a) matching a superkey of an index with a maximal segment of the query plan, such that the maximal segment starts with a first key in the index""s superkey; and b) summing, for each effective leaf that matches the values specified by the query plan segment, a result and returning said result, wherein if the query plan segment is a suffix of the query plan, then returning a stored aggregate, else making a recursive search.
According to another aspect of the present invention, a device for querying a cube forest for a group-by aggregate includes an input/output device specifying the type of query by stating how far down a hierarchy a value of an attribute is specified, and a processor. The processor generates a query plan by finding a path in the cube forest template that starts at a root, and includes a node for every attribute value specified in the query, by finding a shortest path, if there are more than one path that starts at a root and includes a node for every attribute value specified in the query, wherein the path found represents the query plan, and by transforming the path into a query plan that contains detailed instructions for a generic query plan executer. This is accomplished according to the steps of: (i) beginning at a top of the path and walking to an end of the path; and (ii) creating, for every node, an entry that specifies what to do at that node, including [dimension, attribute, value, and operation], wherein if the attribute value is specified in the query, the instruction is [dimension, attribute, value, single], otherwise the instruction is [dimension, attribute, *, sum]. In this case, the processor executes the query plan according to the steps of: a) matching a superkey of an index with a maximal segment of the query plan, such that the maximal segment starts with a first key in the index""s superkey; b) summing, for each effective leaf that matches the values specified by the query plan segment, a result and returning said result, wherein if the query plan segment is a suffix of the query plan, then returning a stored aggregate, else making a recursive search; c) retrieving values for a group-by query by creating a query plan according to the steps for creating a query plan for a single-value retrieval, except distinguishing between xe2x80x9csinglexe2x80x9d and group-by; and d) upon reaching a terminal effective leaf, determining which result aggregate the answer should be added to, wherein range queries are performed according to the same steps above, except that high and low values are attached to attribute selections, and the range of the search is extended appropriately.