The present system relates to Multi-store Analytics Execution Environments with Storage Constraints.
A database stores information as records, and records are stored within data pages on disk. The physical design of a database refers to the configurable physical data layout and auxiliary data structures, such as indexes and materialized views. The physical design can greatly affect database processing speed, and designs are tuned to improve query performance. The physical design tuning problem can be stated informally as: “Given a workload W and storage budget b, obtain a physical design that minimizes the cost to evaluate W and fits within b”. W is a workload that may contain query and update statements. The storage budget is the storage space allotted for the physical design. Physical designs can include secondary data structures to enable faster data access such as indexes and materialized views, and may also include data partitioning strategies that affect the physical data layout. Commercial tools to automate this process exist in major DBMS such as IBM DB2's Design Advisor] and MS SQL Index Tuning Wizard that recommend beneficial physical designs.
In “Optimizing analytic data flows for multiple execution engines”, a single data flow is optimized across multiple execution engines by utilizing their unique performance capabilities to reduce total execution time of the flow. Input data is ‘pinned’ to a store, i.e., where it currently resides, and output data is pinned to the DW as a reporting requirement. They consider data shipping and function shipping, and these decisions are affected by the availability of data and functions on each of the engines. Data movement cost is modeled as the network cost. To solve the problem they model it as a state space using a binary-valued matrix. Each row represents an operation in a data flow (in sequence) and each column represents an execution engine. A ‘1’ is used to indicate if an operation is available on an engine, and zero if the operation is not available. They enumerate all valid paths top-down through the matrix. Valid moves in creating a path are allowed straight down or diagonal, but not allowed to move through a ‘0’ cell. After enumerating all valid paths, they prune extensively to eliminate non-preferred paths, based on their desired heuristics. Heuristics include 1) prefer a certain engine (favor bit) for a flow, 2) disallow too many data movements between engines, 3) prefer engines that have more of the functions required for a data flow (considering functionality is not identical on all stores), and several other heuristics. They also require all flows to terminate in the DW. However, this approach does not provide a direct solution for our problem because it only considers optimizing for a single flow (query). Our proposed work will optimize for a workload of queries. We will leverage multiple engines by allowing opportunistic views to be moved between stores during query processing if moving the view will reduce total workload cost. Since their solution consider only a single query, it can be sub-optimal when considering multiple queries. A straightforward example of why their solution can be sub-optimal is the following. Consider optimizing a single query, and a decision is made not to move a materialized view from engine A to engine B due to the cost of movement being too high for the single query to benefit. However, if it had considered all queries in the workload, moving the view could result in great benefit to other queries, potentially outweighing the data movement cost.