Database systems increasingly rely upon parallelism to achieve high performance and large capacity. Rather than relying upon a single monolithic processor, parallel systems exploit fast and inexpensive microprocessors to achieve high cost effectiveness and improved performance. The popular shared-memory architecture of symmetric multiprocessors is relatively easy to parallelize, but cannot scale to hundreds or thousands of nodes, due to contention for the shared memory by those nodes. Shared-nothing parallel systems, on the other hand, interconnect independent processors via high-speed networks. Each processor stores a portion of the database locally on its disk. These systems can scale up to hundreds or even thousands of nodes, and are the architecture of choice for today's data warehouses that typically range from tens of terabytes to over one hundred (100) terabytes of online storage. High throughput and short response times can be achieved not only from inter-transaction parallelism, but also from intra-transaction parallelism for complex queries.
Because data is partitioned among the nodes in a shared-nothing system, and is relatively expensive to transfer between nodes, selection of the best way to partition the data becomes a critical physical database design problem. A suboptimal partitioning of the data can seriously degrade performance, particularly of complex, multi-join “business intelligence” queries common in today's data warehouses. Selecting the best way to store the data is complex, since each table can be partitioned in many different ways to benefit different queries, or even to benefit different join orders within the same query. This puts a heavy burden on database administrators, who have to make many trade-offs when trying to decide how to partition the data, based upon a wide variety of complex queries in a workload whose requirements may conflict.
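By way of illustration only, the following minimal Python sketch shows why the choice of partitioning column matters for a join. It is not taken from any particular database system: the four-node cluster, the modulo placement function, the `orders` table, and the helper names `node_for` and `rows_to_ship` are all hypothetical. The sketch counts how many rows of a table reside on a node other than the one on which a join over a given column would need them.

```python
NUM_NODES = 4  # hypothetical cluster size, chosen only for illustration


def node_for(value, num_nodes=NUM_NODES):
    """Map a partitioning-key value to a node (simple modulo placement)."""
    return value % num_nodes


def rows_to_ship(rows, partition_col, join_col, num_nodes=NUM_NODES):
    """Count rows stored on a node other than the one the join needs them on.

    A row is stored on the node chosen by its partitioning column, but a join
    on join_col needs it on the node chosen by that column's value.
    """
    return sum(
        1
        for row in rows
        if node_for(row[partition_col], num_nodes)
        != node_for(row[join_col], num_nodes)
    )


# Hypothetical orders table with two candidate partitioning columns.
orders = [{"order_id": i, "customer_id": (i * 7) % 10} for i in range(20)]

# Partitioned on the join column: every row is already where the join needs it.
colocated = rows_to_ship(orders, "customer_id", "customer_id")

# Partitioned on a different column: some rows must cross the network.
repartitioned = rows_to_ship(orders, "order_id", "customer_id")

print("rows shipped when partitioned on customer_id:", colocated)
print("rows shipped when partitioned on order_id:", repartitioned)
```

In this sketch, partitioning `orders` on `customer_id` requires no data movement for a join on `customer_id`, whereas partitioning on `order_id` forces some rows to be transferred between nodes. A different query that joins on `order_id` would favor the opposite choice, which is the kind of conflict a database administrator must weigh across an entire workload.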
Previous efforts have chosen partitions heuristically or have created a performance model separate from the optimizer. Heuristic rules, unfortunately, cannot take into account the many interdependent aspects of query performance that modern query optimizers consider.
Accordingly, the present invention recognizes a need for a tool that can be used to automate the process of partition selection.