In a relational database management system (RDBMS), an intra-query partitioning is often used to reduce query processing time. Each partition (of the queried data) is processed by a worker thread and multiple worker threads process a query concurrently. Several partition attributes, such as the number of partitions and a key range or a page range of each partition, are determined by a query optimizer at a query compilation time. For multi-join queries, when several relations are joined by one or more types of join operators, the partition attributes are predetermined by the leading relations in the join sequence based on the available statistics and an estimated predicate filtering factor that the query optimizer relies on. In many cases, when a nested loop join (NLJ) operator is used, the leading relations in a query are the smaller ones, joined by larger relations and then further joined by larger or smaller relations.
Intra-query parallelism is used to break a query into subtasks and process them in parallel using different central processing units (CPUs) or input/output (I/O) threads to reduce query response time. Partitioning a multi-table join (multi-join) query can be performed on one or more tables that are involved in a pipelined process. For example, one existing approach includes partitioning on the first table's distinct keys or on physical locations on the disk. Deciding how, when and where to partition query operations can be made at query compilation and/or optimization time before query execution.
Based on previously gathered query object statistics, estimated filtering from query predicates and available system resources, decisions such as which tables are used for partitioning and how many partitions are generated, remain unchanged during the course of query execution. However, frequently, the partitioning decisions for multi-join queries are less optimal, which creates an impediment for obtaining good query performance. There exist problems in existing approaches such as, for example, unbalanced workloads for each sub-task, which can be caused by insufficient or infrequently refreshed database statistics (refreshing database statistics can be expensive).
Other disadvantages exist, such as, for example, a smaller number of partitioned working sets than the number of available tasks to fully utilize allocated system resources, which can be caused by insufficient database statistics, infrequently refreshed database statistics and imprecise filter factor estimation at compilation time.
In reality, the available statistics are often inaccurate due to frequent data updates or the lack of timely statistics gathering. In existing approaches, gathering very detailed and targeted statistics is either not supported by RDBMS or very expensive in a production environment even if it is supported. Without accurate and targeted statistics, the estimation of a predicate filtering factor and the estimation of subsequently qualified rows after each join step may not be accurate.
Such inaccuracy leads to two problems. One problem is that a partitioned working set is uneven in size if some partitions are relatively larger than others and, as a result, the larger partitions take a much longer time to process than the smaller partitions (that is, work imbalance and sub-optimal performance). A second problem is that the actual turnout of the leading relations may have only a small number of keys, pages or records that can be used for partitioning. Specifically, the number of partitions might be smaller than the number of available worker threads.
A finer grained partitioning in such a case cannot be created. Because the larger relations are joined later, the fan out working set, as a result, can be large and skewed. A few large partitions take a long time to be processed by a subset of the worker threads, with rest of the worker threads starving due to the lack of work.
For example, FIG. 1 depicts a sequence of multi-join query, where Ri (i=1 . . . 6) can be a base relation or a relation derived from sub-query. In FIG. 1, R4 is much larger than all other relations, and R1 is used for intra-query parallelism partitioning. Due to inaccurate estimation on filtering, R1 only generates two records that are used for partitioning. The consequence of this is that R4 has only two large partitions being processed by only two worker threads, even though the query engine may have more available central processing units (CPUs) and worker threads to handle this work.