Large modern database systems often exploit parallelism in database operations. For example, a join between two tables T1 and T2 might be parallelized by apportioning some of the join operations to one execution unit and a different set of operations to a second (or Nth) execution unit. A join between T1 and T2 based on an equality predicate (e.g., T1.x1 = T2.x2) typically involves distributing portions of one or both tables to the execution units, and each execution unit then performs comparisons on the join key to find matching rows (e.g., matching based on the equality predicate for a particular dimension).
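As a rough illustration of the parallelized join described above, the following sketch distributes both tables by a hash of the join key and lets each execution unit join only its own portions. The function names (hash_distribute, local_join, parallel_join), the dictionary row format, and the use of a simple loop in place of real execution units are assumptions made for illustration; hash distribution is only one of many possible distribution methods.

```python
from collections import defaultdict

def hash_distribute(rows, key, num_units):
    """Assign each row to an execution unit by hashing its join key."""
    partitions = [[] for _ in range(num_units)]
    for row in rows:
        partitions[hash(row[key]) % num_units].append(row)
    return partitions

def local_join(t1_part, t2_part):
    """Join done by one execution unit on its own portions (T1.x1 = T2.x2)."""
    index = defaultdict(list)
    for r1 in t1_part:
        index[r1["x1"]].append(r1)
    # Probe the index with each T2 row and emit every matching pair.
    return [(r1, r2) for r2 in t2_part for r1 in index.get(r2["x2"], [])]

def parallel_join(t1, t2, num_units):
    """Distribute both tables on the join key, then join unit by unit."""
    t1_parts = hash_distribute(t1, "x1", num_units)
    t2_parts = hash_distribute(t2, "x2", num_units)
    results = []
    # Each loop iteration stands in for an independent execution unit.
    for unit in range(num_units):
        results.extend(local_join(t1_parts[unit], t2_parts[unit]))
    return results
```

Because rows with equal join keys hash to the same execution unit, no unit needs to see another unit's rows to find its matches.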
There are many possible distribution methods for determining how to apportion the tables to a number of parallel execution units, and the distribution method selected can greatly affect the performance of the parallelized join. In legacy systems, the distribution method is selected a priori during a compile phase (e.g., by a compiler or optimizer) in advance of apportioning the join operations to the execution units. Such a legacy compiler or optimizer estimates the performance of the parallelized join under several distribution methods and, using those estimates, tries to minimize the aggregate cost of performing the parallelized join by selecting the distribution method estimated to be fastest or cheapest. For example, some legacy systems perform estimations that consider the sizes of the tables to be joined, thereby avoiding unnecessary costs of distributing and scheduling.
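The compile-time selection described above can be sketched as a comparison of estimated costs. The cost formulas below are deliberately simplified assumptions (rows shipped between execution units is the only cost considered), and the method names "broadcast" and "hash-hash" as well as the functions estimate_costs and choose_distribution are hypothetical; an actual optimizer would weigh many more factors.

```python
def estimate_costs(est_rows_t1, est_rows_t2, num_units):
    """Estimate rows shipped under two candidate distribution methods.

    Assumed cost model: broadcasting copies the smaller table to every
    execution unit, while hash-hash distribution ships each row of both
    tables exactly once.
    """
    smaller = min(est_rows_t1, est_rows_t2)
    return {
        "broadcast": smaller * num_units,
        "hash-hash": est_rows_t1 + est_rows_t2,
    }

def choose_distribution(est_rows_t1, est_rows_t2, num_units):
    """Pick the method with the lowest estimated cost, before execution."""
    costs = estimate_costs(est_rows_t1, est_rows_t2, num_units)
    return min(costs, key=costs.get)
```

Under this assumed model, a small estimated table favors broadcasting it, whereas two tables of similar estimated size favor hash-hash distribution. The choice is only as good as the row estimates supplied to it.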
Unfortunately, in many situations, estimates can differ significantly from the actual performance of the parallelized join. Thus, the optimizer might select a distribution method that proves to be ill-suited to the data. In some cases the optimizer might select a distribution method that results in a significant portion of the workload being performed by only one execution unit, thus leading to poor utilization of the execution units and possibly heavy performance penalties within the system.
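The uneven workload described above can arise, for instance, when rows are distributed by a hash of the join key and one key value dominates the data. The following sketch counts how many rows each execution unit would receive; the function name unit_loads and the sample key values are assumptions chosen purely for illustration.

```python
from collections import Counter

def unit_loads(keys, num_units):
    """Count the rows each execution unit receives under hash distribution."""
    loads = Counter(hash(k) % num_units for k in keys)
    return [loads.get(u, 0) for u in range(num_units)]

# Skewed input: 900 rows share the join key 0; keys 1..100 appear once each.
skewed_keys = [0] * 900 + list(range(1, 101))
print(unit_loads(skewed_keys, 4))  # [925, 25, 25, 25]
```

Here one execution unit handles 925 of the 1,000 rows while the other three handle 25 rows each, so the join finishes only when the overloaded unit does.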
Legacy solutions to this problem have focused on improving the estimates so that the best distribution method is picked at compile time. Unfortunately, as indicated earlier, there are many situations in which it is not possible to select the best distribution method until after execution has begun, and legacy systems do not implement techniques for switching to a different distribution method once execution of the join commences. Therefore, there is a need for an improved approach.