To fully utilize the computing power of a multi-processing system, a large task may be divided into smaller tasks (“work granules”) that are then distributed to processes (“slave processes”) running on one or more processing nodes. Each node in a multi-processing system may contain multiple processors and multiple concurrent processes. Thus, with parallel execution, multiple parallel server processes work together simultaneously to complete the large task. For example, the task may be the processing of a single SQL statement.
A SQL statement comprises either a query or a combination of a query and data manipulation operations to be performed on a database. The query portion and the data manipulation operations are herein referred to as “operations”. An “operation” performed in a parallel execution environment is hereafter referred to as a “parallel operation”.
The number of slave processes assigned to perform a single parallel operation is referred to as the degree of parallelism (DOP) for that operation. Thus, the higher the degree of parallelism, the more slave processes are assigned to work on the operation.
A task may be adapted for a parallel execution environment in order to use as many slave processes as are available in the multi-processing system. However, the process of breaking the task up must be performed efficiently or the cost of breaking up the task into work granules and distributing the work granules to the slave processes may exceed the benefit gained by performing the operation in parallel.
The task of selecting rows that contain a particular value from a database table may be adapted for a parallel execution environment by having the task broken up into work granules, where the work granules correspond to “partitions” of the database table. Each partition of the table has a subset of the rows in the table. The process of dividing a table into a set of partitions is referred to as “partitioning” the table. In another approach, the task may be broken up for a parallel execution environment by simply assigning each slave process a contiguous block of data by specifying the row identification range associated with each contiguous block of data.
One type of partitioning is hash partitioning. According to hash partitioning, one or more values from each row of the database table that is to be partitioned are applied to a hash function to produce a hash value. A separate partition or hash bucket is established for each possible hash value produced by the hash function, and rows that have values that hash to a particular value are stored within the partition that is associated with that hash value.
Hash partitioning increases the efficiency of processing certain types of queries. For example, when a query selects all rows that contain a particular value in the column that is used to perform the hash partitioning, the database server can apply the value in the query to the hash function to produce a hash value, and then limit the scan of the table to the partition that corresponds to the hash value thus produced.
For example, assume that a join is to be performed between two equi-joined tables, A and B, on a cluster of two nodes X and Y with a DOP of 8. Assume further that, to achieve the DOP of 8, 4 slave processes on each nodes X and Y are assigned to the task. If a hash-join method is selected to accomplish the join, then the rows from tables A and B are distributed into 8 hash partitions. However, the cost of redistributing the rows into hash partitions may be computationally expensive.
Assume that each of the eight slave processes is responsible for processing the rows of one of the eight hash partitions. The distribution of the rows from Tables A and B into 8 hash partitions may involve a significant amount of inter-node communication between the two nodes. For example, a slave process from node X may read a row from table A, determine that the hash partition for that row is assigned to a slave on node Y, and send the row to a hash partition that is located at node Y. The drawback of inter-nodal communication is that inter-nodal communication is typically computationally expensive and high latency relative to intra-node communication.
FIG. 1 is a block diagram that illustrates a hash join on a two-node cluster with a DOP of 8. The two nodes in the cluster are represented by node 102 and node 104. Node 102 has 4 slave processes 102a, 102b, 102c, 102d. Node 104 has 4 slave processes 104a, 104b, 104c, 104d. 
The slave processes 102a–d and 104a–d read rows from Tables A and B and send them to the appropriate hash partition. In the example shown in FIG. 1, there are eight hash partitions. Four hash partitions 102e–h are being handled by node 102. Four hash partitions 104e–h are being handled by node 104.
Assume that node 102 is able to directly access Table A and that node 104 is able to directly access Table B. In order to perform an equi-join between Tables A and B using the hash-join method, the rows from Tables A and B are hashed by applying the same hash function to both Tables A and Table B. Slave processes 102a–d read the hashed rows from Table A and send the hashed rows to the appropriate hash partition. Similarly, slave processes 104a–d read the hashed rows from Table B and send the hashed rows to the appropriate hash partition.
In FIG. 1, for the purpose of simplicity, only slave process 102c of node 102 and slave process 104b of node 104 are shown as sending hashed rows to hash partitions 102e–f, and 104e–f. As indicated in FIG. 1, distribution of hashed rows into hash partitions may involve inter-nodal communication when each of the slave processes 102a–d, and 104a–d is performing the distribution of the hashed rows from Tables A and B respectively.
An alternate method of performing parallel execution of the above equi-join operation is by using pre-existing partitions of the objects. For example, assume that Tables A and B have pre-existing partitions as shown in FIG. 2. FIG. 2 is a block diagram that illustrates a database system in which both tables of the join operation have pre-existing partitions. The rows from each pair (one partition from each table) of the pre-existing partitions may be sent to a separate slave process for performing the equi-join operation. The pre-existing partitions of an object are herein referred to as “static partitions”.
In FIG. 2, Table A 202 and Table B 220 have identical static partitions based on “productid”. The static partitions of Table A 202 are productid H1 204, productid H2 206, productid H3 208, and productid H4 210. The static partitions of Table B 220 are productid H1 222, productid H2 224, productid H3 226, and productid H4 228.
Productid H1 204 holds all rows in Table A that have productid values that fall in a particular range or hash to a particular hash value. Productid H1 222 holds all rows in Table B that have productid values that fall into that same particular range or hash to that same particular hash value. Hence, in a join between Tables A and B in which productid is the join key, rows in H1 204 can only possibly join with rows in H1 222, and visa versa. Consequently, productid H1 204 and productid H1 222 make a “partition pair”. Similarly, H2 206 and H2 224 make a partition pair; H3 208 and H3 226 make a partition pair; and H4 210 and H4 228 make a partition pair.
The rows that belong to a partition pair may be sent to a slave process for performing the parallel join operation. When the parallel join operation that has to be performed joins the Tables A and B based on the productid key, there is no need to re-distribute the data from Tables A and B into hash partitions. Thus, by using the static partitions of Tables A and B, the overhead associated with distributing the data into subsets for distribution to slave processes may be avoided.
Parallel join operations in which the joined objects are partitioned in an identical manner based on the join key, where the data is divided and distributed based on static partitions, are referred to herein as full parallel partition-wise joins.
The drawback of using static partitions of objects in parallel processing is that the DOP is limited by the number of static partitions of the objects because only one slave process is assigned each pair of static partitions. For example, tables A 202 and B 220 in FIG. B are each divided into four partitions, so the maximum DOP that may be used in a join operation between the two tables is four, even though resources are available to achieve a much higher DOP.
Based on the foregoing, there is a clear need for a mechanism for increasing the degree of parallelism without incurring overhead costs associated with inter-nodal communication for performing parallel operations.