One of the most basic operations performed against data in a database is the join operation. A join is used to combine multiple tables, so data from those tables may be selected in a query. A query that joins two tables specifies how the tables are joined. For example, a query may specify that rows in a first table only join with rows in a second table when the rows in the first table have a value in a particular column that matches the value in a particular column of rows in the second table. The column that contains the values that determine which rows of the first table join with which rows of the second table is referred to as the “join” column.
In a database cluster, the work of combining data from multiple tables and evaluating the conditions used to join the tables may be distributed across multiple nodes. Each node works on a small subset of data individually, and then relays the results to a single node. For example, assume a query joins a first table with a second table. The first table has rows A, B, C, and the second table has rows D, E, F. To distribute the work required by this query, one node in the cluster compares row A with rows D, E, F; another node compares row B with rows D, E, F, and a third node compares row C with rows D, E, F. Once all of the possible combinations of rows are evaluated, a single node aggregates the results and responds to the query.
A partition-wise join (PWJ) significantly reduces the work required to perform a query that joins multiple tables. Before evaluating the conditions of a query that joins two tables, the joined tables are separated into sub-tables, referred to as partitions, based on the values of the join column. Each partition has rows containing a set or range of values for the join column. When a query joins the two tables based on the values from the join column, the work for the join query is distributed based on partition. Each node performs a smaller join on a partition from each table containing the same set or range values, rather than comparing every row in one table against every row in another table.
For example, assume that each of the two tables from the previous example is partitioned into three partitions, and that “DATE” is the join column. Under these circumstances, each of the three partitions corresponds to a range of dates. For example, the first table may be divided into three partitions T1P1, T1P2, T1P3 that correspond to date ranges Jan. 1, 1980 to Dec. 31, 1989, Jan. 1, 1990 to Dec. 31, 1999, and Jan. 1, 2000 to Dec. 31, 2009. The second table may be divided into three partitions T2P1, T2P2, T2P3 that correspond to the same three date ranges Jan. 1, 1980 to Dec. 31, 1989, Jan. 1, 1990 to Dec. 31, 1999, and Jan. 1, 2000 to Dec. 31, 2009. For the purpose of explanation, it shall be assumed that rows A, B and C have dates that fall into partitions T1P1, T1P2 and T1P3, respectively, and that rows D, E, and F have dates that fall into partitions T2P1, T2P2 and T2P3, respectively.
When the two tables are joined based on the DATE column, the work distributed to each node becomes significantly less. Specifically, after the partitioning, both row A and row D are part of the same date range; row B and row E are part of the same date range; and row C and row F are part of the same date range. Using a partition-wise join, one node performs a join between T1P1 and T2P1 (which only requires row A to be compared with row D). Another node performs a join between T1P2 and T2P2 (which only requires row B to be compared with row E). Finally, a third node performs a join between T1P3 and T2P3 (which only requires row C to be compared with row F). Then the results are aggregated. Compared to the previous example, each node performs a third of the work in order to provide a result. Examples of partition-wise joins may be found in U.S. Pat. No. 6,609,131 filed Sep. 27, 1999, “PARALLEL PARTITION-WISE JOINS,” filed by Mohamed Zait et al., the entire contents of which is hereby incorporated by reference as if fully set forth herein.
In a typical partition-wise join, the work of performing a smaller join between two partitions may be assigned to any node in the cluster. Each node is presumed to have equally-fast access to the same on-disk copy of the partitioned tables, and any node may access any partition of any table. However, equally-fast access to data is not true in systems that pre-load entire database objects, or portions thereof, into volatile memory before a query requests the data. Such systems are described, for example, in U.S. patent application Ser. No. 14/377,179, entitled “Mirroring, In Memory, Data From Disk To Improve Query Performance”, filed Jul. 21, 2014, the contents of which is incorporated herein in its entirety. In such systems, assigning partition-wise join operations to nodes based on the assumption that all nodes have equally-fast access to all data may often lead to sub-optimal results.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.