Distributed joins can be used to join data from tables that are distributed horizontally among multiple nodes implementing a multi-node database management system. Traditional distributed join techniques exchange, over a network, all of the columns of tables that are required for the materialized result of a join operation, and hence cause the execution time of join queries to be dominated by network time. This technique is widely used and is commonly referred to as the “early-materialization” approach, in which all columns required for a join are stitched together before being sent over the network to be assembled into the materialized join result.
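By way of illustration only, the stitching step of an early-materialization approach can be sketched as follows. The sketch assumes a column-oriented partition held as a dictionary of equal-length lists; the names `stitch_rows` and `payload_size`, and the crude size estimate, are hypothetical and are not drawn from any particular system.

```python
def stitch_rows(columns, required):
    """Stitch the required columns of one partition into row tuples.

    `columns` maps a column name to a list of values (assumed layout);
    `required` lists every column needed for the materialized join result.
    """
    return list(zip(*(columns[name] for name in required)))


def payload_size(rows):
    """Rough estimate of bytes shipped: the encoded length of every value."""
    return sum(len(str(value).encode()) for row in rows for value in row)
```

Under these assumptions, every column named in `required` travels with every row, so the estimated payload grows with both the number of rows and the number of columns that the join result needs.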
The high network bandwidth required to exchange data during an early-materialization approach, for a join involving significant amounts of data, results in a processing bottleneck. As such, data reduction techniques such as Bloom filters are conventionally employed to reduce the amount of data exchanged. However, such data reduction techniques generally reduce only the number of rows being exchanged during an early-materialization approach, and still require a significant number of columns to be exchanged over the network.
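The following minimal sketch illustrates why a Bloom filter reduces rows but not columns. It assumes a simple filter built from SHA-256 digests and one byte per bit; the class `BloomFilter`, the function `rows_to_ship`, and all parameter values are hypothetical choices made for illustration.

```python
import hashlib


class BloomFilter:
    """Simplified Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._positions(key))


def rows_to_ship(probe_rows, key_index, bloom):
    """Drop rows whose join key cannot match; surviving rows keep every column."""
    return [row for row in probe_rows if bloom.might_contain(row[key_index])]
```

In this sketch the filter is populated with the join keys of one input and applied to the rows of the other before they are exchanged. Rows that cannot match are filtered out, but each surviving row is still shipped at its full width.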
Furthermore, early parallel database systems laid the foundation of distributed, parallel join processing. For example, in the Gamma project, row-based tuples are routed to processing nodes using hash-based split tables. Identical split tables are applied to both input relations, and hence the relations are distributed and partitioned row-by-row for independent parallel join processing. However, given the large amount of data shuffled between nodes in early systems such as Gamma (at times even including database data that is not required to be included in the join materialization), these systems suffer from the network data exchange bottleneck described above.
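The hash-based routing described above can be sketched as follows. This is a simplified, single-process approximation rather than Gamma's actual implementation: the function names are hypothetical, Python's built-in `hash` stands in for the split-table hash function, and the per-node joins run sequentially rather than in parallel.

```python
from collections import defaultdict


def split_table(rows, key_index, num_nodes):
    """Route each row-based tuple to the node chosen by hashing its join key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(row[key_index]) % num_nodes].append(row)
    return partitions


def local_hash_join(left_rows, right_rows, left_key, right_key):
    """Join the two partitions that were routed to the same node."""
    build = defaultdict(list)
    for left in left_rows:
        build[left[left_key]].append(left)
    return [left + right
            for right in right_rows
            for left in build.get(right[right_key], [])]


def partitioned_join(left, right, left_key, right_key, num_nodes):
    """Apply the same split to both inputs, then join node by node."""
    left_parts = split_table(left, left_key, num_nodes)
    right_parts = split_table(right, right_key, num_nodes)
    result = []
    for node in range(num_nodes):  # each iteration stands in for one node
        result.extend(local_hash_join(left_parts[node], right_parts[node],
                                      left_key, right_key))
    return result
```

Because the same split is applied to both inputs, rows with equal join keys always arrive at the same node, so each node can join its partitions independently. Note that whole tuples are routed, including any columns that the join result does not need.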
Also, the SDD-1 algorithm by Bernstein et al., introduced in the early years of distributed databases, aims to reduce network usage for distributed join processing. However, the algorithm is based on the premise that different database objects reside, each as a whole, on different nodes. This is not the case in many modern distributed database management systems, which generally partition database objects horizontally across multiple nodes of a system. Furthermore, the algorithm requires distributed execution, rather than parallel execution, of a distributed join operation, and does not consider column-oriented storage of database data.
In terms of late materialization being performed by a single machine, Manegold et al. propose the use of cache-conscious radix-decluster projections for low-selectivity joins to eliminate the random memory access costs arising during the materialization phase in a single machine. However, this solution is not applicable in modern distributed query processing systems because, for data that is horizontally partitioned among the nodes of a shared-nothing system, the various attributes of a tuple cannot consistently be accessed directly via record identifiers, as required by this technique. Moreover, the technique does not address network bandwidth issues in a distributed system. Upstream operators utilized by the technique require access to relevant attributes from the join result; therefore, in order to apply principles of this technique to a distributed system, those attributes would need to be shipped to the corresponding processing nodes within the distributed system.
Also dealing with joins on a single machine, Abadi et al. analyze the trade-offs among materialization strategies in a column-oriented database system and conclude that, for operators with highly selective predicates, it is almost always better to use a late-materialization strategy and that, for joins, the right input table should be early-materialized in almost all cases. Because Abadi et al. analyze materialization techniques for a single machine, they do not consider the issue of network bottlenecking or of performing materialization on a distributed system.
Further, the Track Join approach focuses on reducing redundant data transfers across the network by generating an optimal transfer schedule for each distinct join key. However, this approach changes the skeleton of the main partitioned join-processing algorithm and requires intensive CPU computation for each distinct join key. Moreover, it does not reduce the footprint of individual tuples. Overall, this approach trades additional CPU work for potentially fewer network data transfers.
Because the high network bandwidth required to exchange data during an early-materialization approach, for a distributed join operation involving significant amounts of data, results in a processing bottleneck, it would be beneficial to reliably reduce the amount of traffic exchanged over the network during distributed join operations without causing an increase in CPU usage.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.