In a distributed database setting, the processing of queries often requires that distributed “join” operations be performed. It is important that these distributed join operations be performed efficiently because joins are expensive in terms of communication costs. As a result, Bloom filters are used to compress the join-related attributes and thereby reduce the required communication bandwidth. With Bloom filters, instead of sending the actual data throughout the distributed system, a compressed form of the information, containing just enough information to test set membership, is distributed among the nodes participating in processing a query. Thus, Bloom filters are used in some database join operations to improve performance by obtaining key values from one table and using them to discard unqualified data from the other table(s), thereby reducing the data scanned, joined, and communicated within the distributed system.
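By way of illustration only, the filtering role of a Bloom filter in a join can be sketched as follows. This is a minimal sketch, not an implementation drawn from any particular database system: the class name `BloomFilter`, the parameters `num_bits` and `num_hashes`, the salted SHA-256 hashing scheme, and the `customers` and `orders` tables are all assumptions introduced for the example.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter; num_bits and num_hashes are illustrative parameters."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests of the key.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # True means "possibly present"; False means "definitely absent".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key)
        )


# Hypothetical tables: the smaller table supplies the join keys.
customers = [(1, "Ann"), (2, "Bob"), (3, "Cho")]
orders = [(101, 1), (102, 7), (103, 3), (104, 9), (105, 2)]  # (order_id, customer_id)

bf = BloomFilter()
for customer_id, _ in customers:
    bf.add(customer_id)

# On the node holding `orders`: discard rows whose key cannot match
# before any rows are sent across the network.
candidates = [row for row in orders if bf.might_contain(row[1])]

# The join itself still compares real keys, so any false positives that
# passed the filter are removed here and do not affect correctness.
customer_ids = {cid for cid, _ in customers}
joined = [row for row in candidates if row[1] in customer_ids]
```

Only the bit array `bf.bits` (128 bytes in this sketch) would need to be communicated to the node holding `orders`, rather than the key values themselves. A Bloom filter never produces false negatives, so every qualifying row survives the filter.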
Although the use of Bloom filters conceptually improves performance relative to the same processing without a Bloom filter, the Bloom filters themselves consume memory and need to be communicated among the nodes of the distributed database system that are involved in processing a query.
Typically, a Bloom filter is centrally built and then distributed to all of the nodes participating in the processing of a query. However, when a query is processed for a large analytic data set, the Bloom filters will also be large. Moreover, if that large analytic data set is part of a large distributed database system, those large Bloom filters will need to be communicated to all of the nodes participating in the processing of the query, which can consume significant bandwidth and introduce significant communication network latency. This is unacceptable for large analytic queries in mission-critical systems where performance is critical.
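To give a rough sense of scale, the standard sizing formula for an optimally configured Bloom filter, m = -n ln(p) / (ln 2)^2 bits for n keys at false positive rate p, can be evaluated for a hypothetical workload. The key count, false positive rate, and node count below are assumptions chosen purely for illustration and are not taken from any particular system.

```python
import math


def optimal_bloom_bits(num_keys, false_positive_rate):
    """Bits needed by an optimally sized Bloom filter: m = -n * ln(p) / (ln 2)^2."""
    return math.ceil(
        -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    )


# Hypothetical workload: these figures are assumptions for illustration.
num_keys = 100_000_000        # distinct join keys
false_positive_rate = 0.01    # target false positive rate
num_nodes = 64                # nodes participating in the query

filter_bytes = optimal_bloom_bits(num_keys, false_positive_rate) // 8
broadcast_bytes = filter_bytes * num_nodes  # one full copy per participating node

print(f"filter size: {filter_bytes / 1e6:.0f} MB")
print(f"total broadcast traffic: {broadcast_bytes / 1e9:.1f} GB")
```

Under these assumptions the filter needs a little under ten bits per key, so a single copy is on the order of a hundred megabytes, and sending a full copy to every node multiplies that cost by the number of nodes.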
In order to reduce the adverse bandwidth and latency effects caused by sending a centrally built Bloom filter, some approaches break the Bloom filter into pieces that are later merged at the individual nodes. However, this approach can have problems as well. At each node, multiple threads participate in the merge of the Bloom filters from each node to create the complete Bloom filter. As a result, the nodes must coordinate the sending and receiving of their Bloom filter components with each other, causing performance issues. In addition, within a node, some form of mutex locking mechanism must be used to prevent the threads of the node from writing to the same part of the Bloom filter at the same time, causing further, more localized performance issues.
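The intra-node contention described above can be sketched as follows. The sketch assumes that each piece is a bit array of the same length built with the same hash functions, so that the pieces can be combined by bitwise OR, and that a single mutex guards the shared filter. The names `merge_partials`, `NUM_BYTES`, and `num_threads`, as well as the sample pieces, are hypothetical.

```python
import threading

NUM_BYTES = 16  # deliberately tiny filter, for illustration only


def merge_partials(partials, num_threads=4):
    """OR every partial filter into one shared filter, guarded by a single mutex."""
    merged = bytearray(NUM_BYTES)
    lock = threading.Lock()

    def worker(chunk):
        for partial in chunk:
            # Without the lock, two threads could read-modify-write the same
            # byte and lose bits; with it, the merges are serialized.
            with lock:
                for i, byte in enumerate(partial):
                    merged[i] |= byte

    # Deal the received pieces out to the worker threads round-robin.
    chunks = [partials[t::num_threads] for t in range(num_threads)]
    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return bytes(merged)


# Hypothetical partial filters, one per sending node: piece n sets bit n of byte n.
partials = [
    bytes(1 << n if i == n else 0 for i in range(NUM_BYTES)) for n in range(8)
]
merged = merge_partials(partials)
```

Because every thread must hold the same lock while writing, the merge work within the node is effectively serialized regardless of how many threads participate, which is the localized performance cost noted above.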
Thus, there is an ongoing technological problem involving transferring and constructing Bloom filters at relevant nodes for use in connection with a database join operation.