Present invention embodiments relate to improving efficiency in executing database operations, and more specifically, to improving efficiency of join operations on objects contained in a distributed database.
In a data warehouse, the largest consumers of processing resources are GROUP BY and JOIN database operations. For distributed databases typically implemented in data warehouses, each partition of the distributed database performs a hash join, which has several associated costs. First, hash tables are often sparse, which wastes RAM (random access memory). Additionally, hashing does not always produce unique (one-to-one) values, which means the hash table must store key values so that collisions can be detected and/or avoided. Hash probing is also computationally expensive; it typically involves hash computation, random memory access even when the keys being probed are otherwise correlated, and key verification.
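The costs described above can be illustrated with a minimal sketch of an open-addressing hash table with linear probing. This is one common textbook layout, not necessarily the one used by any particular database; the function names `build_hash_table` and `probe`, the 0.5 load factor, and the assumption of unique keys are all hypothetical choices made for illustration.

```python
# Illustrative sketch only: a tiny open-addressing hash table with linear
# probing. The names and the sizing policy are hypothetical, and the keys
# are assumed to be unique.

def build_hash_table(keys, load_factor=0.5):
    """Place each key and its row index into a deliberately sparse slot array."""
    size = 1
    while size * load_factor < max(len(keys), 1):
        size *= 2  # power-of-two capacity; many slots stay empty
    slots = [None] * size
    for row, key in enumerate(keys):
        i = hash(key) & (size - 1)
        while slots[i] is not None:   # collision: step to the next slot
            i = (i + 1) & (size - 1)
        slots[i] = (key, row)         # the key is stored so probes can verify it
    return slots

def probe(slots, key):
    """Return the row index stored for key, or None if the key is absent."""
    size = len(slots)
    i = hash(key) & (size - 1)        # 1. hash computation
    while slots[i] is not None:       # 2. memory access at a hash-determined slot
        stored_key, row = slots[i]
        if stored_key == key:         # 3. key verification
            return row
        i = (i + 1) & (size - 1)
    return None
```

In this sketch, four keys occupy a table of eight slots, so half of the allocated memory holds nothing, and every probe performs the three steps listed in the paragraph above.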
Performing JOIN operations on a primary key or other uniquely valued column can be made more efficient by using direct lookup associative arrays where each key is associated with an index. Such an arrangement is very efficient when the join key values are dense or almost dense. However, in distributed databases, each database partition may have only a sparse subset of the join keys, thus defeating the potential efficiency improvements of a direct lookup arrangement. Thus, ongoing research and development efforts seek to optimize JOIN performance in partitioned database implementations.
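The trade-off between dense and sparse key sets can be sketched as follows. This is a minimal illustration assuming unique integer join keys; the helper names `build_direct_lookup`, `lookup`, and `fill_ratio`, and the use of -1 as an empty marker, are hypothetical and are not drawn from any specific database implementation.

```python
# Illustrative sketch only: a direct lookup array indexed by (key - min_key).
# Assumes unique integer keys; all names here are hypothetical.

def build_direct_lookup(keys):
    """Return (min_key, table) where table[key - min_key] is the row index."""
    lo, hi = min(keys), max(keys)
    table = [-1] * (hi - lo + 1)      # one slot per value in the key range
    for row, key in enumerate(keys):
        table[key - lo] = row
    return lo, table

def lookup(lo, table, key):
    """Return the row index for key, or None; no hashing or key comparison."""
    offset = key - lo
    if 0 <= offset < len(table):
        row = table[offset]
        return row if row >= 0 else None
    return None

def fill_ratio(table):
    """Fraction of slots that actually hold a row index."""
    return sum(1 for r in table if r >= 0) / len(table)

# Almost-dense keys: the array is nearly full.
dense_lo, dense_table = build_direct_lookup([100, 101, 102, 103, 105])

# A partition holding only a sparse subset of the keys: the array is mostly empty.
sparse_lo, sparse_table = build_direct_lookup([100, 4100, 9100])
```

With the almost-dense keys, five of the six slots are occupied. With the sparse subset, the same construction allocates 9001 slots to hold three rows, which is the situation a partition of a distributed database may face.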