1. Technical Field
Present invention embodiments relate to partitioning data, and more specifically, to partitioning data within a shared data storage unit for parallel processing of database or other operations.
2. Discussion of the Related Art
Large-scale analytics often require the processing power of multiple servers to compute the results of complex queries in a reasonable amount of time. Several scale-out architectures have been defined for this task, including massively parallel processor architectures. These popular architectures divide the processing of data across central processing units (CPUs) using various distribution techniques, such as hash distribution or round-robin distribution. However, a major limitation of these approaches arises in most real environments, where the data to be joined is not collocated on the same server in a server cluster and must therefore be transferred among the servers. The network traffic incurred to collocate the required data on the same CPU for a join operation is prohibitive.
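By way of illustration only, the two distribution techniques named above may be sketched as follows. The function names, the row layout, and the use of a CRC32 checksum as the hash function are assumptions made for this example and are not drawn from any particular system.

```python
import zlib


def hash_partition(rows, key, n_servers):
    """Assign each row to a server by hashing the value stored under `key`."""
    parts = [[] for _ in range(n_servers)]
    for row in rows:
        # CRC32 is an assumed stand-in hash, chosen because it is
        # deterministic across runs; real systems use their own functions.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % n_servers].append(row)
    return parts


def round_robin_partition(rows, n_servers):
    """Assign rows to servers in turn, ignoring row content."""
    parts = [[] for _ in range(n_servers)]
    for i, row in enumerate(rows):
        parts[i % n_servers].append(row)
    return parts


# Hypothetical sample data: six rows keyed by customer.
rows = [
    {"cust_id": c, "amount": a}
    for c, a in [(1, 10), (2, 20), (1, 30), (3, 40), (2, 50), (1, 60)]
]
by_hash = hash_partition(rows, "cust_id", 3)
by_rr = round_robin_partition(rows, 3)
```

In this sketch, hash distribution places all rows sharing a key value on one server, whereas round-robin distribution balances row counts but may scatter rows with the same key value across servers.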
Some systems collocate join tables through a common hashing algorithm. However, this approach can collocate data along only a single dimension; the data is not collocated along the other dimensions. Accordingly, join operations along these non-collocated dimensions require the use of broadcast or directed joins. These joins direct each server to send, to every other server, all rows from that server's hash partition of one or more of the tables being joined, and are therefore far less efficient.
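The single-dimension limitation may be illustrated with a minimal sketch. It assumes a hypothetical table of orders that is hash-distributed on a customer identifier and counts how many rows would need to be sent to a different server to join along a given column. The names `server_for` and `rows_to_move`, the column names, and the CRC32 hash are all assumptions made for this example.

```python
import zlib


def server_for(value, n_servers):
    """Map a key value to a server using an assumed, deterministic hash."""
    return zlib.crc32(str(value).encode("utf-8")) % n_servers


def rows_to_move(fact_rows, placed_by, join_key, n_servers):
    """Count rows whose current server (chosen by hashing `placed_by`)
    differs from the server that a hash on `join_key` would select."""
    moved = 0
    for row in fact_rows:
        current = server_for(row[placed_by], n_servers)
        target = server_for(row[join_key], n_servers)
        if current != target:
            moved += 1
    return moved


# Hypothetical orders table, distributed on cust_id.
orders = [{"cust_id": c, "prod_id": p} for c in range(20) for p in range(5)]

# Join along the distribution dimension: every row is already in place.
same_dim = rows_to_move(orders, "cust_id", "cust_id", 4)

# Join along another dimension: rows generally must cross the network.
other_dim = rows_to_move(orders, "cust_id", "prod_id", 4)
```

Here `same_dim` is zero because the join column is also the distribution column, whereas `other_dim` depends on how the two hash values happen to align and is generally nonzero.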