A common operation on a relational database is to join two tables (relations): A={a1, a2, . . . , an} and B={b1, b2, . . . , bm}, where |A|≦|B|. Notionally, this is achieved by firstly forming the Cartesian product of the two tables to produce a table T=A×B={aibj|aiεA, bjεB}. Thus if A contains I tuples and N attributes and B contains J tuples and M attributes, T contains (I*J) tuples and (N+M) attributes. Secondly, we apply the join condition θ to each row in T, eliminating rows that do not match to form a result table J={Ti|θ(Ti)}.
As an example of a join operation, table A may represent a listing of customers, complete with customer name, number and address, while table B may represent a listing of pending orders, with order number, product, and customer number. In this example, the join operation may match customer number from table A and B to provide a table of orders, with the customer name and address being provided for each order. Note that in this example the join condition (predicate) is simple equality of customer number between the two tables, but in other cases a more complicated predicate may be used. For example, the customer table may include a current balance value and a credit limit value, while the order table may include an order value. In this case, in addition to the join on customer number, the predicate may include the condition that the order value plus current balance is less than the credit limit.
When forming T, each row from A must be paired with each row from B exactly once. If a row from A is paired with a row from B more than once, we will get duplicate rows in T and therefore maybe in J. If a row from A is not paired with a row from B, then it will be missing in T and therefore in J.
As recognised in “Database Systems” by Connolly, Begg and Strachan, Addison-Wesley, 1996, ISBN 0-201-422787-8 (see especially page 104), a join operation is one of the most difficult operations to implement efficiently in a relational database management system (RDBMS), and this can lead to performance problems. In a single processor system, the join operation can be implemented in a relatively straightforward manner, for example by iteration over the rows of table A and table B. However, the use of such a single processor system may not provide satisfactory overall performance, especially in situations where tables A and/or B may be very large.
The WX2 product from Kognitio Ltd is based on the use of many parallel processors (nodes) to provide greatly improved performance for database operations. Distributing database operations across many processors provides a very powerful, scalable and cost-effective solution. However, when using a distributed or parallel solution for performing join operations, great care is required to ensure that each row from the first table is paired once and only once with each row from the second table.
One approach for a distributed join operation is to use a replicated distribution for the smaller table, and a partial distribution for the larger table. In the replicated distribution, the smaller table is copied in its entirety to each node: N1, N2, . . . Nn. In the partial distribution, each row of the larger table is sent to one (and only one) node. The allocation of rows to nodes may be made via any suitable mechanism—e.g. on a random basis, a round-robin scheme, or using a hash calculated for each row. If the different nodes are of (approximately) equal processing capability, then it is generally desirable to assign approximately the same number of rows from the larger table to each node. In this case, using a hash might not give a uniform assignment, depending on the data, and so may not be appropriate in such circumstances.
Suppose table A is replicated and table B is partially distributed. Each row bi is on one (and only one) node, and can be joined to every row in A because a copy of A is on each node. This then allows each pair of rows from table A and table B to be joined exactly once (as required). To select all the rows from the partial distribution (table B), we would select all the rows from each of the nodes. To select all the rows from the replicated distribution (table A), we would select all the rows from one of the nodes.
In the partial/replicated approach, each node contains |A| rows from A and |B|/n rows from B (where n is the number of nodes in the system). The processing requirements for this approach can then be represented as:                Time: O(|A∥B|/n)        Space on each node: O(|A|+|B|/n)        Total space: O(n|A|+|B|)        
In a conventional distributed system, each processing node is provided with local memory, and also has access to disk storage (whether local or shared). Since data access from memory is much faster than from disk, it is highly desirable for each node to be able to hold the replicated version of table A and the partial version of table B in memory. If O(|A|+|B|/n) exceeds the available memory due to the size of the (partially) stored table B, then this can be addressed by adding more nodes to the system (increasing n decreases |A|+|B|/n). On the other hand, if no more nodes are available, or if the available memory capacity is exceeded due to the size of table A, then disk storage can be used to hold the additional data, although this will have a significant impact on processing efficiency. This is especially a problem for in-memory databases that are optimised for tables that are small enough to fit in memory to avoid disk or inter-node I/O where possible.
One way to reduce overall storage requirements is to provide shared memory for at least some subsets of the nodes. The nodes in the subset can then share the replicated table rather than having multiple copies (one for each node). Assuming that the shared memory access is not slower than normal memory access, this reduces overall storage requirements. However, this approach is only applicable for systems with such a shared memory architecture.
Another possible approach for addressing storage issues is by using column indices for the tables to perform the join operation. Thus in the example mentioned above involving a customer table and an order table, if there were an index over customer number for both tables, these indices could be used instead of the whole tables to identify which rows (tuplets) to include in the output table. However, the use of indices brings its own problems, such as increased storage requirements, and also limited flexibility (since indices are usually only provided for selected attributes).
A special case arises for equi-joins, in which the join condition θ is a conjunction comprising one or more conditions of the form E1=E2, where E1 is an expression using columns only from table A and E2 is an expression using columns only from table B (the customer-order example given above is an equi-join). In such circumstances, it is possible to use a Hashed-{Fi} distribution across both tables to provide a better allocation of the data sets across the nodes. In particular, assuming a good hash function H, a table T and a set of expressions {Fi} using columns of T, each row from T may be sent for the join operation to the node calculated as a function of H({Fi}), for example to node Nh, where h=H({Fi}) mod n+1.
Consider therefore the situation where table A is Hashed-{A1,A2} and table B is Hashed-{B1,B2}, and we wish to join them using the condition: A1=B1^A2=B2. For each pair of rows ai, bj where the join condition is true, the hashes H({A1,A2}) and H({B1,B2}) are equal. Consequently, both of these rows will be on the same node, thereby allowing the join to be detected and performed. Furthermore, these rows will not occur on any other nodes, and so each pair of rows will join exactly once as required.
As an example of this approach, consider the customer-order join mentioned above. In this case, the hashing would be performed on customer number for both tables, so that all rows having a given customer number would end up on the same processing node (irrespective of whether the row originated from the customer table or from the order table). Consequently, the join could be performed just by having each node compare the rows hashed to that particular node.
For an average case with a fairly uniformly hashed set of data, each node would be allocated approximately |A|/n rows from table A and |B|/n rows from table B. This leads to the following processing requirements:                Time: O(|A∥B|/n2)        Space on each node: O((|A|+|B|)/n)        Total space: O(|A|+|B|)Note that these processing requirements are advantageous compared to the partial/replicated approach. Furthermore, since table A is now distributed across the nodes as well as table B, tables of any size can be accommodated by the addition of further processing nodes.        
Although hashing both tables can be very effective, there are many situations where it does not produce such good results, or cannot be applied at all. For example, if the data does not hash uniformly, a few nodes will have a disproportionate number of rows, thereby reducing efficiency (since the remaining nodes are operating below capacity). Furthermore, such an approach does not scale well to large tables, since all the rows are in effect restricted to a small, fixed number of nodes. As an example, such a problem might arise for an equi-join with few (≦√n, where “√” denotes the square root operator)distinct values such as TRUE/FALSE or days of the week. In such circumstances, the hashed approach may actually provide a less effective solution than the partial/replicated approach described above.
Furthermore, the hashing approach is restricted to certain types of predicate, and cannot be used with other types of predicate such as ≦ or “between”. This is because with such a predicate, it cannot be ensured that rows that satisfy the predicate will both hash to the same node. For example, there may be a range of attribute values that satisfy a given predicate, and the hash may assign such a row to one of various possible nodes, depending on the particular value of the attribute within the range. Accordingly, this type of join must use the partial/replicated approach set out above, despite the problems that can arise, especially with very large data sets.