An index is an ordered collection of numbers or character strings, or both, such that some numbers or character strings represent objects and other numbers or character strings represent information about these objects. For example, one form of index can be viewed as a table whose rows represent documents and whose columns represent attributes of these documents. Such a column in a table T can be referred to as T.attribute.
Joins are a class of data-processing operations that may be performed on indexes. Joins are used to match documents from different indexes by finding matching values of certain attributes of these documents. As an illustrative example, FIG. 1 shows two indexes A and B. In index A, documents represent customers and their attributes are surname, first name, and city. In index B, documents represent suppliers of certain goods and their attributes are company name and city. From indexes A and B, a table listing customers and companies in their home towns can be created. That is, the following SQL-like “join query” is evaluated:
SELECT A.surname, A.first_name, B.company FROM A, B WHERE A.city=B.city
To evaluate this join, these two indexes are joined by join attributes A.city and B.city. The corresponding documents from the two indexes are merged by matching their values of the join attributes. This yields the table of values shown in FIG. 3.
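As a minimal sketch of the matching step described above, the following Python fragment evaluates the same join on a single machine by building a lookup on B.city and probing it with each row of A. The function name join_on_city and the sample rows are hypothetical and are not taken from FIG. 1 or FIG. 3; this is one common way to compute such a join, not necessarily the process contemplated here.

```python
from collections import defaultdict

def join_on_city(index_a, index_b):
    """Equi-join two lists of dict rows on their 'city' attribute."""
    # Build a lookup from each city to the companies located there.
    companies_by_city = defaultdict(list)
    for supplier in index_b:
        companies_by_city[supplier["city"]].append(supplier["company"])

    # Probe the lookup once per customer and emit one row per match.
    result = []
    for customer in index_a:
        for company in companies_by_city.get(customer["city"], []):
            result.append((customer["surname"], customer["first_name"], company))
    return result

# Hypothetical sample data (assumed for illustration only).
index_a = [
    {"surname": "Smith", "first_name": "Anna", "city": "Berlin"},
    {"surname": "Jones", "first_name": "Ben", "city": "Paris"},
    {"surname": "Lee", "first_name": "Chris", "city": "Oslo"},
]
index_b = [
    {"company": "Acme", "city": "Berlin"},
    {"company": "Globex", "city": "Paris"},
    {"company": "Initech", "city": "Berlin"},
]

rows = join_on_city(index_a, index_b)
```

In this sketch a customer whose city has no supplier contributes no row, and a customer whose city has several suppliers contributes one row per supplier.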
Consider a distributed landscape in which indexes are hosted on separate machines. One problem is that the amount of network traffic required to compute such a distributed join may be the main factor limiting the performance achievable with a given join process. Conventional processes for computing the join may require network traffic proportional to the size of the join table. If, for two indexes, one index has N rows and the other index has M rows, then the join table may consist of as many as N*M rows, and has all the requested attributes from both tables.
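The N*M bound can be illustrated with a short Python sketch that counts the rows of an equi-join from the join-attribute values alone. The helper name join_row_count and the example values are assumptions made for illustration. The bound is reached when every row in both indexes carries the same join value.

```python
from collections import Counter

def join_row_count(values_a, values_b):
    """Number of rows in the equi-join of two join-attribute columns."""
    counts_a = Counter(values_a)
    counts_b = Counter(values_b)
    # Each value contributes (occurrences in A) * (occurrences in B) rows.
    return sum(n * counts_b[value] for value, n in counts_a.items())

# Worst case: N = 4 and M = 3 rows that all share one city.
worst_case = join_row_count(["Berlin"] * 4, ["Berlin"] * 3)

# No shared cities: the join table is empty.
disjoint = join_row_count(["Berlin", "Paris"], ["Oslo", "Rome"])
```

Because a conventional process ships rows of the join table between hosts, the worst case above would transfer on the order of N*M rows even though the two inputs together hold only N+M rows.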
The problem is exacerbated in situations where indexes are too large for a single machine. Such indexes may be split up and stored on different servers. To process join queries over such distributed indexes, it may be necessary to transfer even more data over the network than in the case where each index has its own host. What is needed is a join method that minimizes network traffic.