1. Field of the Invention
The present invention relates generally to databases and, more specifically, to performing join operations on a distributed system.
2. Description of the Background Art
As distributed query processing systems, such as distributed databases, are used in complex network environments, it becomes necessary to carefully account for the amount of data transferred between individual processing modules in the distributed system. Large queries sent over a busy network can result in millions of rows of data being returned over a potentially saturated network connection, slowing down processing of the query.
In a distributed query processing system where a first processing module and a second processing module within the system each have a set of tables of a database, it is often optimal to maintain only a local set of tables at each processing module location. For example, the first processing module may have access to tables A, B, and C, and the second processing module may have access to tables D, E, and F, but neither has copies of the other's tables. This is typically done to avoid the cost of synchronizing updates to tables shared between the first processing module and the second processing module.
However, for an operation to retrieve data from two or more tables, where at least two of the tables are located at separate processing modules from each other, the data must be retrieved from one or both of the modules to a central location in order to perform the operation. A join operation, which requires the combination of data from two or more tables, is such an operation.
Previous approaches to this problem, with regard to join operations, include the use of a merge join. In a merge join, qualifying rows (i.e., only those rows needed for the join operation) are retrieved from a remote server to a local server, and the join operation is performed at the local server. While this is acceptable if there are few qualifying rows, the operation becomes more expensive as the number of rows retrieved from the remote server grows. Other approaches involve the use of nested loops to obtain data from remote tables, where each iteration of the loop requires a scan of the remote table. This approach suffers from the cost of starting scan operations over a network.
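By way of illustration only, the cost trade-off between these two previous approaches may be sketched as follows. The sketch is a simplified in-memory model, not an implementation of any particular system: the `RemoteModule` class, its `scan` method, and the `scans_started` and `rows_shipped` counters are hypothetical names introduced here to stand in for a remote server and its network costs.

```python
class RemoteModule:
    """Hypothetical stand-in for a remote server holding one table of (key, value) rows."""

    def __init__(self, rows):
        self.rows = list(rows)
        self.scans_started = 0  # number of scans started over the "network"
        self.rows_shipped = 0   # number of rows returned to the local server

    def scan(self, predicate=lambda row: True):
        """Start one scan and ship every row that satisfies the predicate."""
        self.scans_started += 1
        result = [row for row in self.rows if predicate(row)]
        self.rows_shipped += len(result)
        return result


def merge_join(local_rows, remote, qualifies=lambda row: True):
    """Retrieve the qualifying remote rows in a single scan, then merge locally."""
    left = sorted(local_rows, key=lambda row: row[0])
    right = sorted(remote.scan(qualifies), key=lambda row: row[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            # Emit every pairing of the rows that share this key.
            key, group_start = left[i][0], j
            while i < len(left) and left[i][0] == key:
                j = group_start
                while j < len(right) and right[j][0] == key:
                    out.append((key, left[i][1], right[j][1]))
                    j += 1
                i += 1
    return out


def nested_loop_join(local_rows, remote):
    """Start a new remote scan for each local row (one scan per loop iteration)."""
    out = []
    for key, local_value in local_rows:
        for _, remote_value in remote.scan(lambda row: row[0] == key):
            out.append((key, local_value, remote_value))
    return out


local_table = [(1, "a"), (2, "b"), (3, "c")]
remote_table = [(2, "x"), (3, "y"), (3, "z"), (4, "w")]

merge_remote = RemoteModule(remote_table)
merge_result = merge_join(local_table, merge_remote)

loop_remote = RemoteModule(remote_table)
loop_result = nested_loop_join(local_table, loop_remote)
```

In this model both functions produce the same joined rows, but the counters differ: the merge join starts a single scan and ships every qualifying remote row, whereas the nested loop starts one scan per local row. Which cost dominates depends on the number of qualifying rows and on the overhead of starting a scan, which is the trade-off described above.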
Accordingly, what is desired is a means for improving a join operation where retrieving data rows from a remote server is expensive.