1. Field of the Invention
This invention relates to computing systems and more specifically to distributed query processing in data storage systems.
2. Description of the Related Art
Database systems managing large amounts of data may distribute and/or replicate that data across two or more machines, often in different locations, for any of a number of reasons, including security issues, disaster prevention and recovery issues, data locality and availability issues, etc. These machines may be configured in any number of ways, including as a shared resource pool, such as in a grid computing architecture.
Some commercially available database systems offer distributed query processing for use on such resource pools. With traditional distributed query processing, a query is divided into local and remote parts. A single query processor builds a query for the local part and uses special interfaces and/or middleware to allow the local node to access remote nodes. In such systems, the remote node must be available throughout the processing; otherwise, the system hangs. Changing a remote node in these systems requires a static reconfiguration of the system. Consequently, these systems may scale only to a small number of nodes, on the order of four to sixteen.
In distributed query processing, a query submitted by a client using a declarative query language (e.g., XQuery, SQL, OQL, etc.) is transformed into a procedural query execution plan, i.e., a network of relational algebra operators. The query itself specifies only what to retrieve/examine/manipulate, whereas the query execution plan specifies a precise order in which operations are to be applied.
The general workflow in query processing typically involves the following four steps:
generate a query execution plan (sometimes referred to as compilation);
instantiate the plan (i.e., allocate all resources needed for execution);
execute the plan (i.e., process the data); and
release the resources.
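By way of illustration only, the four steps above may be sketched in Python as follows. The names used here (Plan, Resources, compile_plan, instantiate, execute, release, run_query) and the fixed filter-then-project pipeline are hypothetical and are not drawn from any particular database system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, int]
Operator = Callable[[List[Row]], List[Row]]


@dataclass
class Plan:
    # Ordered operators; each maps a list of rows to a list of rows.
    operators: List[Operator]


@dataclass
class Resources:
    # One output buffer per operator (a stand-in for real execution resources).
    buffers: List[List[Row]] = field(default_factory=list)
    released: bool = False


def compile_plan(min_value: int) -> Plan:
    """Step 1: generate a plan (here, a fixed filter-then-project pipeline)."""
    def filter_op(rows: List[Row]) -> List[Row]:
        return [r for r in rows if r["value"] >= min_value]

    def project_op(rows: List[Row]) -> List[Row]:
        return [{"id": r["id"]} for r in rows]

    return Plan(operators=[filter_op, project_op])


def instantiate(plan: Plan) -> Resources:
    """Step 2: allocate the resources the plan needs for execution."""
    return Resources(buffers=[[] for _ in plan.operators])


def execute(plan: Plan, resources: Resources, rows: List[Row]) -> List[Row]:
    """Step 3: apply the operators in plan order, buffering each output."""
    current = rows
    for i, op in enumerate(plan.operators):
        resources.buffers[i] = op(current)
        current = resources.buffers[i]
    return current


def release(resources: Resources) -> None:
    """Step 4: release the resources."""
    resources.buffers.clear()
    resources.released = True


def run_query(min_value: int, rows: List[Row]) -> List[Row]:
    plan = compile_plan(min_value)
    resources = instantiate(plan)
    try:
        # Copy the result so it remains valid after the buffers are released.
        return list(execute(plan, resources, rows))
    finally:
        release(resources)
```

In this sketch the release step is placed in a finally clause so that resources are returned even if execution fails, which is one common design choice rather than a requirement of the workflow.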
Query plan compilation typically involves both a translation operation (e.g., translating a query from a declarative query language into an internally executable format) and an optimization operation. The query optimizer's choice of an execution plan is not only dependent on the query itself but is strongly influenced by a large number of parameters describing the database and the hardware environment. Modifying these parameters in order to steer the optimizer to select particular plans is difficult in these systems since this may involve anticipating often complex search strategies implemented in the optimizer.
Conventional database systems that offer “remote query capabilities” generate query plans that are always executed locally. These plans may be cut up and plan segments may be identified for execution on remote nodes. However, the segments that are to be executed remotely are removed from the plan and encoded in terms of query language expressions that rely on operators functioning as interfaces to the remote nodes.
Cost-based query optimizers typically consider a large number of candidate execution plans, and select one for execution. The choice of an execution plan is the result of various, interacting factors, such as database and system state, current table statistics, calibration of costing formulas, algorithms to generate alternatives of interest, and heuristics to cope with the combinatorial explosion of the search space.
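By way of illustration only, the selection of one plan from a set of candidates may be sketched in Python as follows. The cost model (the sum of estimated intermediate result sizes for a left-deep join order, with a single uniform join selectivity) and the names plan_cost and choose_plan are simplifying assumptions made for this example. Practical optimizers use far richer statistics and rely on heuristics rather than the exhaustive enumeration shown here.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple


def plan_cost(order: Sequence[str],
              cardinality: Dict[str, int],
              selectivity: float) -> float:
    """Toy cost: sum of estimated intermediate result sizes for a
    left-deep join that joins the tables in the given order."""
    size = float(cardinality[order[0]])
    cost = 0.0
    for table in order[1:]:
        # Assumed estimate: |R join S| = |R| * |S| * selectivity.
        size = size * cardinality[table] * selectivity
        cost += size
    return cost


def choose_plan(cardinality: Dict[str, int],
                selectivity: float = 0.01) -> Tuple[str, ...]:
    """Enumerate every join order and return one with the lowest cost."""
    candidates = permutations(sorted(cardinality))
    return min(candidates,
               key=lambda order: plan_cost(order, cardinality, selectivity))


if __name__ == "__main__":
    stats = {"orders": 1000, "regions": 10, "customers": 100}
    best = choose_plan(stats)
    print(best, plan_cost(best, stats, 0.01))
```

Because the number of join orders grows factorially with the number of tables, exhaustive enumeration of this kind quickly becomes impractical, which is the combinatorial explosion that the heuristics mentioned above are meant to contain.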
Query compilation is typically very expensive (in particular, the optimization operation). Therefore, database systems often cache compiled query plans and re-use them if possible. The re-use of these plans is limited by their flexibility. If the plan relies in a stringent way on volatile information, i.e., information that changes between compilation and execution, the plan cannot be re-used and must be recompiled. For example, in typical distributed query processing systems, query plans are heavily dependent on the location of data on particular nodes. If one or more nodes in the system change (e.g., if nodes are often added or removed, or if they fail), the cached query plans may be invalidated and may need to be recompiled at significant cost.
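By way of illustration only, the dependence of cached plans on node configuration may be sketched in Python as follows. The class PlanCache, its get_plan method, and the use of the set of node names as the volatile information are assumptions made for this example and do not describe any particular database system.

```python
from typing import Callable, Dict, FrozenSet, Hashable, Iterable, Tuple


class PlanCache:
    """Caches one compiled plan per query, valid only for the node set
    that was in place when the plan was compiled."""

    def __init__(self, compile_fn: Callable[[str, FrozenSet[str]], Hashable]):
        self._compile = compile_fn
        self._plans: Dict[str, Tuple[FrozenSet[str], Hashable]] = {}
        self.compilations = 0  # counts the expensive compilation operations

    def get_plan(self, query: str, nodes: Iterable[str]) -> Hashable:
        topology = frozenset(nodes)
        entry = self._plans.get(query)
        if entry is not None and entry[0] == topology:
            # Node set unchanged since compilation: re-use the cached plan.
            return entry[1]
        # No cached plan, or the node set changed: recompile and replace.
        self.compilations += 1
        plan = self._compile(query, topology)
        self._plans[query] = (topology, plan)
        return plan


if __name__ == "__main__":
    cache = PlanCache(lambda q, nodes: (q, tuple(sorted(nodes))))
    cache.get_plan("SELECT 1", ["n1", "n2"])
    cache.get_plan("SELECT 1", ["n1", "n2"])   # cache hit
    cache.get_plan("SELECT 1", ["n1", "n3"])   # node change forces recompilation
    print(cache.compilations)
```

In this sketch any change to the node set invalidates the cached plan for a query, so a system in which nodes are frequently added, removed, or lost would pay the compilation cost repeatedly.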