The invention relates generally to the field of query processing in a network of computational resources. More specifically, the invention relates to a method and a computer program product for formulating an integrated cost model to optimize the execution of a distributed query in a network of computational resources.
Data in an enterprise is stored in one or more heterogeneous formats at geographically separate locations. The disparate and geographically separate data sources in an enterprise can be integrated by using distributed computing technologies such as data grids. These technologies enable seamless integration of data sources. The integration is achieved through design and development of a distributed query engine.
Numerous approaches to distributed query optimization have been proposed that reduce either the communication cost or the response time. Some of the approaches that minimize communication cost use ‘semi-joins’ to reduce the amount of data transferred to remote nodes during a join operation. The cost and benefit of a semi-join are estimated for each pair of relations referenced in the query, and the most profitable join is recursively selected for query processing. Further, approaches that minimize response time utilize parallel processing techniques to achieve enhanced query optimization.
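By way of illustration only, the data reduction obtained from a semi-join may be sketched as follows. The relation names, the row layout and the row-count measure of transfer cost are assumptions made for this example and are not taken from any particular prior approach.

```python
# Illustrative sketch: compares how many rows cross the network when joining
# relation R (at site A) with relation S (at site B), with and without a
# semi-join. All names and data are hypothetical.

def naive_join(r_rows, s_rows, key):
    """Ship every row of S to R's site, then join there."""
    shipped = list(s_rows)  # all of S crosses the network
    result = [{**r, **s} for r in r_rows for s in shipped if r[key] == s[key]]
    return result, len(shipped)

def semi_join(r_rows, s_rows, key):
    """Ship only R's distinct join keys to S's site, filter S there,
    and ship back only the matching S rows."""
    r_keys = {r[key] for r in r_rows}                    # sent from A to B
    reduced_s = [s for s in s_rows if s[key] in r_keys]  # filtered at B
    result = [{**r, **s} for r in r_rows for s in reduced_s if r[key] == s[key]]
    return result, len(r_keys), len(reduced_s)

orders = [{"cust_id": 1, "item": "A"}, {"cust_id": 2, "item": "B"}]
customers = [{"cust_id": i, "name": f"c{i}"} for i in range(1, 6)]

naive_result, naive_rows_shipped = naive_join(orders, customers, "cust_id")
semi_result, keys_shipped, rows_returned = semi_join(orders, customers, "cust_id")

print("naive join ships", naive_rows_shipped, "full rows")
print("semi-join ships", keys_shipped, "keys and", rows_returned, "full rows")
```

In this sketch both strategies produce the same join result, but the semi-join sends only the join keys in one direction and only the matching rows in the other. Whether that is a net benefit depends on the selectivity of the join, which is why the cost and benefit are estimated before a semi-join is chosen.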
In the approaches that aim to reduce communication costs and response times, query evaluation is performed in three distinct phases: creation of a single-node plan, generation of a parallel plan, and site selection for plan execution. In the first phase, conventional query optimization is employed to determine the optimal single-node query plan. Subsequently, in the second phase, the single-node plan is split into parallel plans by introducing exchange operators into the single-node plan. The generated parallel plans are then allocated to different machines for execution. In the last phase, optimal query scheduling techniques are employed to minimize the communication cost and thereby improve query evaluation performance.
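By way of illustration only, the three phases described above may be sketched as three separate functions. The operator names, the fixed single-node plan and the round-robin site assignment are hypothetical stand-ins chosen for brevity; in particular, round-robin assignment merely takes the place of the scheduling techniques mentioned above and is not one of them.

```python
# Illustrative sketch of a three-phase pipeline. Each phase runs on the output
# of the previous one and sees no node-level information.
from dataclasses import dataclass, field

@dataclass
class Op:
    name: str
    children: list = field(default_factory=list)

def single_node_plan():
    """Phase 1 (stand-in): return a fixed plan as if chosen by a
    conventional single-node optimizer."""
    return Op("join", [Op("scan R"), Op("scan S")])

def parallelize(op):
    """Phase 2: place an exchange operator above each input of a join."""
    kids = [parallelize(c) for c in op.children]
    if op.name == "join":
        kids = [Op("exchange", [k]) for k in kids]
    return Op(op.name, kids)

def fragments(plan):
    """Cut the plan at exchange operators; each fragment is a list of
    operator names that can run on one machine."""
    frags = []

    def walk(node, current):
        current.append(node.name)
        for child in node.children:
            if child.name == "exchange":
                new_fragment = []
                frags.append(new_fragment)
                walk(child.children[0], new_fragment)
            else:
                walk(child, current)

    root = []
    frags.append(root)
    walk(plan, root)
    return frags

def assign_sites(frags, sites):
    """Phase 3 (stand-in): assign fragments to sites round-robin."""
    return {i: sites[i % len(sites)] for i in range(len(frags))}

plan = parallelize(single_node_plan())
frags = fragments(plan)
print(frags)
print(assign_sites(frags, ["node1", "node2"]))
```

Because each function in this sketch consumes only the output of the previous one, decisions made in an earlier phase cannot take into account the sites chosen in the last phase.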
However, such approaches optimize each phase of the query independently and in isolation, which results in sub-optimal plans. During the first phase, an optimal single-node plan is created without considering node-level parameters, such as available memory, processing speed, and other resource-scheduling parameters. Therefore, the plan deemed optimal in the first phase may be an inefficient query plan. Further, there is no integrated query processing method that considers the node-level and resource-scheduling parameters in all three phases of query optimization. The challenge lies in developing a distributed query processing engine that can generate an optimal query execution plan to reduce query response times.
In light of the foregoing, there is a need for an integrated distributed query optimization model that includes node-level and database-related parameters. Moreover, there is a need for such a model to provide an enhanced query response time.