Because business needs are diverse, many organizations and companies store their data in varied databases including, but not limited to, transactional databases, data warehouses, business intelligence systems and legacy systems. To access data held in such heterogeneous databases, the data must be integrated so that it is easily and reliably accessible. Data federation systems enable seamless, flexible and virtual integration of distributed data sources in an on-demand, real-time manner. A data federation system provides an integrated view of data from heterogeneous sources by offering a unified front-end for data access, which obviates the need to move or copy the data. One of the core components of a data federation system is a Distributed Query Processing (DQP) engine. DQP involves querying distributed data by partitioning a query into several sub-queries and processing them in parallel on multiple machines.
A basic DQP engine comprises three main components: a query parser, a query optimizer and a query evaluation engine. The query parser is configured to perform semantic analysis in order to validate the databases, tables and attributes involved in the query. After the query parser has parsed the query, the query optimizer creates one or more sub-plans and schedules each sub-plan on a query evaluator. Each query evaluator, on receiving a sub-plan, evaluates the one or more operators specified in the sub-plan and sends the results back to the query optimizer. If query execution fails on any evaluator node, the overall query execution stalls and the query has to be re-executed. In real-time computing systems with mission-critical performance needs, re-executing the entire stalled query can prove very expensive. Because the work of the first attempt is discarded, the resources and the time needed to complete the query successfully can be as much as twice those of a failure-free execution. As a result, the cost of query execution can double.
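The baseline behavior described above can be illustrated with a minimal sketch. It is not the engine itself: the names `SubPlan`, `make_sub_plans`, `evaluate` and `run_query`, the use of a sum as the only operator, and the injected `failures` set are all assumptions made for illustration. The sketch splits a query into sub-plans, evaluates them in parallel, and restarts the whole query whenever any evaluator fails, counting how many sub-plan executions were needed.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class SubPlan:
    """Hypothetical sub-plan: one operator (a sum) over one partition of rows."""
    plan_id: int
    rows: tuple


class EvaluatorFailure(Exception):
    """Raised when a simulated evaluator node fails while running a sub-plan."""


def make_sub_plans(rows, n_parts):
    """Simplified optimizer step: split the input into n_parts sub-plans."""
    return [SubPlan(i, tuple(rows[i::n_parts])) for i in range(n_parts)]


def evaluate(sub_plan, attempt, failures, log):
    """Simplified evaluator step: record the execution, then sum or fail.

    `failures` is an assumed set of (attempt, plan_id) pairs used to inject faults.
    """
    log.append((attempt, sub_plan.plan_id))
    if (attempt, sub_plan.plan_id) in failures:
        raise EvaluatorFailure(f"evaluator for sub-plan {sub_plan.plan_id} failed")
    return sum(sub_plan.rows)


def run_query(rows, n_parts, failures, max_attempts=3):
    """Baseline engine: any evaluator failure restarts the entire query.

    Returns (result, number of sub-plan executions across all attempts).
    """
    log = []
    sub_plans = make_sub_plans(rows, n_parts)
    for attempt in range(1, max_attempts + 1):
        with ThreadPoolExecutor(max_workers=n_parts) as pool:
            futures = [pool.submit(evaluate, sp, attempt, failures, log)
                       for sp in sub_plans]
            try:
                partials = [f.result() for f in futures]
            except EvaluatorFailure:
                continue  # discard all partial results and start over
        return sum(partials), len(log)
    raise RuntimeError("query failed on every attempt")


rows = list(range(1, 101))
print(run_query(rows, 4, failures=set()))     # (5050, 4)
print(run_query(rows, 4, failures={(1, 2)}))  # (5050, 8)
```

With four sub-plans and a single evaluator failure on the first attempt, the sketch performs eight sub-plan executions instead of four, which mirrors the doubling of cost noted above: the three sub-plans that succeeded on the first attempt are evaluated again even though their results were already available.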
In light of the above, there exists a need for a failure recovery protocol that delivers high performance and high availability, so that a failed task is recovered efficiently without the need to re-execute the entire query.