A distributed data warehouse includes a plurality of distributed database engines. The engines process database instructions against the tables to which they are assigned in order to satisfy queries and/or reports. The engines are often clustered together on one or more network nodes (processing devices). When an engine fails (for whatever reason), the tables, or portions of the tables, assigned to the failing engine have to be picked up by another one of the engines.
The engines are grouped into clusters. Currently, the tables associated with a given engine are spread out (balanced) over the remaining engines in that engine's cluster. So, when one engine is down, the failover processing ensures that the tables for the failing engine remain online and accessible for queries and/or reports, because copies of those tables are maintained for failover support on the remaining engines. However, when two or more engines go down within a single cluster, the system is taken down because access to the data of the tables cannot be guaranteed.
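The single-cluster behavior described above can be illustrated with a minimal sketch. The engine and table names, the round-robin balancing policy, and the functions `assign_fallbacks` and `tables_accessible` are hypothetical and chosen only for illustration; a real system may balance fallback copies differently.

```python
from itertools import cycle


def assign_fallbacks(primary_tables):
    """Map each table to (primary engine, fallback engine).

    Fallback copies of an engine's tables are spread round-robin over the
    other engines of the same cluster (an assumed balancing policy).
    """
    engines = list(primary_tables)
    if len(engines) < 2:
        raise ValueError("a cluster needs at least two engines for fallback")
    placement = {}
    for engine, tables in primary_tables.items():
        # Cycle over every engine in the cluster except the primary itself.
        others = cycle([e for e in engines if e != engine])
        for table in tables:
            placement[table] = (engine, next(others))
    return placement


def tables_accessible(placement, failed):
    """True if every table still has its primary or its fallback engine up."""
    return all(
        primary not in failed or fallback not in failed
        for primary, fallback in placement.values()
    )


# Hypothetical four-engine cluster, three tables per engine.
cluster = {
    "engine_a": ["a1", "a2", "a3"],
    "engine_b": ["b1", "b2", "b3"],
    "engine_c": ["c1", "c2", "c3"],
    "engine_d": ["d1", "d2", "d3"],
}
placement = assign_fallbacks(cluster)
print(tables_accessible(placement, {"engine_a"}))              # True
print(tables_accessible(placement, {"engine_a", "engine_b"}))  # False
```

With one engine down, every table is still reachable through its fallback copy. With two engines down, table `a1` has lost both its primary (`engine_a`) and its fallback (`engine_b`), which is why access can no longer be guaranteed.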
To reduce availability issues, a conventional approach has been to limit the size of the cluster to just two engines. This also improves performance because when any data is manipulated on a primary engine, the data has to be reflected on the fallback engine. When more than two engines are in a cluster, each primary engine buffers data according to the fallback engine for which the data is destined. So, when there are more than two engines in a cluster, the fallback engine receives data from multiple buffers from different primary engines within that cluster, and the fallback engine has to switch between them. This causes significant Central Processing Unit (CPU) overhead on the sender side (primary engine) and Input/Output (I/O) overhead on the receiving engine (fallback engine).
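The growth in buffers with cluster size can be sketched as follows. This is only an illustrative model: the class `PrimaryEngine`, the helpers `incoming_streams` and `build_cluster`, and the table naming are assumptions, and the sketch counts buffers and sender streams rather than measuring actual CPU or I/O cost.

```python
from collections import defaultdict


class PrimaryEngine:
    """Primary engine that buffers changed rows per destination fallback engine."""

    def __init__(self, name, fallback_of):
        self.name = name
        self.fallback_of = fallback_of      # table name -> fallback engine name
        self.buffers = defaultdict(list)    # fallback engine name -> pending rows

    def write(self, table, row):
        # Every change on the primary must be reflected on the table's fallback.
        destination = self.fallback_of[table]
        self.buffers[destination].append((table, row))


def incoming_streams(primaries):
    """For each fallback engine, the set of primaries sending it buffered data."""
    streams = defaultdict(set)
    for primary in primaries:
        for destination, pending in primary.buffers.items():
            if pending:
                streams[destination].add(primary.name)
    return dict(streams)


def build_cluster(names):
    """Hypothetical cluster: each engine keeps one table on every other engine."""
    primaries = []
    for name in names:
        fallback_of = {f"{name}_on_{other}": other for other in names if other != name}
        primary = PrimaryEngine(name, fallback_of)
        for table in fallback_of:
            primary.write(table, {"id": 1})
        primaries.append(primary)
    return primaries


pair = build_cluster(["a", "b"])
quad = build_cluster(["a", "b", "c", "d"])
# Buffers kept by engine "a", then streams arriving at one fallback engine.
print(len(pair[0].buffers), len(incoming_streams(pair)["b"]))  # 1 1
print(len(quad[0].buffers), len(incoming_streams(quad)["d"]))  # 3 3
```

In the two-engine cluster each primary maintains a single buffer and each fallback engine reads a single stream. In the four-engine cluster each primary maintains three buffers and each fallback engine has to switch among three incoming streams.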
However, limiting each cluster to just two engines forces re-clustering of the system whenever the system is expanded. For instance, when a single node clique (shared resource) is added to the system, the engines in the new clique cannot form a cluster by themselves, because doing so would cause downtime whenever the clique goes down.
Therefore, there is a need for improved fallback processing within a distributed data warehouse that is not restricted to just two processing engines in a single cluster.