This invention relates generally to health monitoring of clusters of distributed server systems, and more particularly to monitoring and reporting on the health of components of a shared nothing distributed database cluster.
Large distributed processing systems, such as shared nothing databases comprising a cluster of many (e.g., thousands of) database servers and storage units, are employed by enterprises for storing critical data and for executing applications such as real-time transaction processing. Because system failures are costly, such systems must have high availability. Accordingly, health monitoring systems and processes must quickly detect and report hardware, software and database faults and warnings (alerts) so that they can be promptly addressed to minimize system downtime.
Moreover, it is of utmost importance that health monitoring systems and processes avoid, or at least minimize, false negatives, i.e., failing to report an alert when a fault occurs. Failing to send an alert when something is broken is worse than reporting that something is failing when it is not. For example, if a disk fails and a RAID-5 array goes into a degraded mode, a notification must be sent as soon as possible so that the failed disk can be replaced. Otherwise, there is a risk that the array could lose a second disk, which RAID-5 cannot tolerate, and the system could then go completely down. Therefore, it is important to report an alert promptly when a problem occurs.
To address this, some existing approaches employ duplicate, redundant hosts, one of which actively manages the database and the other of which serves as a backup. Health monitoring processes can run on each of the hosts. When the primary host fails, the monitoring process can be manually switched over to the backup host, which is costly and slow. In other approaches, both hosts simultaneously run health monitoring processes without communicating with each other. While this has the advantage of continuing health monitoring should one host fail, and may minimize false negatives, it has the disadvantages of requiring additional redundant hardware and of generating duplicated alerts, which are costly to process.
It is desirable to avoid or minimize duplicated alerts since they create noise and confusion in the backend alert processing, cause additional load, and are costly to process. For instance, a large customer support center may receive alerts from thousands of customers, requiring a server farm just to handle the incoming load. If the number of alerts were doubled, the size of the server farm would have to be increased accordingly to handle the increased traffic. Furthermore, more sophisticated logic is required to deal with duplicated alerts in order to avoid generating duplicate support tickets each time an alert is received.
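By way of illustration only, the kind of additional backend logic referred to above might resemble the following minimal sketch. It is not taken from any known system. The names `Alert`, `TicketDeduplicator`, the field names, and the five-minute suppression window are all hypothetical and are assumed here solely to show how duplicate alerts from two uncoordinated monitoring hosts could be collapsed into one support ticket.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # All field names are hypothetical, chosen for illustration.
    customer: str
    component: str
    fault: str
    timestamp: float  # seconds since some epoch
    reporter: str     # monitoring host that sent the alert


class TicketDeduplicator:
    """Opens at most one ticket per (customer, component, fault) per window."""

    def __init__(self, window_seconds: float = 300.0):
        # The window length is an assumption, not a value from the text.
        self.window_seconds = window_seconds
        self._last_ticket_time = {}  # key -> timestamp of the last ticket opened
        self.tickets = []

    def receive(self, alert: Alert) -> bool:
        """Return True if a new ticket was opened, False if suppressed."""
        key = (alert.customer, alert.component, alert.fault)
        last = self._last_ticket_time.get(key)
        if last is not None and alert.timestamp - last < self.window_seconds:
            return False  # same fault already ticketed recently: duplicate
        self._last_ticket_time[key] = alert.timestamp
        self.tickets.append(alert)
        return True


if __name__ == "__main__":
    dedup = TicketDeduplicator()
    # Two hosts independently report the same disk failure seconds apart.
    first = Alert("acme", "disk-7", "DISK_FAILED", 1000.0, "host-a")
    second = Alert("acme", "disk-7", "DISK_FAILED", 1002.0, "host-b")
    print(dedup.receive(first), dedup.receive(second), len(dedup.tickets))
```

Even this simple sketch must keep state for every outstanding fault and must still receive and examine each duplicate, so it reduces duplicate tickets without reducing the incoming alert traffic itself.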
It is desirable to provide systems and methods that address these and other problems of known approaches to monitoring the health of systems, and it is to these ends that the invention is directed.