The present disclosure relates to a method of testing a computer node and to the computer node itself.
In clustered computer systems comprising a network of computer nodes, it is known that upgrades can fail because the system drive in a clustered node fails at the point at which the node is rebooted to perform its part of the upgrade procedure. A system drive is the drive that contains the operating system for the node and does not contain stored data. This weakness is difficult for a computer system administrator to spot in advance because a drive can get into a state in which it will fail the next time the system is rebooted, yet display no degraded function during normal running. The computer node can then remain in this “faulty” state for any length of time, only exhibiting the failure at the point at which a reboot is performed. An upgrade is very often the only time a node is actually rebooted, and this is a particularly bad time for a system drive failure to occur, because it leaves the clustered computer system exposed.
Nodes may be designed with dual redundant hard drives to help counter this problem, but this is only a complete solution if the chance of both drives failing on the same reboot is significantly lower than the chance of one drive failing on a single reboot. Theoretically, the chance of both drives failing on the same reboot = (chance of a single drive failing on the next reboot) * (chance of a single drive failing on the next reboot). However, the chance of both drives failing on the same reboot in fact depends much more on how the expected length of time before a drive goes into the faulty state, in which it will fail at the next reboot, compares with how often a server is rebooted.
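The dependence on reboot frequency can be illustrated with a minimal sketch. It is not part of the disclosure and rests on assumptions chosen purely for illustration: each drive enters the faulty state independently, the time until it does so is exponentially distributed with a given mean, and both drives were healthy at the previous reboot. The function and parameter names are hypothetical.

```python
import math


def p_latent_fault(reboot_interval_days, mean_days_to_fault):
    """Chance that one drive has entered the faulty state by the next reboot.

    Assumption (illustrative only): the time to enter the faulty state is
    exponentially distributed with mean `mean_days_to_fault`.
    """
    return 1.0 - math.exp(-reboot_interval_days / mean_days_to_fault)


def p_both_fail(reboot_interval_days, mean_days_to_fault):
    """Chance that both redundant drives fail on the same reboot.

    Assumption (illustrative only): the two drives fail independently, so
    the single-drive chance is multiplied by itself.
    """
    p = p_latent_fault(reboot_interval_days, mean_days_to_fault)
    return p * p


if __name__ == "__main__":
    # Hypothetical figures: mean time to the faulty state of 1000 days,
    # compared across frequent and infrequent reboot schedules.
    for interval in (30, 365, 730):
        single = p_latent_fault(interval, 1000)
        both = p_both_fail(interval, 1000)
        print(f"reboot every {interval} days: single={single:.4f}, both={both:.4f}")
```

Under these assumptions, the product of the two single-drive chances stays small only while the reboot interval is short relative to the expected time before a drive enters the faulty state. As the interval approaches or exceeds that time, the chance that both drives are already faulty at the same reboot grows toward the single-drive chance.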