Aspects generally relate to the field of distributed storage, and, more particularly, to detection of storage cluster failures.
Whether maintaining customer data or their own data, businesses require always available or highly available data and protection of that data. To support these requirements, data often resides across multiple storage systems at multiple sites that are often separated by great distances. One reason for this separation is to prevent a single catastrophe from impacting data availability. Metrics used to define the availability requirements include recovery point objective (RPO) and recovery time objective (RTO). A business specifies an RTO as the maximum amount of time that the business tolerates lack of access to the business's data. A business specifies an RPO as the maximum amount of data, expressed in terms of time, that can be lost due to an interruption. For instance, a business can specify an RTO of 15 seconds. In other words, the business will accept at most 15 seconds from the time of a service interruption or failure to the time its data is again available. For an RPO, a business can specify five seconds. That means that the business will not accept losing any more than the data written (e.g., new writes, updates, etc.) in the five seconds that precede a failure or interruption.
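The RTO and RPO arithmetic described above can be illustrated with a minimal sketch. The names used here (Objectives, meets_objectives, and the timestamp parameters) are hypothetical and chosen only for illustration. The sketch assumes all times are expressed in seconds on a common clock and that the time of the last protected (e.g., replicated) write is known.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objectives:
    """Hypothetical container for a business's availability objectives."""
    rto_seconds: float  # maximum tolerated time without access to the data
    rpo_seconds: float  # maximum tolerated data loss, expressed as time


def meets_objectives(obj: Objectives,
                     failure_time: float,
                     last_protected_write_time: float,
                     restored_time: float) -> bool:
    """Return True if an interruption stayed within both the RTO and the RPO.

    All arguments are timestamps in seconds on the same clock (an assumption
    made for this sketch).
    """
    # Time from the failure until the data is available again.
    downtime = restored_time - failure_time
    # Writes made after the last protected write and before the failure are lost.
    data_loss_window = failure_time - last_protected_write_time
    return downtime <= obj.rto_seconds and data_loss_window <= obj.rpo_seconds


# Example using the figures from the text: RTO of 15 seconds, RPO of 5 seconds.
objectives = Objectives(rto_seconds=15.0, rpo_seconds=5.0)
within_limits = meets_objectives(
    objectives,
    failure_time=100.0,
    last_protected_write_time=97.0,  # 3 seconds of writes lost
    restored_time=112.0,             # 12 seconds of downtime
)
```

In this example, 12 seconds of downtime and 3 seconds of lost writes both fall within the stated objectives, so the interruption would be acceptable to the business.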