Conventionally, upon detecting an error relating to the existence of uncorrectable data within a clustered virtual environment (e.g., data pending submission to an I/O component or operation as part of a write request, a data transmission, etc.), the process and/or virtual machine, server, network component, etc. from which the uncorrectable data originates is subjected to a stop operation to prevent propagation of the uncorrectable data outside the source.
In order to maximize system availability and maintain service via the clustered virtual environment, some existing techniques deploy a secondary, redundant component (e.g., a secondary server) that mirrors processes being performed by the primary component. Upon detecting an error in a pending process on the primary component, the mirrored process on the secondary component may be utilized instead of the process on the primary component, thus maintaining overall system performance despite the uncorrectable error.
The foregoing techniques are effective to address detectable errors. However, other errors are known to arise in an undetectable manner, and these errors may propagate (often to great depth) throughout environments with which the source is in communication. This is known as silent data corruption, and may be caused by a number of problems, such as loose cabling, unreliable power supplies, external vibrations, cosmic radiation (and other sources of soft memory errors), and errors introduced by the network environment. Most commonly, silent data corruption occurs as a result of a “soft error,” in which an alpha particle or cosmic ray interacts with a bit, causing the bit to flip state in a manner undetectable by the system.
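By way of illustration only, the following minimal sketch shows how a single flipped bit can leave data well-formed yet wrong. The payload, the helper `flip_bit`, and the use of a CRC-32 checksum computed before the corruption are assumptions introduced for this example; they are not drawn from any particular system or from the techniques described above.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with exactly one bit inverted (hypothetical helper)."""
    byte_index, offset = divmod(bit_index, 8)
    corrupted = bytearray(data)
    corrupted[byte_index] ^= 1 << offset
    return bytes(corrupted)


# Hypothetical payload pending a write, with a checksum taken while it was intact.
original = b"balance=1000"
stored_checksum = zlib.crc32(original)

# Simulate a soft error: invert the lowest bit of byte 8 ('1' is 0x31, '0' is 0x30).
corrupted = flip_bit(original, 64)

# The corrupted payload has the same length and still decodes as ASCII,
# so nothing here raises an error on its own.
print(corrupted.decode("ascii"))  # balance=0000

# Only a comparison against independently held information exposes the change.
print(zlib.crc32(corrupted) == stored_checksum)  # False
```

In this sketch the corrupted value is indistinguishable from legitimate data unless it is compared against information recorded before the flip, which is why such an error can pass through components that check only for malformed input.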
Silent data corruption may result in cascading failures, in which the system runs for a period of time with an undetected initial error that causes progressively more problems until the corruption is ultimately detected. For example, a failure affecting file system metadata can result in files being partially damaged or made completely inaccessible as the file system continues to be used in its corrupted state.
Accordingly, it would be beneficial to provide systems, methods, computer program products, and the like that prevent propagation of errors caused by silent data corruption within a clustered virtual environment.