Several industry-standard, commonly available solutions currently provide fault-tolerant access to storage devices. Typically, these protection schemes address hard disk drive failures, the most common scheme being Redundant Array of Independent Disks (RAID).
An Input/Output (I/O) controller in a computer system typically controls access to peripheral devices such as storage devices. To protect against a hardware failure of an I/O controller, a computer system may include two physically separate, independent servers, each having an I/O controller that may be connected to an array of storage devices such as disk drives. The servers are grouped at the operating system level (commonly referred to as clustering) and are coupled through a communication link, which may use the Ethernet protocol. The communication link allows one server to detect a failure in the other server, which may be due to a hardware failure. In the event of a hardware failure or another failure, for example a software error that results in an operating system crash, the other server detects the failure and takes the place of the failed server.
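The failure detection described above can be illustrated with a minimal sketch. It assumes a periodic "heartbeat" message and a fixed timeout, neither of which is specified in the text; the `PeerMonitor` class, its method names, and the numeric values are hypothetical and stand in for whatever mechanism a given clustering product uses over the communication link.

```python
class PeerMonitor:
    """Tracks the last sign of life received from the peer server (assumed heartbeat model)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s      # assumed: silence longer than this means failure
        self.last_heartbeat = None      # time of the most recent heartbeat, if any

    def record_heartbeat(self, now):
        """Called whenever a heartbeat arrives over the communication link."""
        self.last_heartbeat = now

    def peer_failed(self, now):
        """Report a failure once the peer has been silent for longer than the timeout."""
        if self.last_heartbeat is None:
            return False                # nothing received yet, so no basis for a decision
        return now - self.last_heartbeat > self.timeout_s


# Simulated timeline in seconds; no real network traffic is involved.
monitor = PeerMonitor(timeout_s=3.0)
monitor.record_heartbeat(now=0.0)
monitor.record_heartbeat(now=1.0)

alive_at_2s = not monitor.peer_failed(now=2.0)   # peer heard from 1 s ago
failed_at_5s = monitor.peer_failed(now=5.0)      # peer silent for 4 s

if failed_at_5s:
    takeover = "surviving server takes the place of the failed server"
```

A detector of this kind cannot tell a hardware failure from an operating system crash; in both cases the peer simply stops responding, which matches the text's treatment of the two cases as equivalent triggers.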
Hardware protection of an I/O controller may also be provided in a single storage enclosure, which may include two independent I/O controllers connected to an array of disk drives. The storage enclosure that includes the two I/O controllers is separate from the servers. There is a direct communications link between the I/O controllers, and if one I/O controller fails due to either a hardware error or a firmware error (firmware being software routines stored in read-only memory (ROM)), the surviving controller takes over. This is commonly referred to as failover and may be implemented in an active/active mode or an active/passive mode. In the active/active mode, both I/O controllers work independently on separate data sets until one I/O controller fails. After the one I/O controller fails, the surviving I/O controller manages both data sets. In the active/passive mode, one I/O controller operates in a monitoring-only capacity until the other I/O controller fails.
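The difference between the two failover modes can be sketched as a small model. The `Controller` class, the `fail_over` function, and the data set names are hypothetical and exist only for illustration; real controllers would also have to transfer cache contents and in-flight requests, which this sketch ignores.

```python
class Controller:
    """Minimal model of an I/O controller that owns zero or more data sets."""

    def __init__(self, name, data_sets=()):
        self.name = name
        self.healthy = True
        self.data_sets = set(data_sets)


def fail_over(failed, survivor):
    """Mark one controller as failed and hand its data sets to the survivor."""
    failed.healthy = False
    survivor.data_sets |= failed.data_sets
    failed.data_sets = set()
    return survivor


# Active/active: each controller works on its own data set until one fails.
ctrl_a = Controller("A", {"set1"})
ctrl_b = Controller("B", {"set2"})
fail_over(failed=ctrl_a, survivor=ctrl_b)
active_active_owned = sorted(ctrl_b.data_sets)     # survivor now manages both sets

# Active/passive: the standby only monitors and owns nothing until the failure.
primary = Controller("primary", {"set1", "set2"})
standby = Controller("standby")
standby_idle_before = len(standby.data_sets) == 0
fail_over(failed=primary, survivor=standby)
active_passive_owned = sorted(standby.data_sets)
```

In both modes the end state is the same, a single controller managing every data set; the modes differ only in how the work is divided before the failure.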
These configurations protect against failure of an I/O controller or a server and, when used in conjunction with RAID, may also provide protection against disk drive failures. However, RAID, clustering, and active/passive I/O controllers all fail to protect against silent data corruption due to soft errors that may occur within an I/O controller.
Soft errors involve changes to data and may be caused by random noise or signal integrity problems. Soft errors may occur in transmission lines, in logic, in magnetic storage, or in semiconductor storage. They may be due to cosmic events in which alpha particles cause random memory bits to change state from ‘0’ to ‘1’ or from ‘1’ to ‘0’, which may in turn result in an operating system crash. A soft error does not damage hardware; the only damage is to the data being processed. In many cases, an I/O controller cannot detect that a soft error has occurred while data is being processed within it. As such, silent data corruption can happen, resulting in incorrect data being written to storage devices. When silent data corruption occurs in a fault-tolerant RAID configuration, the value of RAID is negated, because the volume contains corrupt data.
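The way a soft error defeats RAID can be illustrated with a minimal sketch. It assumes XOR parity of the kind used by RAID 5 and a single bit flip that occurs in the controller's buffer before parity is computed; the function names, the three-block stripe, and the position of the flipped bit are hypothetical choices made for the example.

```python
from functools import reduce


def xor_parity(blocks):
    """Return the byte-wise XOR of equal-length blocks (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def flip_bit(block, byte_index, bit_index):
    """Return a copy of the block with one bit inverted, modelling a soft error."""
    corrupted = bytearray(block)
    corrupted[byte_index] ^= 1 << bit_index
    return bytes(corrupted)


def controller_write(blocks, soft_error=None):
    """Hypothetical write path: buffer the data, then compute parity over the buffer."""
    buffered = list(blocks)
    if soft_error is not None:
        block_no, byte_index, bit_index = soft_error
        # The flip happens inside the controller, before parity is generated.
        buffered[block_no] = flip_bit(buffered[block_no], byte_index, bit_index)
    return buffered, xor_parity(buffered)


host_data = [b"AAAA", b"BBBB", b"CCCC"]
stored, parity = controller_write(host_data, soft_error=(1, 0, 0))

# The stripe passes a parity check even though block 1 no longer matches the host data.
stripe_consistent = xor_parity(stored) == parity
data_matches_host = stored[1] == host_data[1]

# Rebuilding block 1 after a simulated drive loss faithfully restores the corrupt block.
rebuilt = xor_parity([stored[0], stored[2], parity])
```

Because parity is computed over the already corrupted buffer, the stripe is internally consistent: a parity check reports no problem, and a rebuild reproduces the wrong data. This is the sense in which the corruption is silent.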
Although the following Detailed Description will proceed with reference being made to illustrative embodiments of the claimed subject matter, many alternatives, modifications, and variations thereof will be apparent to those skilled in the art. Accordingly, it is intended that the claimed subject matter be viewed broadly, and be defined only as set forth in the accompanying claims.