1. Field of the Invention
This invention relates to computer systems, and more particularly to error detection in computer systems.
2. Description of the Related Art
With the growing deployment of computer systems and software, applications often operate in distributed, heterogeneous environments. Processing in a complex application may be partitioned across network segments, CPU clusters and storage locations. Furthermore, computer systems in distributed, heterogeneous environments may include many different components that impact overall reliability and/or availability of the systems. For example, one or more of the computer systems may have storage device arrays that must operate correctly in order for applications and/or other components to operate correctly. In addition, the computer systems may rely upon network adapters to provide a connection to the storage device arrays. The network adapters must operate correctly in order for applications relying on the connection to the storage device arrays to operate correctly. Thus, there may be numerous types of components operating in computer systems that impact the overall reliability and/or availability of the systems.
The increasing complexity of software and the increasing degree of dependence on computer systems have led to the adoption of various techniques to obtain maximum reliability and availability. Unfortunately, unreliable components operating in a computer system may directly impact the availability of the system. The reliability of an individual component refers to how likely the component is to remain working without encountering a failure, typically measured over some period of time. The reliability of a computer system is a function of the reliability of the components in the system. In general, the more components that operate in the computer system, the worse the reliability of the system as a whole. The reliability of many components, such as network adapters and storage device arrays, is often expressed in terms of mean time between failures (MTBF).
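The relationship between component count and system reliability can be illustrated with a minimal sketch. It rests on assumptions that are not part of the description above: components fail independently, each has a constant failure rate equal to the reciprocal of its MTBF (an exponential lifetime model), and the system works only while every component works (a series arrangement). The function names are hypothetical and chosen for illustration only.

```python
import math


def component_reliability(mtbf_hours, mission_hours):
    """Probability that one component survives mission_hours.

    Assumes an exponential lifetime model (constant failure rate 1/MTBF).
    """
    return math.exp(-mission_hours / mtbf_hours)


def series_reliability(mtbf_hours_list, mission_hours):
    """Probability that every component survives mission_hours.

    Assumes independent failures and that the system needs all components.
    """
    reliability = 1.0
    for mtbf_hours in mtbf_hours_list:
        reliability *= component_reliability(mtbf_hours, mission_hours)
    return reliability


def series_mtbf(mtbf_hours_list):
    """MTBF of the whole series system under the same assumptions.

    Failure rates add, so the system MTBF is the reciprocal of their sum.
    """
    return 1.0 / sum(1.0 / mtbf_hours for mtbf_hours in mtbf_hours_list)
```

Under these assumptions, four components that each have an MTBF of 100,000 hours yield a system MTBF of 25,000 hours, which shows why adding components tends to lower the reliability of the system as a whole.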
Numerous problems may arise while implementing solutions to increase the reliability and/or availability of individual components, and thus, the reliability and/or availability of computer systems as a whole. For example, although storage device arrays may be used to increase availability of a system by storing redundant data, the ability to reconstruct lost data may depend on how many failures have already occurred. A redundant array of inexpensive disks (RAID) component in a computer system may only be able to tolerate a single disk failure. Once a single disk has failed, such a system is said to be operating in a degraded mode. If additional disks fail before the lost data on the failed disk has been reconstructed, it may no longer be possible to reconstruct any lost data. The longer a storage device array operates in a degraded mode, the more likely it is that an additional failure will occur. As a result, operating in a degraded mode decreases the reliability of a storage device array and may actually cause data loss before a problem with the component is identified.
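The effect of time spent in a degraded mode can be sketched numerically. The following is an illustration only, assuming that the surviving disks fail independently with a constant failure rate of 1/MTBF each; the function name and parameters are hypothetical.

```python
import math


def additional_failure_probability(surviving_disks, disk_mtbf_hours, degraded_hours):
    """Probability that at least one surviving disk fails while degraded.

    Assumes independent disks with exponential lifetimes, so the combined
    failure rate is surviving_disks / disk_mtbf_hours.
    """
    combined_rate = surviving_disks / disk_mtbf_hours
    return 1.0 - math.exp(-combined_rate * degraded_hours)
```

Under this model the probability grows monotonically with `degraded_hours`, which matches the observation that a longer period in a degraded mode makes an additional failure more likely.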
Another potential problem that may affect the reliability and/or availability of a component is that errors other than total failures may occur. Like total failures, these errors may cause data vulnerability or data loss. For example, disk drives may occasionally corrupt data. Data corruption may occur for different reasons. For example, bugs in a disk drive controller's firmware may cause bits in a sector to be modified or may cause blocks to be written to the wrong address. Such bugs may cause disk drives to write the wrong data, to write the correct data to the wrong place, or to not write any data at all. Another source of errors with a component in a computer system may be a drive's write cache. Many disk drives use write caches to quickly accept write requests so that a host computer or array controller can continue with other commands. The data is later copied from the write cache to the disk drive. However, write cache errors may cause some acknowledged writes to never reach the disk drive. The end result of such bugs or errors is that the data at a given block may be corrupted or stale. These types of errors may be “latent” because the disk drive may not realize that it has erred. If left latent, such errors may have detrimental consequences such as undetected long-term data corruption.
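The write cache scenario can be made concrete with a toy model. This is a sketch under stated assumptions, not a model of any real drive: `ToyDisk`, its methods, and the `lost_blocks` fault-injection set are all hypothetical names introduced for illustration.

```python
class ToyDisk:
    """Toy disk with a write cache that may silently lose flushed blocks."""

    def __init__(self):
        self.media = {}           # block number -> data persisted on the platter
        self.cache = {}           # block number -> data acknowledged but not flushed
        self.lost_blocks = set()  # hypothetical fault injection: blocks dropped on flush

    def write(self, block, data):
        """Accept the write into the cache and acknowledge it immediately."""
        self.cache[block] = data
        return True

    def flush(self):
        """Copy cached writes to the media, silently dropping faulty ones."""
        for block, data in self.cache.items():
            if block in self.lost_blocks:
                continue  # simulated cache error: no exception, no status change
            self.media[block] = data
        self.cache.clear()

    def read(self, block):
        """Return cached data if present, otherwise whatever the media holds."""
        if block in self.cache:
            return self.cache[block]
        return self.media.get(block)


disk = ToyDisk()
disk.write(7, b"v1")
disk.flush()                       # b"v1" reaches the media
disk.lost_blocks.add(7)            # inject the fault for the next flush
acknowledged = disk.write(7, b"v2")
disk.flush()                       # the acknowledged b"v2" never reaches the media
stale = disk.read(7)               # the read succeeds but returns the old b"v1"
```

In this sketch the second write is acknowledged and every call returns normally, yet block 7 still holds the earlier data. Nothing in the interface signals the problem, which is what makes such an error latent.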
Latent errors may be present within many different types of components. A component may not show or present signs of a problem even though the component is operating in a degraded state. In fact, the computer system may be completely unaware of the existence of a latent error. For example, one network card within a complex distributed network may fail and the failure may not become known until the computer system attempts to use the network card.