Society has become extremely dependent upon computers. In today's world, computers are used for everything from financial planning to company payroll systems to aircraft guidance systems. Because of the widespread use of computer systems, data corruption is a problem that can affect almost any individual and an issue that continues to plague both the computer hardware and computer software industries.
For example, software applications, such as database applications, are extremely dependent upon maintaining the integrity of their data. If the data associated with a database application is corrupted, users may experience incorrect results and possibly system crashes.
Data corruption may have a variety of causes and may originate from a variety of different sources. For example, a software “bug” in a database application may itself cause invalid data, such as a negative social security number or an invalid pointer address, to be stored in a table or data structure. In addition, other programs executing on the same computer system, including the operating system itself, may inadvertently overwrite certain variables, tables, data structures, or other similar types of information, thus corrupting the data that is associated with a particular software application. Still further, when an application writes a block of data to a storage medium, the data typically travels through many intermediate layers of software and hardware before it is actually stored on the storage medium. Hence, there is a further potential for the data block to become corrupted before, or at the time, it is written to the storage medium.
For example, when writing a data block to disk, the data may travel from the software application to a volume manager, from the volume manager to a device driver, from the device driver to a storage device controller, and from the storage device controller to a disk array before being stored onto disk. When the data block is later read from the disk, the data must again travel through the same set of software and hardware layers before it can be used by the software application. Thus, a bug at any of these layers may potentially corrupt the data. Additionally, if the disk is unstable and introduces errors into the data after it is written to disk, the integrity of the data may be compromised even if the other layers do not erroneously alter the data.
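The layered write path described above can be illustrated with a minimal sketch. It is not drawn from any particular system: the layer functions, the `send_through` helper, and the single-bit fault in the device driver are all hypothetical, chosen only to show how one faulty layer can alter a block while every other layer behaves correctly.

```python
from typing import Callable, List

# A layer is modeled as a function that receives a block and passes it on.
Layer = Callable[[bytes], bytes]


def passthrough(block: bytes) -> bytes:
    """A well-behaved layer: hands the block on unchanged."""
    return block


def buggy_driver(block: bytes) -> bytes:
    """Hypothetical faulty device driver: flips the lowest bit of the first byte."""
    if not block:
        return block
    return bytes([block[0] ^ 0x01]) + block[1:]


def send_through(layers: List[Layer], block: bytes) -> bytes:
    """Pass the block through each layer in order, as on the way to disk."""
    for layer in layers:
        block = layer(block)
    return block


original = b"balance=100"

# Volume manager, device driver, storage controller, disk array (all healthy).
healthy_path = [passthrough, passthrough, passthrough, passthrough]
# Same path, but the device driver layer contains a bug.
faulty_path = [passthrough, buggy_driver, passthrough, passthrough]

stored_ok = send_through(healthy_path, original)
stored_bad = send_through(faulty_path, original)

print(stored_ok == original)   # the healthy path preserves the block
print(stored_bad == original)  # the faulty path does not
```

Note that the faulty path returns a block of the same length as the original, so the application receives no indication that the stored data differs from what it wrote.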
When an I/O subsystem reports that a write operation has been completed even though the I/O subsystem has actually failed to write data to an I/O device, a “lost write” has occurred. A lost write may occur even if the I/O subsystem eventually succeeds in writing data to the I/O device, if the actual writing is delayed long enough for an application to read data before the actual writing occurs.
Lost writes may lead to data corruption. Since the write operation does not actually occur (or occurs too late), applications may obtain stale, incorrect data when they later read, from the I/O device, data that should have been overwritten and updated but was not. If the applications then use that data to generate or determine additional data, that additional data might also be incorrect. Should the applications then write the incorrect additional data to the I/O device, they unwittingly propagate the corruption. In this manner, data corruption spreads over the I/O device like a disease, compounding over time and causing errors in any data derived from the stale data.
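As an illustration only, the following sketch simulates a lost write and the propagation described above. The `LossyDevice` class, its `drop_next_write` flag, and the balance and interest figures are hypothetical and do not model any real I/O subsystem; the point is simply that the write is acknowledged, the old value remains on the device, and a value derived from it is written back.

```python
class LossyDevice:
    """Hypothetical I/O device that acknowledges every write but may drop one."""

    def __init__(self):
        self.blocks = {}
        self.drop_next_write = False  # test hook used to trigger a lost write

    def write(self, block_id, value):
        if self.drop_next_write:
            self.drop_next_write = False
            return True  # success is reported, yet nothing is stored
        self.blocks[block_id] = value
        return True

    def read(self, block_id):
        return self.blocks[block_id]


device = LossyDevice()
device.write("balance", 100)

# The application updates the balance to 40, but the write is lost.
device.drop_next_write = True
acknowledged = device.write("balance", 40)

# A later read returns the stale value that should have been overwritten.
stale_balance = device.read("balance")

# The application derives additional data from the stale value and stores it,
# unwittingly propagating the corruption to a second block.
interest = stale_balance // 10
device.write("interest", interest)

print(acknowledged, stale_balance, device.read("interest"))
```

Had the update reached the device, the derived interest would have been computed from 40 rather than 100; instead, two blocks now hold incorrect data even though every write was reported as successful.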