1. The Field of the Invention
This invention relates to computer systems and, more particularly, to novel systems and methods for preventing data corruption due to time-gap defects in computer systems.
2. The Background Art
Computers are now used to perform functions and maintain data critical to many organizations. Businesses use computers to maintain essential financial and other business data. Governments also use computers to monitor, regulate, and even activate national defense systems. Maintaining the integrity of the stored data is essential to the proper functioning of these computer systems, and data corruption can have serious (even life-threatening) consequences.
Computers store information in the form of numerical values, or data. Information represented as data may take many forms including a letter or character in an electronic document, a bank account number, an instruction executable by a processor, operational values used by software, or the like. Data may be stored permanently in long-term memory devices or may be stored temporarily, such as in a random access memory. Data may flow between devices, over networks, through the Internet, be transmitted wirelessly, and the like.
Data may be changed or overwritten in many cases, such as when an account balance or date is automatically updated. However, computer users expect a computer system not to make inadvertent or incorrect changes to data that compromise its integrity. When such inadvertent or erroneous changes do occur, the data is corrupted. The causes of data corruption may be numerous, including electronic noise, defects in physical hardware, hardware design errors, and software design errors.
Hardware design flaws may result from oversights or inaccuracies in specifying timing, function, or requirements for interfacing with other hardware in a circuit or computer system. Computer system hardware designers may build a certain amount of design margin into a system to allow for voltages to settle, signal rise and fall times, and the like. Specifications usually provide margins and limits. If insufficient design margin is provided or timing errors cause signals to be read at incorrect times, data corruption may result. Thus, even when data may be stored correctly in memory devices or calculations are performed correctly by a processor, data may be corrupted when transferred between hardware devices due to timing inconsistencies or insufficient design margin.
Different approaches may be used to reduce or eliminate data corruption. One approach may be to prevent data corruption from happening in the first place. This may be accomplished, in part, by improving the quality and design of hardware and software systems. Data is transmitted and manipulated by myriad different hardware components in a computer system including buses, controllers, processors, memory devices, input and output devices, cables and wires, and the like. Software may contain glitches or logical flaws. Each one of these hardware components or software applications is a possible source of data corruption.
Another approach is to build error detecting and correcting capabilities into the hardware and software systems. Error detection and correction techniques, such as parity checking, redundant systems, and validity checking, can help to detect and correct data corruption.
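By way of illustration only, parity checking may be sketched as follows. The function names and the sample byte are hypothetical and are not taken from any particular hardware. The sketch shows that a single even-parity bit exposes a one-bit error but cannot correct it, and that a two-bit error passes the check unnoticed.

```python
def even_parity_bit(byte):
    """Return the bit that makes the total count of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def is_valid(byte, parity):
    """Check a received byte against the parity bit sent with it."""
    return even_parity_bit(byte) == parity


# Hypothetical sample byte, chosen only for illustration.
data = 0b10110010
parity = even_parity_bit(data)

single_flip = data ^ 0b00000100  # one bit corrupted in transit
double_flip = data ^ 0b00000110  # two bits corrupted in transit

print(is_valid(data, parity))         # intact byte passes the check
print(is_valid(single_flip, parity))  # one-bit error is detected
print(is_valid(double_flip, parity))  # two-bit error slips through
```

Because parity alone only signals that an error occurred, it is typically paired with retransmission or redundancy when correction is also required.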
In certain hardware systems, time gaps may exist in which erroneous data transfers between devices may occur, yet remain undetected by the hardware involved. Specifications for controllers or other devices in a computer system may have very rigorous time requirements stating when error processing can, and cannot, detect and report an error. Such a requirement may be expressed not as a single absolute time, but as an absolute time plus or minus a tolerance, where the tolerance value may be very small. This value may determine time gaps where errors may go undetected by a device. Detecting these time gaps in hardware systems may be critical in order to identify possible sources of data corruption due to faulty hardware design.
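The relationship between a nominal time, its tolerance, and the resulting time gap may be sketched as follows. This is a simplified model under stated assumptions, not a description of any actual controller: it assumes error checking stops at a nominal cutoff that may drift by the tolerance in either direction, and all names and nanosecond values are hypothetical.

```python
def classify_error(event_ns, detect_close_ns, tolerance_ns):
    """Classify an error that arises at event_ns during a transfer.

    detect_close_ns is the nominal time at which the (hypothetical)
    controller stops checking for errors; the real cutoff may fall
    anywhere within +/- tolerance_ns of that nominal time.
    """
    earliest_close = detect_close_ns - tolerance_ns
    latest_close = detect_close_ns + tolerance_ns
    if event_ns <= earliest_close:
        return "detected"    # checking is certainly still active
    if event_ns <= latest_close:
        return "uncertain"   # depends on where the cutoff actually fell
    return "undetected"      # checking has certainly stopped


def worst_case_gap(transfer_end_ns, detect_close_ns, tolerance_ns):
    """Return (start, end) of the worst-case unchecked window, or None."""
    earliest_close = detect_close_ns - tolerance_ns
    if earliest_close >= transfer_end_ns:
        return None          # checking outlasts the transfer: no gap
    return (earliest_close, transfer_end_ns)


# Assumed values for illustration only.
TRANSFER_END_NS = 1000
DETECT_CLOSE_NS = 980
TOLERANCE_NS = 10

gap = worst_case_gap(TRANSFER_END_NS, DETECT_CLOSE_NS, TOLERANCE_NS)
labels = [classify_error(t, DETECT_CLOSE_NS, TOLERANCE_NS)
          for t in (960, 985, 995)]
print(gap, labels)
```

In this model, even a small tolerance leaves a window near the end of the transfer in which an error is either missed outright or detected only some of the time.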
In some cases, occurrences of data corruption may be exacerbated by the arbitration that occurs between devices in a computer system. That is, because of the increase in handshaking, exchanges, and buffering that occurs between devices in a computer system, conditions may exist wherein errors may be incurred, yet remain undetected by the computer system. For example, clock speeds continue to increase in computer systems. In addition, expansion buses and ports, which may use different clock speeds, are being added to facilitate the use of new input and output devices.
As a result, a computer system may increase in complexity due to increases in arbitration needed to pass information between the buses, ports, devices, bridges, and the like. Additionally, computer designers may design a computer system to be backward compatible with older and slower devices, but may provide insufficient error correction support for these devices in order not to slow overall system performance. One problem may be that a CPU actually sends or requests data before a controller can instruct it not to do so. The result is that data may be lost, and in some cases the loss may go undetected by the hardware involved in the data transfer. These types of problems may increase in frequency and number as newer and faster devices are interfaced to older legacy controllers and devices.
Input and output controllers within a computer system are responsible for arbitrating data exchanges between asynchronous devices, such as a CPU, and synchronous input or output devices, such as hard drives, floppy drives, CD-ROMs, and the like. Controllers dedicated to correctly effectuating these exchanges increase the efficiency of a computer system by reducing the amount of time and resources that devices such as a CPU would otherwise have to dedicate. Since a CPU may output data in bursts, as compared to an input or output device which may read or write information at consistent intervals, such as to rotating media, buffers may be used by the input or output controllers to temporarily store data.
Buffer underruns and overruns may occur when data is not provided to or read from a buffer quickly enough and may incur errors in a data transfer. As a result, errant values may be incorrectly read from an empty buffer or data may be lost when the buffer is overrun. Buffer underrun or overrun flags may be set to interrupt the devices involved in such a situation so that error correction capabilities may be invoked. However, if time-gap defects exist between I/O controllers and other devices in a computer system, interrupts may not arrive within the time frame needed for a proper response. Thus, proper timing of error detection and correction processes is critical to avoid data corruption and ensure that devices function properly.
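The overrun and underrun conditions described above may be illustrated with the following minimal sketch. The class name, its capacity, and its flag attributes are hypothetical stand-ins for a controller's buffer and status flags, and the sketch models only the setting of the flags, not interrupt delivery or its timing.

```python
from collections import deque


class ControllerBuffer:
    """Bounded FIFO that records overrun and underrun conditions."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._fifo = deque()
        self.overrun = False   # set when a write finds the buffer full
        self.underrun = False  # set when a read finds the buffer empty

    def write(self, value):
        """Store a value; return False and flag an overrun if full."""
        if len(self._fifo) >= self.capacity:
            self.overrun = True
            return False       # the value is dropped, i.e. data is lost
        self._fifo.append(value)
        return True

    def read(self):
        """Return the oldest value; flag an underrun if empty."""
        if not self._fifo:
            self.underrun = True
            return None        # nothing valid to hand to the device
        return self._fifo.popleft()


buf = ControllerBuffer(capacity=2)

# A burst of three writes into a two-entry buffer overruns it.
accepted = [buf.write(v) for v in (1, 2, 3)]

# Three reads against two stored values underrun it.
drained = [buf.read() for _ in range(3)]

print(accepted, drained, buf.overrun, buf.underrun)
```

In this sketch the flags are raised at the moment the condition occurs. The concern described above is that, in real hardware, the corresponding interrupt may reach the other device too late for the error to be handled.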