1. The Field of the Invention
This invention relates to computer systems and, more particularly, to novel systems and methods for detecting errors in data exchanged between devices in a computer system, where an undetected data error may persist.
2. The Background Art
Computers are now used to perform functions and maintain data critical to many organizations. Businesses use computers to maintain essential financial and other business data. Computers are also used by government to monitor, regulate, and even activate national defense systems. Maintaining the integrity of the stored data is essential to the proper functioning of these computer systems, and data corruption can have serious (even life-threatening) consequences.
Computers store information in the form of numerical values, or data. Information represented as data may take many forms including a letter or character in an electronic document, a bank account number, an instruction executable by a processor, operational values used by software, or the like. Data may be stored permanently in long-term memory devices or may be stored temporarily, such as in a random access memory. Data may flow between devices, over networks, through the Internet, be transmitted wirelessly, and the like.
Data may be changed or overwritten in many cases, such as when an account balance or date is automatically updated. However, computer users expect a computer system not to make inadvertent or incorrect changes to data that compromise its integrity. When such inadvertent or erroneous changes do occur, the data is corrupted. The causes of data corruption are numerous and include electronic noise, defects in physical hardware, hardware design errors, and software design errors.
Hardware design flaws may result from oversights or inaccuracies in specifying timing, function, or requirements for interfacing with other hardware in a circuit or computer system. Computer system hardware designers may build a certain amount of design margin into a system to allow for voltages to settle, signal rise and fall times, and the like. Specifications usually provide margins and limits. If insufficient design margin is provided or timing errors cause signals to be read at incorrect times, data corruption may result. Thus, even when data may be stored correctly in memory devices or calculations are performed correctly by a processor, data may be corrupted when transferred between hardware devices due to timing inconsistencies or insufficient design margin.
Different approaches may be used to reduce or eliminate data corruption. One approach may be to prevent data corruption from happening in the first place. This may be accomplished, in part, by improving the quality and design of hardware and software systems. Data is transmitted and manipulated by myriad different hardware components in a computer system including buses, controllers, processors, memory devices, input and output devices, cables and wires, and the like. Software may contain glitches or logical flaws. Each one of these hardware components or software applications is a possible candidate for incurring data corruption.
Another approach is to build error detecting and correcting capabilities into the hardware and software systems. Techniques such as parity checking, redundant systems, and validity checking can help to detect and, in some cases, correct data corruption.
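By way of illustration only, parity checking may be sketched as follows. The sketch assumes an 8-bit data value carried with a single even-parity bit in bit position 8; the function names are hypothetical and are not drawn from any particular device. It shows that a single flipped bit is detected, while two flipped bits cancel out and pass the check.

```python
def even_parity_bit(byte: int) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    return bin(byte & 0xFF).count("1") % 2


def encode_with_parity(byte: int) -> int:
    """Pack 8 data bits plus one parity bit (bit 8) into a 9-bit word."""
    return (byte & 0xFF) | (even_parity_bit(byte) << 8)


def parity_ok(word: int) -> bool:
    """A 9-bit word is accepted when it holds an even number of 1 bits."""
    return bin(word & 0x1FF).count("1") % 2 == 0


word = encode_with_parity(0x5A)
single_bit_error = word ^ 0b0000_0100   # one bit flipped in transit
double_bit_error = word ^ 0b0000_0110   # two bits flipped in transit

print(parity_ok(word))              # True: clean transfer accepted
print(parity_ok(single_bit_error))  # False: error detected
print(parity_ok(double_bit_error))  # True: error goes undetected
```

A single parity bit can only detect an odd number of flipped bits and cannot identify which bit changed, so it detects corruption without correcting it.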
In certain hardware systems, time-gaps may exist in which erroneous data transfers between devices may occur, yet remain undetected by the hardware involved. Specifications for controllers or other devices in a computer system may impose very rigorous timing requirements that determine when error processing can actually detect and report an error. Such a requirement may not be stated as a single absolute time, but rather as an absolute time plus or minus a tolerance, where the tolerance value may be very small. This tolerance may determine the time-gaps in which errors may go undetected by a device. Detecting these time-gaps in hardware systems may be critical in order to identify possible sources of data corruption due to faulty hardware design.
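The notion of a time-gap may be sketched as an interval-coverage check. The model below is an assumption made for illustration: each device is taken to detect errors only during a known window, expressed as integer nanoseconds, and the function name find_time_gaps is hypothetical. Any portion of the transfer interval not covered by some window is reported as a gap in which an error could go undetected.

```python
def find_time_gaps(start, end, windows):
    """Return the sub-intervals of [start, end) covered by no detection window.

    windows is a list of (window_start, window_end) pairs in nanoseconds.
    """
    gaps = []
    cursor = start  # earliest time not yet known to be covered
    for w_start, w_end in sorted(windows):
        if w_start > cursor:
            # Nothing is watching between cursor and the next window.
            gaps.append((cursor, min(w_start, end)))
        cursor = max(cursor, w_end)
        if cursor >= end:
            break
    if cursor < end:
        gaps.append((cursor, end))
    return gaps


# Two devices whose detection windows fail to meet: a 4 ns gap remains.
print(find_time_gaps(0, 100, [(0, 48), (52, 100)]))   # [(48, 52)]

# The same devices with overlapping windows leave no gap.
print(find_time_gaps(0, 100, [(0, 55), (45, 100)]))   # []
```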
For example, clock speeds used by computer systems are increasing rapidly, and new conflicts and timing discrepancies may arise between devices in a computer system. Errors may be introduced into data transfers due to inconsistencies in timing requirements between hardware devices. Many of these hardware devices may be time sensitive and rely on different tolerances or levels of precision when receiving or transmitting data. In some cases, rounding errors may cause devices to conclude that a data transfer has been performed correctly, when in fact errors were introduced during the operation.
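One way such a rounding error might arise can be sketched as follows. The figures and names are hypothetical: a transfer is assumed to be valid only if it completes within 100 ns, one device is assumed to measure time with 1 ns resolution, and another with 10 ns resolution that truncates downward. The coarser device accepts a transfer that the finer device would reject.

```python
def quantize(t_ns: int, resolution_ns: int) -> int:
    """Truncate a time to the device's timing resolution."""
    return (t_ns // resolution_ns) * resolution_ns


def within_limit(t_ns: int, limit_ns: int, resolution_ns: int) -> bool:
    """Report whether the device believes the transfer met its time limit."""
    return quantize(t_ns, resolution_ns) <= limit_ns


actual_completion_ns = 104  # the transfer really finished 4 ns late
limit_ns = 100

print(within_limit(actual_completion_ns, limit_ns, resolution_ns=1))   # False
print(within_limit(actual_completion_ns, limit_ns, resolution_ns=10))  # True
```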
Time-gap defects may occur in other scenarios as well and may be due to the timing inconsistencies previously described. In some cases, designers may have unknowingly left timing inconsistencies unaccounted for in their design of hardware or software systems. Good engineering may require that a certain amount of timing overlap be designed into systems in order to safeguard against timing inconsistencies that may exist. However, due to oversight, improper information, neglect, or the like, time-gap defects may be designed into systems.
Other conditions under which data corruption may occur can be found by identifying those conditions that can delay data transfer between devices. Often, such delays result from computer systems engaging in “multi-tasking” operation or in overlapped input/output (“I/O”) operation. Multi-tasking is the ability of a computer operating system to simulate the concurrent execution of multiple tasks. Importantly, concurrent execution is only “simulated” because there is usually only one CPU in today's personal computers, and it can only process one task at a time. Therefore, a system interrupt is used to rapidly switch between multiple tasks, giving the overall appearance of concurrent execution. In some cases, the interrupts caused by switching from task to task may occur while a device is in the middle of a data transfer, such as a read or write operation, and may be sufficient to introduce an error into the data transfer.
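The task switching described above may be sketched with a simple round-robin loop. This is an illustrative model only: tasks are represented as Python generators, each yield stands in for a point at which a system interrupt may switch tasks, and the names transfer, run_round_robin, and simulate are hypothetical. The resulting trace shows another task running between successive words of a single transfer, which is the delay condition of interest.

```python
from collections import deque


def transfer(name, n_words, trace):
    """Move n_words one at a time, allowing a task switch after each word."""
    for i in range(n_words):
        trace.append(f"{name}:word{i}")
        yield  # a system interrupt may switch tasks here


def run_round_robin(tasks):
    """Run each task for one step in turn until every task has finished."""
    queue = deque(tasks)
    while queue:
        task = queue.popleft()
        try:
            next(task)
            queue.append(task)  # not finished: rejoin the back of the queue
        except StopIteration:
            pass  # finished: drop the task


def simulate():
    trace = []
    run_round_robin([
        transfer("disk_write", 3, trace),
        transfer("other_task", 2, trace),
    ])
    return trace


for step in simulate():
    print(step)
```

On a single CPU only one step executes at a time, so the words of the disk_write transfer are separated by steps of the other task rather than being moved back to back.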