1. The Field of the Invention
This invention relates to computer systems and, more particularly, to novel systems and methods for detecting errors in data exchanged between devices in a computer system, where an undetected data error may persist.
2. The Background Art
Computers are now used to perform functions and maintain data critical to many organizations. Businesses use computers to maintain essential financial and other business data. Computers are also used by government to monitor, regulate, and even activate, national defense systems. Maintaining the integrity of the stored data is essential to the proper functioning of these computer systems, and data corruption can have serious (even life-threatening) consequences.
Computers store information in the form of numerical values, or data. Information represented as data may take many forms including a letter or character in an electronic document, a bank account number, an instruction executable by a processor, operational values used by software, or the like. Data may be stored permanently in long-term memory devices or may be stored temporarily, such as in a random access memory. Data may flow between devices, over networks, through the Internet, be transmitted wirelessly, and the like.
Data may be changed or overwritten in many cases, such as when an account balance or date is automatically updated. However, computer users expect a computer system not to make inadvertent or incorrect changes to data, compromising its integrity. When these inadvertent or erroneous changes do occur, data corruption is incurred. The causes of data corruption may be numerous, including electronic noise, defects in physical hardware, hardware design errors, and software design errors.
Hardware design flaws may result from oversights or inaccuracies in specifying timing, function, or requirements for interfacing with other hardware in a circuit or computer system. Computer system hardware designers may build a certain amount of design margin into a system to allow for voltages to settle, signal rise and fall times, and the like. Specifications usually provide margins and limits. If insufficient design margin is provided or timing errors cause signals to be read at incorrect times, data corruption may result. Thus, even when data may be stored correctly in memory devices or calculations are performed correctly by a processor, data may be corrupted when transferred between hardware devices due to timing inconsistencies or insufficient design margin.
Different approaches may be used to reduce or eliminate data corruption. One approach may be to prevent data corruption from happening in the first place. This may be accomplished, in part, by improving the quality and design of hardware and software systems. Data is transmitted and manipulated by myriad different hardware components in a computer system including buses, controllers, processors, memory devices, input and output devices, cables and wires, and the like. Software may contain glitches or logical flaws. Each one of these hardware components or software applications is a possible candidate for incurring data corruption.
Another approach is to build error detecting and correcting capabilities into the hardware and software systems. Error detection and correction techniques such as parity checking, redundant systems, and validity checking can help to detect and correct data corruption.
In certain hardware systems, time-gaps may exist in which erroneous data transfers between devices may occur, yet remain undetected by the hardware involved. Specifications for controllers or other devices in a computer system may have very rigorous timing requirements stating when error processing will or will not detect and report an error. Such a requirement may not be stated as an absolute time, but rather as an absolute time plus or minus a tolerance, where the tolerance value may be very small. This tolerance may determine time-gaps in which errors may go undetected by a device. Detecting these time-gaps in hardware systems may be critical in order to identify possible sources of data corruption due to faulty hardware design.
For example, clock speeds used by computer systems are increasing rapidly. As a result, new conflicts and timing discrepancies may arise between devices in a computer system. Errors may be introduced into data transfers due to inconsistencies in timing requirements between hardware devices. Many of these hardware devices may be time sensitive and rely on different tolerances or levels of precision with respect to receiving or transmitting data. In some cases, rounding errors may cause devices to conclude that a data transfer has been performed correctly, when in fact errors were incurred into the operation.
Time-gap defects may occur in other scenarios as well and may be due to the timing inconsistencies previously described. In some cases, designers may have unknowingly left timing inconsistencies unaccounted for in their design of hardware or software systems. Good engineering may require that a certain amount of timing overlap be designed into systems in order to safeguard against timing inconsistencies that may exist. However, due to oversight, improper information, neglect, or the like, time-gap defects may be designed into systems.
Other conditions under which data corruption may occur may be identified by simply identifying those conditions that can delay data transfer between devices. Often, such a delay may result from computer systems engaging in "multi-tasking" operation or in overlapped input/output ("I/O") operation. Multi-tasking is the ability of a computer operating system to simulate the concurrent execution of multiple tasks. Importantly, concurrent execution is only "simulated" because there is usually only one CPU in today's personal computers, and it can only process one task at a time. Therefore, a system interrupt is used to rapidly switch between multiple tasks, giving the overall appearance of concurrent execution. In some cases, the interrupts caused by switching from task to task may occur while a device is in the middle of a data transfer, such as a read or write operation, and be sufficient to incur an error into the data transfer.
In view of the foregoing, it is a primary object of the present invention to provide a detection module capable of detecting time-gap defects in computer systems.
Consistent with the foregoing objects, and in accordance with the invention as embodied and broadly described herein, an apparatus and method are disclosed, in suitable detail to enable one of ordinary skill in the art to make and use the invention. In certain embodiments an apparatus and method in accordance with the present invention may include a detection module stored in the memory of a computer system. The detection module may be configured to detect time-gap defects between controllers, between memory and input or output devices, or between any number of different hardware resources in a computer system. The detection module may include an input module, an initialization module, an operation module, a verification module, and an output module for performing its various functions.
An apparatus and method in accordance with the invention may be configured to march across a suspect domain by inserting delays into a data transfer operation. This "marching" process may occur by successively increasing the delays by a user-defined delay step value until an error is incurred into the data transfer. Once an error is incurred, the delay value may be reduced and the delay step value decremented.
The process may be repeated again by marching across the suspect domain in increments of the decremented delay step value, inserting the delays into the data transfer operation until an error is incurred. Once an error is incurred, the delay value may again be reduced and the delay step value decremented. The process of marching across the suspect domain and decrementing the delay step value may continue until a minimum delay step is reached. An apparatus in accordance with the invention may use this process to search for the minimum delay step needed to incur an error into the data transfer which remains undetected by the computer system.
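By way of illustration only, the marching process described above might be sketched as follows. The sketch is not taken from the disclosure: the function name `march_for_error`, its parameters, the use of integer delay units, the halving of the step on each pass, and the backing up by one step after an error are all assumptions chosen for the example. The callable `transfer_has_undetected_error` is a hypothetical stand-in for performing one delayed data transfer and reporting whether an undetected error resulted.

```python
from typing import Callable, Optional


def march_for_error(
    transfer_has_undetected_error: Callable[[int], bool],
    start_delay: int,
    max_delay: int,
    initial_step: int,
    min_step: int,
) -> Optional[int]:
    """March across the suspect domain with a shrinking delay step.

    Returns the smallest delay found to incur an undetected error,
    or None if the maximum delay is reached without such an error.
    Delays are integers in arbitrary time units (an assumption).
    """
    if min_step <= 0 or initial_step < min_step:
        raise ValueError("steps must satisfy 0 < min_step <= initial_step")

    delay = start_delay
    step = initial_step
    found: Optional[int] = None

    while True:
        # March forward until a transfer fails or the domain is exhausted.
        while delay <= max_delay and not transfer_has_undetected_error(delay):
            delay += step
        if delay > max_delay:
            return found  # no (further) error located within the domain

        found = delay
        if step <= min_step:
            return found  # the minimum delay step has been reached

        # Reduce the delay to the last passing value and decrement the step.
        # Halving is an assumed policy; the text only says "decremented".
        delay = max(start_delay, delay - step)
        step = max(min_step, step // 2)
```

For instance, with a simulated transfer that fails for any delay of 137 units or more, `march_for_error(lambda d: d >= 137, 0, 1000, 64, 1)` returns 137 after passes with steps of 64, 32, 16, 8, 4, 2, and 1.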
For example, an apparatus and method in accordance with the invention may initiate a data transfer between devices in a computer system. The data transfer, such as may be performed during a read or write operation, may be interrupted by a delay having a user-defined duration. After the data transfer has finished, a test may be performed to determine whether the delay incurred an error into the data transfer or not. Once this determination is made, a test may then determine whether an error was detected by the devices involved in the data transfer.
If neither an error is incurred into the data transfer nor an error is detected by any of the devices, the data transfer may be repeated and a second delay time, having a longer duration than the first delay, may be inserted into the data transfer to interrupt the transfer. After the transfer has terminated, the same tests may be repeated.
In this manner, the process may be repeated until an error is incurred which remains undetected by the computer system. Thus, time-gap defects may be detected in a system. If there are not any time-gap defects detected, the process may be terminated once a maximum delay value is reached.
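The two tests and the repeated, lengthening delay described above may be illustrated with a minimal sketch. It is offered under stated assumptions rather than as the disclosed implementation: `TransferResult`, `find_time_gap`, and `simulated_transfer` are hypothetical names, the delay is assumed to grow by a fixed step, and the search is assumed to continue whenever an incurred error is properly reported by the devices. The simulated device, with its thresholds of 300 and 500 units, exists only to make the example runnable.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TransferResult:
    data_corrupted: bool   # first test: did the delay incur an error?
    error_reported: bool   # second test: did any device detect an error?


def find_time_gap(
    run_transfer: Callable[[int], TransferResult],
    first_delay: int,
    delay_step: int,
    max_delay: int,
) -> Optional[int]:
    """Repeat the transfer with longer delays until an error goes undetected.

    Returns the first delay producing a corrupted but unreported transfer,
    or None once the maximum delay value is passed.
    """
    delay = first_delay
    while delay <= max_delay:
        result = run_transfer(delay)
        if result.data_corrupted and not result.error_reported:
            return delay  # time-gap defect: error incurred yet undetected
        # Assumption: a clean transfer or a properly reported error
        # both lead to another attempt with a longer delay.
        delay += delay_step
    return None


def simulated_transfer(delay: int) -> TransferResult:
    """Hypothetical device: corrupts data from 300 units, reports from 500."""
    return TransferResult(data_corrupted=delay >= 300,
                          error_reported=delay >= 500)
```

With this simulated device, `find_time_gap(simulated_transfer, 0, 50, 1000)` returns 300, the first tested delay at which data is corrupted but no error is reported. A device that reports every error it incurs yields `None`.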