It is common in computer systems to perform diagnostic tests when a system is started (i.e., booted). Faults detected at that time result in the system being placed in an error state. The computer, therefore, cannot be started without user intervention.
It is also common in computer systems to monitor system components during normal operation to detect errors as they occur. For example, parity is generated and subsequently checked as data is transmitted across data buses or when data is sent to a data storage device. Typically, any other diagnostic testing during normal system operation requires user intervention to specifically execute a diagnostic task. Such diagnostic tasks also typically work only with a limited portion or sub-system of the computer, and that sub-system must be disabled for the diagnostic test to be performed.
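By way of illustration only, the parity generation and checking mentioned above can be sketched as follows. The function names, the 8-bit example word, and the choice of even parity are assumptions made for this sketch and are not taken from any particular bus standard.

```python
from typing import Tuple


def even_parity_bit(word: int) -> int:
    """Return the bit that makes the total count of 1 bits even."""
    return bin(word).count("1") % 2


def send(word: int) -> Tuple[int, int]:
    """Sender side: attach a parity bit to the data word."""
    return word, even_parity_bit(word)


def check(word: int, parity: int) -> bool:
    """Receiver side: recompute parity and compare with the received bit."""
    return even_parity_bit(word) == parity


word, parity = send(0b10110010)
clean_ok = check(word, parity)                    # unmodified word passes
single_flip_ok = check(word ^ 0b00000100, parity)  # one flipped bit fails
double_flip_ok = check(word ^ 0b00000110, parity)  # two flipped bits pass
```

In this sketch a single flipped bit is caught, while two flipped bits cancel out and pass the check, which is a known limit of single-bit parity.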
While current computer technology provides methods for detecting errors in memory sub-systems and on data buses, heretofore those methods have not been used to provide similar protection to the interfaces connected to those data buses. For example, a processor may issue a write command across a PCI bus to an area of system memory. While the PCI bus is protected using some form of parity checking and the system memory may be protected using parity or ECC methods, the interfaces between the processor and the PCI bus, and the interface between the PCI bus and the memory have no such protection. While the anticipated error rates on these interfaces are relatively low, they are not zero.
These low error rates are generally considered to be tolerable for general computing applications. However, some high-end computer systems (e.g., mainframe computers) utilize end-to-end data checking, which provides protection to the entire data path, including the interfaces. This protection is provided using auxiliary data check lines and checksums for data block transfers. This feature requires designs that have control of each device or process along the data path. This approach is not practical in the “open systems” environment where components such as bus controllers, memory controllers, FIFO buffers, I/O controllers, etc. may be procured from a variety of different sources. Typically, each of these components is designed to have as low a cost as possible and, even if a standard existed, the incorporation of features to provide the end-to-end data checking could make these components non-competitive.
FIG. 1 shows a system block diagram of a computer system 100 of the prior art without any end-to-end data checking. Computer system 100 includes a central processor 102, memory controller 104, memory 106, processor to PCI bridge 108, SCSI controller 110, and fibre channel controller 112, all interconnected by data busses, generically referred to by reference number 114. External disk drives 116 are connected to computer system 100 through SCSI controller 110 by a parallel SCSI bus 118.
If data is read from the disk drives 116 into computer system 100, the data is transferred across the parallel SCSI bus 118, where parity checking detects data errors. The data must then pass through the SCSI controller 110, which may contain buffer memory, FIFOs, data holding registers, and similar internal components and sub-systems. If a data error occurs within one of these internal components of the SCSI controller 110 (i.e., the SCSI interface), the data error is not detected. The data is then transferred from the SCSI controller 110 across the PCI bus 114 to the memory controller 104. If an error occurs on the PCI bus, then parity checking on the PCI bus 114 may detect that error. The memory controller 104 may also contain data buffers, FIFOs and data holding registers. If an error occurs in the memory controller 104, that error is also not detected. If the memory 106 supports ECC or parity checking, then memory controller 104 generates the ECC or parity and stores it along with the data in the memory 106. If an error has occurred in memory controller 104, then the ECC or parity is generated on corrupt data, rather than on the original, correct data.
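The read path just described can be modeled in a few lines to make the gap concrete. This is only a sketch under stated assumptions: the function names, the single-word transfer, and the fault masks are hypothetical, and simple parity stands in for whatever parity or ECC scheme the buses and memory actually use.

```python
def parity(word: int) -> int:
    """Even parity over the bits of a word."""
    return bin(word).count("1") % 2


def bus_transfer(word: int, flip_mask: int = 0) -> int:
    """Carry a word across a parity-protected bus; raise if parity fails."""
    sent_parity = parity(word)    # generated by the sender
    received = word ^ flip_mask   # optional fault injected on the bus
    if parity(received) != sent_parity:
        raise IOError("bus parity error")
    return received


def controller(word: int, flip_mask: int = 0) -> int:
    """Unprotected controller internals (buffers, FIFOs, holding registers)."""
    return word ^ flip_mask


def memory_store(word: int):
    """Memory controller generates parity over whatever data it receives."""
    return word, parity(word)


def memory_check(stored) -> bool:
    """Verify stored data against the parity stored alongside it."""
    word, stored_parity = stored
    return parity(word) == stored_parity


def read_from_disk(original: int, scsi_fault: int = 0):
    word = bus_transfer(original)        # parallel SCSI bus 118
    word = controller(word, scsi_fault)  # SCSI controller 110
    word = bus_transfer(word)            # PCI bus 114
    return memory_store(word)            # memory controller 104, memory 106


original = 0x5A
stored = read_from_disk(original, scsi_fault=0x08)
memory_looks_valid = memory_check(stored)    # parity matches the stored word
data_is_corrupt = stored[0] != original      # but the word is not the original
```

In this model a bit flipped inside the controller reaches memory with self-consistent parity, so every individual check passes even though the stored data differs from what was read from the disk.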
As has been shown, a risk of data corruption exists as data passes through both the SCSI controller 110 and the memory controller 104.
The need for end-to-end data checking is also present in storage routers. A storage router is a dedicated computer system that is attached to one or more host systems via external storage interfaces, such as fibre channel, parallel SCSI, Ethernet, Infiniband or ATM. The storage router is also attached to storage devices, such as disk drives or tape drives, via external storage interfaces. A storage router acts as a bridge, providing host computer systems access to the storage devices.
A storage router is typically implemented as a single chip computer (or processor) with one or more internal busses, typically PCI busses. I/O controllers attached to these PCI busses connect to the storage interfaces. A memory controller attaches to the PCI busses to provide both the processor and the I/O controllers access to a central memory. In such a storage router system, parity may be generated by both the single chip computer and by the I/O controllers. Data sent across these PCI busses is verified so that errors introduced during the transfer are detected. There is, however, no provision for data checking within the I/O controllers or processor, or through the memory controller. If a hardware failure occurs in the I/O controllers, the memory controller, or the processor, it is possible for data being transferred through the storage router to be corrupted without detection.
FIG. 2 shows a block diagram of a typical storage router 200, where storage router 200 is similar to computer system 100 shown in FIG. 1, but in addition to a first PCI bus 214a, it has a second PCI bus 214b. Memory space 206 is accessible via memory controller 204 through both of the PCI busses 214a, 214b. The processor 202 is typically an integrated device that contains dual PCI interfaces 214a, 214b and an internal memory controller 204, which controls a second memory area. The storage router also provides additional storage interfaces 210a, 210b, 220a, 220b, some of which are attached to host systems 222 rather than to storage devices (e.g., disk drives 166, etc.).
It is, therefore, a principal object of the invention to provide a method for performing a complete test of the data path within a system between system memory and storage interfaces.
It is an additional object of the invention to provide a method for checking data buses, data initiators and data targets within a system as well as all intervening data interfaces.
It is another object of the invention to provide a method for performing a complete end-to-end data path test which may be used with off-the-shelf hardware components and does not require a specific hardware configuration for its implementation.
It is a further object of the invention to provide a method for performing a complete end-to-end data path check wherein a data test pattern is periodically written by a data initiator to a data target.
It is an additional object of the invention to provide a method for performing a complete end-to-end data path check wherein the data test pattern is copied back to the data initiator and compared to the original pattern.
It is a still further object of the invention to provide a method for performing a complete end-to-end data path check wherein the data test pattern checks all data bit lines within a particular data path in both high and low states.
It is again an object of the invention to provide a method for performing a complete end-to-end data path check wherein multiple data paths are checked.
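The write, copy-back, and compare check named in the foregoing objects can be illustrated with a short sketch. It is not the claimed method: the `Target` class, the stuck-at-zero fault model, the 32-bit width, and the choice of an alternating pattern and its complement are all assumptions made for illustration. Periodic scheduling is left to the caller.

```python
def make_patterns(width: int) -> list:
    """Return two patterns that together drive every bit line high and low."""
    mask = (1 << width) - 1
    alternating = sum(1 << i for i in range(0, width, 2))  # 0x55... pattern
    return [alternating, alternating ^ mask]               # and its complement


def covers_both_states(patterns, width: int) -> bool:
    """True if each bit position is 1 in some pattern and 0 in some pattern."""
    mask = (1 << width) - 1
    seen_high = 0
    seen_low = 0
    for pattern in patterns:
        seen_high |= pattern
        seen_low |= ~pattern & mask
    return seen_high == mask and seen_low == mask


def check_path(write, read_back, width: int = 32) -> bool:
    """Write each pattern to the target, copy it back, and compare."""
    for pattern in make_patterns(width):
        write(pattern)
        if read_back() != pattern:
            return False
    return True


class Target:
    """Hypothetical data target; stuck_low_mask models bit lines stuck at 0."""

    def __init__(self, stuck_low_mask: int = 0):
        self.stuck_low_mask = stuck_low_mask
        self.cell = 0

    def write(self, word: int) -> None:
        self.cell = word & ~self.stuck_low_mask

    def read_back(self) -> int:
        return self.cell


# Check more than one data path; each path is modeled by its own target.
paths = {"healthy": Target(), "faulty": Target(stuck_low_mask=0x10)}
results = {name: check_path(t.write, t.read_back) for name, t in paths.items()}
```

Because the two patterns are complements of each other, a line stuck in either state produces a mismatch on one of the two passes.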