1. Field of the Invention
The present invention relates in general to the field of computer systems using PCI Express technology, especially to input/output (I/O) and accelerator components implemented with that technology, and in particular to a PCI Express multiplier device, a corresponding computer system comprising such a PCI Express multiplier device, and a corresponding method for operating PCI Express devices in a computer system. Still more particularly, the present invention relates to a data processing program and a computer program product for operating PCI Express devices in a computer system.
2. Description of the Related Art
The IT industry tries to build reliable systems from inexpensive and less reliable components. This poses several challenges for the goal of reliable systems: component failures, since inexpensive components may fail completely, and silent data corruption (SDC), since inexpensive components may corrupt data. Due to the lack of consistent error checking, corrupted data may propagate out of the component and remain unnoticed, which means the computer system works with corrupt data and is not aware of it. While failures do not endanger system integrity but potentially “only” cause an outage, SDC is a much more serious problem in the industry.
The present invention addresses these problems for I/O and accelerator components, specifically Peripheral Component Interconnect Express (PCI Express) devices, and shows a way to detect and recover from failures and silent data corruption without changing the I/O or accelerator components and software. PCI Express is a very common and inexpensive technology for attaching I/O devices and accelerators.
Moreover, the PCI Express link technology allows for error detection of data transfers between computer systems and I/O components or accelerators.
In Patent Publication U.S. Pat. No. 7,370,224 B1, “SYSTEM AND METHOD FOR ENABLING REDUNDANCY IN PCI EXPRESS ARCHITECTURE” by Jaswa et al., a method and system to enable redundancy in the communication between a plurality of peripheral devices and redundant hosts through redundant switches are disclosed. The peripheral devices and the host are connected through PCI Express architecture in a data processing system. A described embodiment of the system includes a switch, a redundant switch, and a switch-level exchanging means. The switch-level exchanging means enables the exchange of data packets between the peripheral devices and the host, through an available switch. The available switch is either the switch or the redundant switch. Another described embodiment of the system also includes a redundant host and host-level exchanging means. The host-level exchanging means enables the exchange of data packets between an available host and the available switch. The available host is either the host or the redundant host. The described embodiments try to make the corresponding systems redundant. However, silent data corruption is not detected, and the redundancy is not transparent to software. In addition, in case of a peripheral device failure, software needs to recover or activate another peripheral device, and a switched infrastructure is needed. Furthermore, the switches and devices have to communicate with each other to ensure redundancy, which impacts the peripheral devices.
In the Patent Application Publication WO 2006/137029 A1, “METHOD FOR PARALLEL DATA INTEGRITY CHECKING OF PCI EXPRESS DEVICES” by Wood et al., an apparatus and method for supporting PCI Express are disclosed. A physical layer has a PCI Express interface for receiving data from a PCI Express compatible communication medium. The data is in the form of a packet. A data link layer is disclosed for verifying a CRC (Cyclic Redundancy Check) value and a sequence number received within the packet. A transaction layer is disclosed for receiving the packet from the data link layer and for processing thereof. The transaction layer processes at least some of the packet data in parallel to the data link layer. The apparatus checks data integrity in parallel; however, it only covers acceleration of CRC checking for a single device and does not check whether the device itself has corrupted data. Therefore, it does not detect silent data corruption of the peripheral device.
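The limitation noted above can be illustrated with a minimal sketch. It is not taken from either cited publication and does not model real PCI Express framing: the functions `device_send` and `link_check` are hypothetical, and `zlib.crc32` merely stands in for the link-level CRC. The point is only that a CRC computed by the device after an internal fault still verifies on the link, so the corruption stays silent.

```python
import zlib
from typing import Tuple


def device_send(payload: bytes, corrupt: bool = False) -> Tuple[bytes, int]:
    """Hypothetical device: may corrupt data internally, then attaches a CRC."""
    data = bytearray(payload)
    if corrupt:
        data[0] ^= 0x01  # internal fault occurs BEFORE the CRC is computed
    framed = bytes(data)
    return framed, zlib.crc32(framed)  # zlib.crc32 is a stand-in for the link CRC


def link_check(data: bytes, crc: int) -> bool:
    """Hypothetical link-level check: recompute the CRC and compare."""
    return zlib.crc32(data) == crc


original = b"sector contents"

# Case 1: the device corrupts the data internally; the link check still passes.
data, crc = device_send(original, corrupt=True)
link_ok = link_check(data, crc)
silently_corrupted = link_ok and data != original

# Case 2: a bit flips in transit, after the CRC was computed; the link check fails.
good_data, good_crc = device_send(original)
damaged = bytes([good_data[0] ^ 0x01]) + good_data[1:]
transit_error_detected = not link_check(damaged, good_crc)
```

In this sketch the link check catches only the transit error of the second case; the device-internal corruption of the first case passes unnoticed, which is the gap described above.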