1. Field of the Invention
The present invention relates generally to data processing systems and more particularly to communications in a data processing system that includes multiple host computer systems and multiple adapters, where the host computer systems share the multiple adapters and communicate with those adapters through a PCI switched-fabric bus. Still more specifically, the present invention relates to a computer-implemented method, apparatus, and computer-usable program code for reporting an error that occurred in a device to a single master control host node. Before clearing the error, the master control host node waits until all traffic in the paths in the fabric that might be affected by the error is suspended and all host nodes that might be affected by the error have acknowledged receipt of a notice that the error occurred.
2. Description of the Related Art
A conventional PCI bus is a local parallel bus that permits expansion cards to be installed within a single computer system, such as a personal computer. PCI-compliant adapter cards can then be coupled to the PCI bus in order to add input/output (I/O) devices, such as disk drives or other devices, to the computer system. A PCI bridge/controller is needed in order to connect the PCI bus to the system bus of the computer system. The PCI bus can communicate, through the PCI bridge/controller, with the CPU of the computer system in which the PCI bus is installed. Several PCI bridges may exist within a single computer system. These PCI bridges, however, serve only to couple multiple PCI buses to the CPU of the computer system in which the PCI buses are installed. If the single computer system includes multiple CPUs, the PCI buses can be utilized by the multiple CPUs of that single computer system.
A PCI Express (PCI-E) bus is a modification of the standard PCI computer bus. PCI-E is based on higher-speed serial communications. PCI-E is also architected specifically with a tree-structured I/O interconnect topology in mind, with a Root Complex (RC) denoting the root of an I/O hierarchy that connects a host computer system to the I/O.
PCI-E provides a migration path compatible with the PCI software environment. In addition to offering superior bandwidth, performance, and scalability in both bus width and bus frequency, PCI Express offers other advanced features. These features include quality of service (QoS), aggressive power management, native hot-plug, bandwidth-per-pin efficiency, error reporting, recovery, and correction, and innovative form factors. They meet the growing demand for sophisticated capabilities such as peer-to-peer transfers and dynamic reconfiguration. PCI Express also enables low-cost product designs through low pin and wire counts. A linearly scaled 16-lane PCI Express interconnect can provide data transfer rates of more than 8 gigabytes per second.
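The 16-lane figure can be roughly checked with a short calculation. The sketch below assumes first-generation PCI Express signaling (2.5 gigatransfers per second per lane in each direction, with 8b/10b encoding), which the text does not state; the function name and constants are illustrative only. Under those assumptions a 16-lane link carries about 4 gigabytes per second in each direction, or about 8 gigabytes per second counting both directions.

```python
# Assumed first-generation PCI Express parameters (not stated in the text).
RAW_MEGATRANSFERS_PER_LANE = 2500   # 2.5 GT/s per lane, per direction
ENCODED_BITS = 10                   # 8b/10b: 10 line bits carry 8 data bits
DATA_BITS = 8


def link_bandwidth_mbytes(lanes: int) -> tuple[int, int]:
    """Return (per-direction, both-directions) payload bandwidth in MB/s."""
    # Usable megabits per second on one lane after removing encoding overhead.
    data_mbits_per_lane = RAW_MEGATRANSFERS_PER_LANE * DATA_BITS // ENCODED_BITS
    # Convert bits to bytes, then scale linearly with the lane count.
    per_direction = lanes * data_mbits_per_lane // 8
    return per_direction, 2 * per_direction


if __name__ == "__main__":
    one_way, both_ways = link_bandwidth_mbytes(16)
    print(f"x16: {one_way} MB/s per direction, {both_ways} MB/s combined")
```

Later PCI Express generations use higher signaling rates, so the same lane count yields proportionally more bandwidth than this estimate.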
The host computer system typically has a PCI-to-Host bridging function commonly known as the root complex. The root complex bridges between a CPU bus, such as HyperTransport, and the PCI bus. The root complex may also perform other functions, such as address translation, if necessary. A configuration of multiple host computer systems, each containing one or more root functions, is referred to as a multi-root system. Multi-root configurations that share I/O fabrics have not been addressed well in the past.
Today, PCI-E buses do not permit sharing of PCI adapters among multiple separate computer systems. Known I/O adapters that comply with the PCI-E standard or a secondary network standard, such as Fibre Channel, InfiniBand, or Ethernet, are typically integrated into blades and server computer systems and are dedicated to the blade or system in which they are integrated. Having dedicated adapters adds to the cost of each system because an adapter is rather expensive. Further, the inability to share an adapter among various host computer systems has contributed to the slow adoption rate of these technologies.
In addition to the cost issue, there are physical space concerns in a blade system. The space available in a blade for adapters is limited.
In known multi-root I/O network configurations that share an I/O fabric, a detected error is reported to all host nodes. Thus, an error detected in the I/O fabric will generally bring down all of the host nodes that may be using that fabric.
Some errors affect all host nodes and should be reported to all of the hosts. For example, if a switch fails, then all nodes should be notified. Other types of errors, though, affect only a subset of the host nodes. For example, if an adapter stops functioning, only the host nodes that utilize that adapter need to be notified.
In known systems, all errors are reported to all host nodes regardless of whether the error affects one host node or all host nodes because there is no method for routing the reporting of errors to only the host nodes that might be affected by the error.
Therefore, a need exists for a method, apparatus, and computer program product for reporting an error that occurred in a device, also referred to herein as a component, to a single master control host computer system, where the error message is routed to only those host computer systems that might be affected by the error. Before clearing the error, the master control node waits until all traffic in the paths in the fabric that might be affected by the error is suspended and all host computer systems that might be affected by the error have acknowledged receipt of a notice that the error occurred.
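The need stated above can be illustrated with a minimal sketch. It is not the claimed implementation: the names `hosts_affected_by` and `ErrorRecovery`, the host-to-device usage map, and the path labels are all hypothetical and chosen only for illustration. The sketch selects the hosts that use the failed device and lets a master control node clear the error only after every affected host has acknowledged the notice and every affected path has been suspended.

```python
from dataclasses import dataclass, field


def hosts_affected_by(device: str, usage: dict[str, set[str]]) -> set[str]:
    """Return only the hosts that use the failed device (hypothetical usage map)."""
    return {host for host, devices in usage.items() if device in devices}


@dataclass
class ErrorRecovery:
    """State a master control node might track for one reported error."""
    affected_hosts: set[str]
    affected_paths: set[str]
    acked_hosts: set[str] = field(default_factory=set)
    suspended_paths: set[str] = field(default_factory=set)
    cleared: bool = False

    def acknowledge(self, host: str) -> None:
        # Record an acknowledgment; hosts that were never notified are ignored.
        if host in self.affected_hosts:
            self.acked_hosts.add(host)

    def mark_suspended(self, path: str) -> None:
        # Record that traffic on one affected path has been suspended.
        if path in self.affected_paths:
            self.suspended_paths.add(path)

    def try_clear(self) -> bool:
        # Clear only when every affected host has acknowledged the notice
        # and traffic on every affected path is suspended.
        if (self.acked_hosts == self.affected_hosts
                and self.suspended_paths == self.affected_paths):
            self.cleared = True
        return self.cleared


if __name__ == "__main__":
    usage = {
        "host_a": {"adapter_1", "adapter_2"},
        "host_b": {"adapter_2"},
        "host_c": {"adapter_1"},
    }
    recovery = ErrorRecovery(
        affected_hosts=hosts_affected_by("adapter_1", usage),
        affected_paths={"path_a", "path_c"},
    )
    for host in sorted(recovery.affected_hosts):
        recovery.acknowledge(host)
    recovery.mark_suspended("path_a")
    print(recovery.try_clear())   # one path is still active
    recovery.mark_suspended("path_c")
    print(recovery.try_clear())   # all conditions are now met
```

In this example an error in `adapter_1` is never routed to `host_b`, which does not use that adapter, so `host_b` continues to operate undisturbed.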