In multiprocessor computing environments and other complex computing systems, large numbers of components are often organized along communication channels, which serve as system buses that carry data among the various physical components within the computing environment. The physical components can include network adapters or other communication interfaces, graphics cards, processing units, memories, or other physical subsystems.
One such high-speed serial communication channel is a PCI Express system bus. PCI Express system buses require a PCI Express compatible interface associated with each component to which the bus connects. The interface sends data to, and receives data from, the system components connected to the PCI Express bus. To do so, the interface must send and receive transaction packets on the bus and manage the parallelization/serialization and routing of that data. The interface must also determine the type of transaction to be performed using header information contained in the received data, or must prepare transaction packets for other systems or interfaces to decode.
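As a minimal sketch of how an interface might determine the transaction type from header information, the following Python fragment decodes the Fmt and Type fields carried in the first byte of a PCI Express transaction layer packet header. The function name `decode_tlp_type` and the table `TLP_TYPES` are hypothetical names chosen for illustration, and only a handful of common encodings are listed; an actual interface performs this decode in hardware and handles many more cases.

```python
# Illustrative only: a small subset of Fmt/Type encodings from the first
# byte of a PCI Express TLP header. Names here are hypothetical.
TLP_TYPES = {
    (0b000, 0b00000): "MRd",    # memory read, 32-bit address
    (0b001, 0b00000): "MRd64",  # memory read, 64-bit address
    (0b010, 0b00000): "MWr",    # memory write, 32-bit address
    (0b011, 0b00000): "MWr64",  # memory write, 64-bit address
    (0b000, 0b00100): "CfgRd0", # type 0 configuration read
    (0b010, 0b00100): "CfgWr0", # type 0 configuration write
    (0b000, 0b01010): "Cpl",    # completion without data
    (0b010, 0b01010): "CplD",   # completion with data
}


def decode_tlp_type(header: bytes) -> str:
    """Return a short name for the transaction described by a TLP header."""
    if len(header) < 4:
        raise ValueError("a TLP header is at least one 4-byte doubleword")
    fmt = (header[0] >> 5) & 0b111   # bits 7:5 of byte 0
    typ = header[0] & 0b11111        # bits 4:0 of byte 0
    return TLP_TYPES.get((fmt, typ), "unknown")
```

For example, a header whose first byte is `0x40` has Fmt `010` and Type `00000`, which this table maps to a 32-bit memory write.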
Data transmitted on the various communication channels in such computing environments is subject to errors from a variety of sources, such as a fault occurring in one of the components connected to the channel, transmission errors in the channel, or other types of errors. Error detection systems incorporated at the various bus interfaces detect errors in inbound transaction packets. Some of the errors can be corrected in hardware, and some cannot. When an uncorrectable error is detected, hardware systems generally write the erroneous data into memory, designate that data as “poisoned”, and wait until that memory location is read by software. When the memory location is read by a driver or other software, the system processor issues a Machine Check Abort (MCA) event, causing the software to shut down the entire hardware system.
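The conventional poison-then-abort behavior described above can be sketched in software as follows. This is a simplified model under stated assumptions, not a description of any particular processor: the class `PoisonedMemory` and the exception `MachineCheckAbort` are hypothetical stand-ins for the hardware memory and the MCA event.

```python
class MachineCheckAbort(Exception):
    """Hypothetical stand-in for the processor's MCA event."""


class PoisonedMemory:
    """Toy memory model in which each location carries a poison flag."""

    def __init__(self):
        self._cells = {}  # address -> (data, poisoned)

    def write(self, address, data, poisoned=False):
        # Hardware stores the data even when it is known to be bad,
        # marking the location as poisoned instead of reporting at once.
        self._cells[address] = (data, poisoned)

    def read(self, address):
        data, poisoned = self._cells[address]
        if poisoned:
            # The error surfaces only when software consumes the location.
            raise MachineCheckAbort(f"poisoned data read at {address:#x}")
        return data


memory = PoisonedMemory()
memory.write(0x1000, b"good")
memory.write(0x2000, b"bad!", poisoned=True)  # uncorrectable inbound error
```

In this model, reading address `0x1000` returns the data normally, while reading `0x2000` raises `MachineCheckAbort`. Nothing in the exception identifies which component or domain was responsible, which mirrors why the conventional response is a shutdown of the entire system.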
In large-scale server platforms with large I/O configurations and multiple virtual operating system domains, the error often does not affect every portion of the computing system that is shut down by the MCA event. In such systems, the unconfined error causes a shutdown of unaffected hardware regardless of its relationship to the error-causing component. Restarting an entire system after an MCA event can be time-consuming, and can deprive users of valuable computing resources.
For these and other reasons, improvements are desirable.