Given the continually increasing reliance on computers in contemporary society, computer technology has had to advance on many fronts to keep up with both increased performance demands and the increasingly significant positions of trust being placed with computers. In particular, computers are increasingly used in high performance and mission critical applications where considerable processing must be performed on a constant basis, and where any periods of downtime are simply unacceptable.
Increases in performance often require the use of increasingly faster and more complex hardware components. Furthermore, in many applications, multiple hardware components, such as processors and peripheral components such as storage devices, network connections, etc., are operated in parallel to increase overall system performance.
Along with the use of these more complex components, the software that is used to operate these components often must be more sophisticated and complex to effectively manage the use of these components. For example, multithreaded operating systems and kernels have been developed, which permit computer programs to concurrently execute in multiple “threads” so that multiple tasks can essentially be performed at the same time. For example, for an e-commerce computer application, different threads might be assigned to different customers so that each customer's specific e-commerce transaction is handled in a separate thread.
One logical extension of a multithreaded operating system is the concept of logical partitioning, where a single physical computer is permitted to operate essentially like multiple and independent “virtual” computers (referred to as logical partitions), with the various resources in the physical computer (e.g., processors, memory, input/output devices) allocated among the various logical partitions. Each logical partition executes a separate operating system, and from the perspective of users and of the software applications executing on the logical partition, operates as a fully independent computer.
With logical partitioning, a shared program, often referred to as a “hypervisor” or partition manager, manages the logical partitions and facilitates the allocation of resources to different logical partitions. For example, a partition manager may allocate resources such as processors, workstation adapters, storage devices, memory space, network adapters, etc. to various partitions to support the relatively independent operation of each logical partition in much the same manner as a separate physical computer.
Along with the increased performance available in the aforementioned computer environments, however, comes increased potential for failure. Performing tasks in parallel often raises the possibility that one task may conflict with another task being performed, resulting in corrupt data or system failures. Likewise, as hardware-based components are always subject to at least some risk of failure, and as this risk often increases with the complexity of the hardware component, the use of increasing numbers of more complex hardware components increases the likelihood of encountering hardware component errors or failures during runtime.
As a result, cooperatively with the development of both multithreaded operating systems and logical partitioning, significant development efforts have been directed toward incorporating fault tolerance, high availability, and self-healing capabilities into modern computer designs.
One particular area to which development efforts have been directed is that of managing failures associated with the peripheral hardware components utilized by a computer, e.g., storage devices, network connections, workstations, and the adapters, controllers and other interconnection hardware devices utilized to connect such components to the central processing units of the computer. Peripheral components, which are referred to hereinafter as input/output (IO) resources, are typically coupled to a computer via one or more intermediate interconnection hardware devices that form a “fabric” through which communications between the central processing units and the IO resources are passed.
In lower performance computer designs, e.g., single user computers such as desktop computers, laptop computers, and the like, the IO fabric may require only a relatively simple design, e.g., using an IO chipset that supports a few interconnection technologies such as Integrated Drive Electronics (IDE), Peripheral Component Interconnect (PCI) or Universal Serial Bus (USB). In higher performance computer designs, on the other hand, the IO requirements may be such that a complex configuration of interconnection hardware devices is required to handle all of the necessary communications needs for such designs. In some instances, the communications needs may be great enough to require the use of one or more additional enclosures that are separate from, and coupled to, the enclosure within which the central processing units of a computer are housed.
Often, in more complex designs, peripheral components such as IO adapters are mounted and coupled to an IO fabric using “slots” that are arrayed in either or both of a main enclosure or an auxiliary enclosure of a computer. Other components may be mounted or coupled to an IO fabric in other manners, e.g., via cables and other types of connectors; however, these other types of connections are often also referred to as “slots” for the sake of convenience. Irrespective of the type of connection used, an IO slot therefore represents a connection point for an IO resource to communicate with a computer via an IO fabric. In some instances, the term “IO slot” is also used to refer to the actual peripheral hardware component mounted to a particular connection point in an IO fabric, and in this regard, an IO slot, or the IO resource coupled thereto, will also be referred to hereinafter as an endpoint IO resource.
Managing endpoint IO resources coupled to a computer via an IO fabric is often problematic due to the typical capability of an IO fabric to support the concurrent performance of multiple tasks in connection with multiple endpoint IO resources, as well as the relative independence between the various levels of software in the computer that accesses the IO resources. Failures occurring in the endpoint IO resources, as well as failures occurring in the components in the IO fabric itself, can also have a significant impact on the ability to access other endpoint IO resources in the system. Furthermore, given the desire for minimizing the adverse impact of failures in individual components to maintain overall system availability, significant efforts have been directed toward isolating failures and dynamically reconfiguring a system to overcome these failures.
In a logically-partitioned system, for example, IO slots can be assigned to individual logical partitions, and device drivers in each logical partition can control the IO adapter in each IO slot assigned to that partition. These IO slots are commonly connected to the overall computer and processor/memory complex through a common IO fabric that is effectively shared by all partitions having slots connected through common interconnection elements of that fabric.
In some logically-partitioned systems, the IO fabric may comprise a bridge fabric element connecting the processor/memory bus over a cabling bus to an external IO enclosure, and one or more additional bridge elements connecting the cabling bus to an IO bus having multiple IO slots. One such cabling bus implementation is a Remote Input/Output (RIO) bus, with a processor bridge referred to as a RIO hub used to interface the RIO bus with the processor/memory complex, and with RIO bridge elements disposed in each external IO enclosure connecting the cabling bus to a plurality of PCI Host Bridges (PHBs) and, connected to each PHB, a plurality of PCI-PCI bridges that create the individual IO slot connections into which PCI IO adapter cards are plugged.
In such systems, when an element of the IO fabric hardware detects an error, that hardware element typically enters an error state that suppresses continuing data transfer in either direction between the processor/memory complex and the remaining IO fabric and IO slot elements. Suppression of data transfer in this error state is precisely defined: the element in the error state discards all processor stores and adapter DMAs, and returns all-ones bitwise data to all processor loads.
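The load and store behavior just described can be illustrated with a small software model. The following is only a sketch under stated assumptions: the `fabric_element` structure, the `fabric_store` and `fabric_load` functions, and the four-register layout are hypothetical names introduced here for illustration, and adapter DMA traffic is not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ALL_ONES_32 UINT32_MAX   /* every bit set: 0xFFFFFFFF */
#define NUM_REGS    4u           /* hypothetical register count */

/* Hypothetical model of one IO fabric element and the state behind it. */
struct fabric_element {
    bool     in_error;           /* true once the element has detected an error */
    uint32_t regs[NUM_REGS];     /* stand-in for downstream adapter registers */
};

/* Processor store: discarded while the element is in the error state. */
void fabric_store(struct fabric_element *fe, unsigned idx, uint32_t value)
{
    assert(idx < NUM_REGS);
    if (fe->in_error)
        return;                  /* dropped; the caller receives no indication */
    fe->regs[idx] = value;
}

/* Processor load: returns all-ones data while the element is in the error state. */
uint32_t fabric_load(const struct fabric_element *fe, unsigned idx)
{
    assert(idx < NUM_REGS);
    return fe->in_error ? ALL_ONES_32 : fe->regs[idx];
}
```

In this model a store issued during the error state leaves the downstream register unchanged, which is why software cannot assume that previously issued stores completed once an error has occurred.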
It is common in many systems, particularly those employing PCI-compatible IO buses and adapters, for device drivers to use memory-mapped IO (MMIO) to communicate with the IO adapters. This allows device drivers installed in the partition operating systems to treat the adapter as if it were logically connected directly to the processor/memory bus and just an extension of the system memory occupying a particular memory address range. A device driver communicates with the adapter using processor load or store instructions targeting “memory” addresses that correlate directly to internal adapter facilities. In such a model the device drivers are largely unaware of the composition and arrangement of IO fabric elements, and rely on the IO fabric and IO adapters to behave as if the device drivers were simply accessing a memory region in response to a memory-mapped load or store.
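As a rough illustration of the memory-mapped model, the sketch below treats a set of adapter registers as an ordinary structure accessed through `volatile` fields. The register layout, the `CTRL_START` value, and the function names are assumptions made for this example and do not describe any particular adapter. A real device driver would obtain the mapped address from the operating system and would typically use platform-specific accessors and memory barriers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CTRL_START UINT32_C(0x1)  /* hypothetical "start" command bit */

/* Hypothetical adapter register window as seen through MMIO.
 * volatile keeps the compiler from caching or eliding accesses. */
struct adapter_regs {
    volatile uint32_t control;
    volatile uint32_t status;
};

/* A plain store instruction to the mapped address reaches the adapter. */
void adapter_start(struct adapter_regs *regs)
{
    assert(regs != NULL);
    regs->control = CTRL_START;
}

/* A plain load instruction from the mapped address reads adapter state. */
uint32_t adapter_read_status(const struct adapter_regs *regs)
{
    assert(regs != NULL);
    return regs->status;
}
```

Note that nothing in these functions refers to hubs, bridges, or buses; the driver code sees only an address range.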
In such systems, the device drivers typically rely on one of two methods to detect errors relating to the IO fabric. In the first method, the IO fabric is required to signal a machine check condition to a requesting processor when an MMIO load encounters a fabric element in an error state. A machine check is typically indicated by the return of a status signal with an access request or the triggering of an interrupt, and typically results in a processor diverting execution to a machine check interrupt handler. Because it cannot be verified that previously-issued stores were successfully completed, data integrity concerns mean that this handler nearly always terminates the operating system and any applications executing thereon. In this case, the device driver and operating system are generally designed such that they cannot recover from the error without loss of data integrity. As a result, the common response to the error is to terminate execution of the entire logical partition (or system in a non-partitioned system).
In the second method, by convention, when in an error state the IO fabric and IO adapter are configured to respond to memory-mapped loads by returning a specific set of data that may be recognized by a device driver as potentially signifying an error. For example, one common set of data is referred to as all-ones bitwise data, where each bit of data returned in response to the memory-mapped load is set to one. In this case, the device driver is designed to inspect memory-mapped load reply data for an all-ones pattern, and in such cases, to call operating system services to determine if any element of the platform hardware had entered an error state that would cause this result. In many instances, the error state can be recovered from in the non-machine check method without terminating execution of a logical partition or operating system.
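The second method can be sketched as a check performed on the data returned by each memory-mapped load. In the sketch below, `platform_state` and `platform_query_slot_error` are hypothetical stand-ins for the operating system services mentioned above, not an actual kernel interface. The point illustrated is that all-ones data is only a hint, since an adapter register may legitimately contain that value, so the driver confirms with the platform before treating the load as failed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum load_status { LOAD_OK, LOAD_FABRIC_ERROR };

/* Hypothetical stand-in for platform hardware state known to the OS. */
struct platform_state {
    bool slot_in_error;
};

/* Hypothetical OS service: has any element on the path to the slot
 * entered an error state? */
bool platform_query_slot_error(const struct platform_state *ps)
{
    assert(ps != NULL);
    return ps->slot_in_error;
}

/* Classify the reply data of one 32-bit memory-mapped load. */
enum load_status classify_mmio_load(uint32_t data,
                                    const struct platform_state *ps)
{
    if (data != UINT32_MAX)
        return LOAD_OK;           /* not the all-ones pattern */
    /* All-ones may be real data, so ask the platform before failing. */
    return platform_query_slot_error(ps) ? LOAD_FABRIC_ERROR : LOAD_OK;
}
```

The query is made only when the all-ones pattern is seen, so the common case of a successful load costs a single comparison.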
Machine check-based techniques predated many of the advances in dynamic recovery from hardware errors, and as such, device drivers and IO resources that require machine checks to be signaled are often non-recoverable in nature. The latter technique described above, however, often avoids the generation of machine checks and provides greater recoverability when used in connection with an appropriate recovery protocol, and as a result, device drivers and IO resources that rely on this technique are more typically recoverable in nature.
Using either approach, recovering from the IO fabric error, e.g., capturing error isolation data, resetting the affected hardware, and resuming normal IO operations, typically must be synchronized in such a way as to ensure that each affected device driver and IO adapter reliably detects the error condition, and that, until they detect this condition, IO between the device driver and that adapter continues as if the fabric error state persisted. However, the time from the point at which the error is detected by the platform hardware and partition manager until all affected device drivers have also detected the error is unpredictable, and may be excessively long, which can significantly complicate and delay IO fabric recovery. In extreme cases, a device driver (such as one for a CD-ROM drive that is not active at the moment of the error) may not perform an MMIO load from its adapters for extremely long periods, even as much as days, weeks, or months, depending on how long the device driver is itself active but not using its associated IO adapter.
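The synchronization constraint described above amounts to bookkeeping over the set of affected device drivers. The sketch below is a hypothetical illustration of that constraint only, not a recovery method described in this document: the `recovery_tracker` type and its functions are assumed names, and the limit of 32 drivers is an arbitrary simplification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker: bit i is set while driver i is affected by the
 * fabric error but has not yet detected it. */
struct recovery_tracker {
    uint32_t pending;
};

/* Record the set of affected drivers when the fabric error is detected. */
void tracker_begin(struct recovery_tracker *t, uint32_t affected_mask)
{
    t->pending = affected_mask;
}

/* Called when a driver has observed the error on one of its own loads. */
void tracker_driver_detected(struct recovery_tracker *t, unsigned driver)
{
    assert(driver < 32u);
    t->pending &= ~(UINT32_C(1) << driver);
}

/* Until a driver has detected the error, its IO must keep behaving as if
 * the fabric error state persisted. */
bool tracker_must_appear_failed(const struct recovery_tracker *t,
                                unsigned driver)
{
    assert(driver < 32u);
    return ((t->pending >> driver) & UINT32_C(1)) != 0;
}

/* Under this scheme, recovery can finish only once no driver is pending. */
bool tracker_all_detected(const struct recovery_tracker *t)
{
    return t->pending == 0;
}
```

A single idle driver that never issues a load keeps its bit set indefinitely, which is the source of the unpredictable delay noted above.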
On one hand, waiting for all device drivers to independently detect an error condition before completing the recovery from an IO fabric error often leads to excessive delays, unpredictable results and the possibility that a device driver in one logical partition could prevent other logical partitions from recovering. On the other hand, specifically alerting device drivers of errors in the IO fabric to permit recovery of the IO fabric to be completed can be problematic, particularly in logically-partitioned systems, due to incompatibility with older device drivers and the need to change, redesign or enhance device driver and kernel interfaces in complex ways.
Accordingly, a significant need exists in the art for a faster, more autonomous and more efficient manner of ensuring the device drivers for affected IO resources are able to detect and recover from errors in an IO fabric, particularly in logically-partitioned computers and the like, and especially facilitating the use of existing device drivers.
Another problem associated with non-recoverable device drivers that require machine checks to be performed, particularly in a logically-partitioned system, is that initializing the IO fabric typically exposes all partitions having a set of fabric elements in common to machine checks. This is highly undesirable in a partitioned system in that an error resulting from the failure of an IO adapter in use by one partition can result in machine check-initiated termination of multiple other partitions sharing common elements of that fabric.
In many computer environments, non-recoverable device drivers and IO resources have been or will be replaced with recoverable device drivers and IO resources due to the significantly-reduced effect on system availability. As a result, it is often desirable to utilize only recoverable device drivers and IO resources whenever possible. In some environments, however, older legacy IO resources and device drivers may still be in use, and may not support more advanced recovery protocols. Furthermore, both non-recoverable and recoverable device drivers may be available for some IO resources.
As a result, in practice it is very difficult to identify which installed IO resources have recoverable device drivers, or for a particular resource, which version(s) of its device driver might require machine checks. Additionally, as customers reconfigure logical partition IO assignments, update device drivers, or add new IO adapters, it is possible that a partition that was uniformly one type or the other might be reconfigured such that this partition either obtains a mix of both types of device drivers or becomes uniformly the other type.
Consequently, a significant need also exists in the art for a manner of detecting the recoverability of IO resources and device drivers therefor in a mixed environment (i.e., where recoverable and non-recoverable resources and device drivers are permitted to co-exist), and dynamically configuring an IO fabric to reliably manage IO errors yet minimize the utilization of machine checks whenever possible.