The present invention pertains generally to a multiple processor system, and more particularly to a multiple processor system that provides access to input/output devices in the face of a failure of a processor controlling the device.
Stock exchanges, banks, telecommunications companies, and other operators of mission-critical applications have relied on fault tolerant (FT) and highly available (HA) computer systems for many years. Downtime in these environments is extremely costly and cannot be tolerated. As the reliance on computer systems for everyday activities continues to permeate our society and more services move on-line, twenty-four by seven operation and accessibility will become exceedingly important. For instance, consider Internet Service Providers (ISPs) and web servers. Users of ISPs and web servers demand continuous availability. Loss of availability translates directly to loss of revenue and loss of clientele. As a result, the demand for FT and HA computer systems is growing and will continue to grow. This growing demand will undoubtedly spawn a lot of research as industry and academia attempt to address availability issues. One such issue is the availability of I/O devices. In all the above mentioned examples, I/O plays a very important role. For example, a web server would be useless if it lost access to its network interfaces or disk controllers. Thus, in providing HA and FT solutions, the I/O devices must also be considered. Traditionally, discussions of availability have centered on keeping processing elements or applications alive. I/O devices are usually only addressed by the use of redundant controllers or shared buses. This only provides HA access to a device in the event of a device failure. For example, two ethernet controllers can be used to provide a reliable ethernet connection: the driver and upper software layers can switch to the remaining good ethernet controller in the event that one of the controllers fails. However, in recent years, the leading cause of system failures has shifted from hardware failures to software failures, such as operating system panics or hangs.
The mean time between failures for devices continues to increase as component count is reduced, allowing cleaner and more reliable devices to be built. Therefore, providing HA access to I/O devices in the event of failures other than a failure of the device itself is an area for research. The goal of this invention is to define a way in which access to devices can be preserved in spite of operating system or processing entity failures without requiring the users or applications to have any knowledge of the failure or recovery.
Fault tolerant and highly available computer system architectures range from simple hot-standby arrangements to complex architectures which employ dedicated fault tolerant hardware and hardened operating systems. The latter systems are most effective in providing availability since they have been designed to survive any single point of hardware failure. However, the system component replication of this hardware architecture has a price premium associated with it. For example, to ensure processor instruction integrity, three CPUs can be used in a triple modular redundancy (TMR) relationship to form one logical CPU. Before each instruction, the CPUs submit the instruction they intend to execute to a voter. If all three instructions are the same, the voter allows the instruction to be executed; if they are not, the odd one is voted out of the TMR relationship. While this ensures processor instruction integrity, it triples the cost of the CPU component. Even with the component replication, the FT architecture can still be susceptible to a single point of failure: the operating system. If the operating system panics or hangs, the entire machine is rendered useless in spite of the FT hardware. This is the case with all traditional bus-based, single node computer systems. If the operating system or the processing entity (CPU(s) and associated memory) fails, the entire system fails. As a result, the I/O devices associated with the system are no longer accessible.
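The voting step of the TMR arrangement described above can be illustrated with a minimal sketch. The function name `vote`, the CPU identifiers, and the instruction strings are hypothetical and chosen only for illustration; a real voter is implemented in hardware, and the handling of the case where no two CPUs agree is an assumption not addressed in the text.

```python
from collections import Counter


def vote(proposals):
    """Compare the instruction each CPU intends to execute.

    proposals maps a (hypothetical) CPU id to its proposed instruction.
    Returns the agreed instruction and the ids of the CPUs voted out.
    """
    counts = Counter(proposals.values())
    instruction, agreeing = counts.most_common(1)[0]
    # Assumption: without a strict majority the voter cannot decide.
    if agreeing * 2 <= len(proposals):
        raise RuntimeError("no majority among CPUs")
    # Any CPU that disagreed with the majority is the "odd one" out.
    voted_out = sorted(cpu for cpu, ins in proposals.items() if ins != instruction)
    return instruction, voted_out
```

With three agreeing CPUs the call returns the instruction and an empty list; with one dissenting CPU it returns the majority instruction together with the id of the dissenter.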
This problem can be partially solved by employing a distributed operating system and the concept of clustering. A distributed operating system allows a collection of independent computer systems to be connected together via a communication interconnect, forming a cluster which can operate as a single system or as a collection of independent resources. This removes the operating system as the single point of failure. Since there are multiple instances of a cooperating operating system, the loss of one instance does not imply the loss of the collective system. In other words, each node is an independent fault zone. In this situation, the loss of a node as the result of a processing entity or OS failure only results in the loss of access to the devices which were attached to and being controlled by that node. This architecture can be exploited to provide more availability than the first case by attaching redundant controllers to different nodes in the group. These redundant controllers are attached to a common, shared resource. For example, consider a shared small computer system interface (SCSI) bus. Two nodes in the cluster each have a SCSI controller attached to a common, shared SCSI bus. If one of the nodes dies, the SCSI devices on that bus can still be accessed via the surviving node and its controller. While this allows a certain degree of device availability in spite of OS failures, it requires redundant controllers with shared resources. Devices and resources which do not support sharing cannot be addressed with this model.
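The shared-bus example, and its limitation, can be sketched as follows. The class `SharedBus`, the function `pick_path`, and the node and bus names are hypothetical; the sketch only models which node a request could be routed through, not any actual SCSI behavior.

```python
class SharedBus:
    """A resource reachable through a controller on each attached node."""

    def __init__(self, name, attached_nodes):
        self.name = name
        self.attached_nodes = list(attached_nodes)


def pick_path(bus, alive_nodes):
    """Return the first surviving node with a controller on the bus, or None."""
    for node in bus.attached_nodes:
        if node in alive_nodes:
            return node
    # No surviving node has a controller on this resource.
    return None
```

A bus attached to two nodes stays reachable after one node fails, whereas a resource attached to a single node (one that does not support sharing) returns `None` as soon as that node is lost.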
Single system image (SSI) distributed operating systems allow a group of independent nodes to be clustered together and act as a single system. An SSI distributed operating system gives the illusion that physical machine boundaries are erased. This allows the user to access remote resources as if they were local. In addition, users can execute remote programs using local execution semantics. From the outside, the cluster appears to users as if it is a single computing resource. From the inside, the cluster appears to processes as if it is a single computing resource. As a result, devices in the cluster will appear as if they are independent of any particular node. In reality, this is not the case; it is just an illusion provided by the SSI distributed OS. However, the illusion can be exploited to help hide the loss of a node, and thus devices, from the user. In order for all devices to remain physically available in spite of an operating system failure, I/O devices must reside in a different fault zone than the processing entities. By having processing entities and I/O devices in different fault zones, the loss of the processing entity due to a hardware or operating system failure does not imply the loss of the I/O devices. This separation of fault zones is possible with a system area network (SAN) based system architecture. A SAN-based system is fully interconnected and allows any-to-any (CPU-to-CPU, CPU-to-I/O, I/O-to-I/O) communication between the components that make up the system. Unlike traditional bus-based system architectures, SAN-based systems allow I/O devices to be physically independent of any particular processing entity. As a result, from a hardware perspective, a device can be accessed by any of the processing entities, not just the processing entity with which it shares a back-plane, as in the traditional bus-based system architecture. The loss of a processing entity does not preclude access to the devices that it was controlling.
The I/O devices are still functional and accessible. In the event of an operating system or processing entity failure, the devices being controlled by that instance of the operating system should be transparently failed over to be controlled by some other instance of the operating system on a different processing entity in the cluster.
The SAN-based system architecture provides the framework for physical device access to remain intact in spite of an operating system or processing entity failure. However, transparently handling the failure and fail-over at the operating system level involves a multitude of operations not provided by the hardware architecture or the distributed operating system, SSI or not. From a software perspective, devices are still tightly coupled to a particular processing entity since the device driver that controls the device resides in a particular instance of the operating system on a processing entity. All system and user requests for the device must pass through the device driver before going to the device. Likewise, any data coming from the device must pass through the device driver before being returned to the system or user. Any requests for that device from anywhere in the cluster will have to come to that node and go through its device driver. Since device drivers are part of the operating system, when an operating system or processing entity fails, the device drivers on that processing entity are lost as well. Therefore, the software control for that device is lost. The SAN-based architecture provides a physical path by which the device can still be accessed. The distributed operating system allows the collective system to still be alive although a node in its collective has failed. However, all requests active on the device or in transit to the device at the time the processing entity failed will now be lost or return in error since the controlling entity (device driver in the OS) for the device is no longer available. Thus, while the SAN-based system provides the physical connectivity, and the distributed OS keeps the OS from being a single point of failure for the collective system, nothing is available to recover the device and make use of it.
Thus, it can be seen that there is a need to define a way in which access to devices can be transparently recovered, that is, a way to rebuild the internal system software framework so that some other node in the cluster can assume the role of the controlling entity for the device without users and applications being aware that a failure occurred. This has many implications. For instance, the device's state must be preserved so it looks exactly the same as it did before the failure. The device driver on the new controlling node must have the same state information and be able to process requests in the same manner as the device driver on the failed node. The way in which the user accesses the device must remain exactly the same although the user's requests now have to go to a different node in the cluster. Any requests in transit or on the device at the time of the failure must be analyzed and replayed if necessary. All this and more needs to take place transparently. The combination of the SAN-based system architecture and an SSI distributed operating system provides the basic framework on which a solution for high availability access to I/O devices can be built.
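The requirements listed above (preserved driver state, a new controlling node, and replay of unfinished requests) can be illustrated with a minimal sketch. Everything in it is hypothetical: the names `Cluster`, `DriverState`, and `Request`, the rule for choosing the new owner, and in particular the assumption that driver state and pending requests are recorded somewhere that survives the failed node. The text states that these mechanisms are needed; it does not prescribe how they are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Request:
    req_id: int
    payload: str
    completed: bool = False


@dataclass
class DriverState:
    device: str
    settings: dict = field(default_factory=dict)
    pending: list = field(default_factory=list)


class Cluster:
    def __init__(self, nodes):
        self.alive = set(nodes)
        self.owner = {}  # device -> controlling node
        # Assumption: this checkpoint lives outside the failed node's fault zone.
        self.state = {}  # device -> DriverState

    def attach(self, device, node, settings):
        self.owner[device] = node
        self.state[device] = DriverState(device, dict(settings))

    def submit(self, device, request):
        """Record the request and return the node that currently controls the device."""
        self.state[device].pending.append(request)
        return self.owner[device]

    def fail_node(self, node):
        """Move the failed node's devices to a survivor; return request ids to replay."""
        self.alive.discard(node)
        to_replay = {}
        for device, owner in list(self.owner.items()):
            if owner != node:
                continue
            if not self.alive:
                raise RuntimeError("no surviving node to take over " + device)
            # Assumption: any deterministic choice of survivor will do here.
            self.owner[device] = min(self.alive)
            st = self.state[device]
            # Keep only requests that had not completed when the node failed.
            st.pending = [r for r in st.pending if not r.completed]
            to_replay[device] = [r.req_id for r in st.pending]
        return to_replay
```

Callers keep addressing the device by name through `submit`, so the change of controlling node is not visible to them; only the cluster's internal ownership table changes.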
However, even though the operation of a failed node may be taken over by another node of the cluster, the resources (i.e., peripheral devices connected to and controlled by the failed node) are often lost. The controlling entity (i.e., the CPU, its memory, operating system and device drivers) of the device has failed, but the device itself is still functional. Access to these devices may be critical and loss of access may be highly undesirable. For example, the availability of the processor unit in a web server is inconsequential if the disk controller or network interface cannot be accessed. It is desirable, therefore, that in the event of a node failure, access to the devices controlled by that node be transferred to another node.