Cellular Computer Systems
Modern high-performance computer systems typically incorporate multiple processors. Some computer systems embody primary processors of multiple instruction set types; multiple-processor systems having primary processors of multiple instruction set architectures (ISAs) are known herein as heterogeneous computer systems.
In addition to primary processors, upon which operating system and user programs run, there are typically additional embedded processors of additional types. Embedded processors are typically provided for control of specific hardware devices, such as disk drives, in the system. In a computer system, embedded processors may also perform system management functions such as monitoring of primary processor voltages and temperatures, control of cooling subsystems, and boot-time configuration of various system components.
A family of high performance heterogeneous computer systems from Hewlett-Packard is capable of being configured to use primary processors of two or more types, including the Intel Itanium and PA8800 instruction set architectures.
In this family of computer systems, a field replaceable “cell” has several primary processor circuits of the same type, together with memory, circuitry for communicating with other cells over a backplane bus, input/output (I/O) bus interface circuitry, JTAG (Joint Test Action Group) scan circuitry, and other circuitry. There may be one or more additional embedded processors in each cell to perform system management functions.
One or more cells, which may but need not be of the same type, are installed into a backplane. A heterogeneous computer system is formed when cells having more than one type of processors are inserted into the backplane; a homogeneous computer system is formed if all cells have processors of the same type.
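The distinction between heterogeneous and homogeneous systems can be sketched in a few lines of C. This is only an illustration under stated assumptions: the `isa_t` tags, the `cell_t` record, and the function name `system_is_heterogeneous` are hypothetical and do not come from any actual firmware.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical ISA tags, named after the processor types mentioned above. */
typedef enum { ISA_ITANIUM, ISA_PA8800 } isa_t;

/* Hypothetical description of one cell installed in the backplane. */
typedef struct {
    isa_t    isa;        /* all primary processors in a cell share one type */
    unsigned cpu_count;  /* number of primary processors in the cell */
} cell_t;

/* Returns true when the installed cells carry more than one processor type.
 * An empty backplane or a single cell is reported as homogeneous. */
bool system_is_heterogeneous(const cell_t *cells, size_t n_cells)
{
    for (size_t i = 1; i < n_cells; i++) {
        if (cells[i].isa != cells[0].isa)
            return true;
    }
    return false;
}
```

Because every cell holds processors of a single type, comparing each cell against the first one is sufficient for this check.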
This family of computer systems supports simultaneous execution of multiple operating systems, including multiprocessor versions of Windows-NT, Unix, VMS, and Linux. Multiple instances of each system are also supported. Each operating system instance operates in a partition of a computer system.
At system boot time, a set of processors of a particular type is assigned to operate in each partition. These processors may belong to more than one cell, but must all be of the same ISA. As each operating system boots, or reinitializes, these processors become aware of each other, and appropriate task routing and assignment data structures are built in system memory, a process known herein as a Rendezvous of these processors.
During Rendezvous, a processor of each cell is designated a master processor for that cell. One cell of each partition is designated as the master cell for the partition.
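The constraints described for a Rendezvous might be sketched as follows. The text does not say how the master processor or master cell is chosen, so this sketch simply assumes the lowest-numbered candidate is selected; that rule, the `cpu_t` record, the `RDV_*` codes, and both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical ISA tags, named after the processor types mentioned above. */
typedef enum { ISA_ITANIUM, ISA_PA8800 } isa_t;

/* Hypothetical record for one processor assigned to a partition. */
typedef struct {
    int   cell_id;  /* cell that holds the processor */
    int   cpu_id;   /* index of the processor within its cell, >= 0 */
    isa_t isa;
} cpu_t;

enum { RDV_OK = 0, RDV_EMPTY = 1, RDV_ISA_MISMATCH = 2 };

/* Checks that all processors of a partition share one ISA and designates
 * a master cell. ASSUMPTION: the lowest cell id becomes the master cell. */
int rendezvous(const cpu_t *cpus, size_t n, int *master_cell)
{
    if (n == 0)
        return RDV_EMPTY;

    int lowest = cpus[0].cell_id;
    for (size_t i = 1; i < n; i++) {
        if (cpus[i].isa != cpus[0].isa)
            return RDV_ISA_MISMATCH;  /* incompatible ISA in the partition */
        if (cpus[i].cell_id < lowest)
            lowest = cpus[i].cell_id;
    }
    *master_cell = lowest;
    return RDV_OK;
}

/* Designates the master processor of one cell. ASSUMPTION: the lowest
 * cpu_id wins. Returns -1 if the partition has no processor in that cell. */
int cell_master_cpu(const cpu_t *cpus, size_t n, int cell_id)
{
    int master = -1;
    for (size_t i = 0; i < n; i++) {
        if (cpus[i].cell_id == cell_id &&
            (master < 0 || cpus[i].cpu_id < master))
            master = cpus[i].cpu_id;
    }
    return master;
}
```

A caller would invoke `rendezvous` once per partition and then `cell_master_cpu` once for each cell that contributes processors to that partition.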
Firmware
Most computer systems, including cellular computer systems, have firmware that executes at boot time on each processor. This firmware is responsible for performing self testing, for conducting interprocessor communications, including those involved in the Rendezvous, and for loading the operating system. Firmware is particularly responsible for boot-time operation; once an operating system is loaded, some firmware functions may be replaced by related operating system functions.
Error Detection and Handling
Firmware of such computer systems is generally capable of detecting errors that may arise from many causes. As an example, these causes may include problems detected during self-test, interprocessor communication problems, attempts to load incompatible code, rendezvous with processors of an incompatible instruction set architecture, and other errors. Cellular computer systems often recognize multiple errors from a single cause; for example, corrupt interprocessor communications can be recognized by each of the processors involved.
Many computer systems have firmware capable of handling single errors.
A common method of error reporting is the return-value mechanism. With this mechanism, each subroutine capable of detecting an error returns either zero or an error flag in a register.
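A minimal C sketch of the return-value mechanism follows. The routine, its additive checksum, and the `ERR_*` codes are hypothetical and chosen only to show the calling convention: zero means success, and any nonzero value is an error flag.

```c
#include <stddef.h>

/* Hypothetical error flags; zero is reserved for success. */
enum { ERR_NONE = 0, ERR_NULL_BUFFER = 1, ERR_BAD_CHECKSUM = 2 };

/* Hypothetical self-test: verifies a simple additive checksum over an image.
 * Returns ERR_NONE on success or a nonzero error flag on failure. */
int verify_image(const unsigned char *image, size_t len,
                 unsigned char expected_sum)
{
    if (image == NULL)
        return ERR_NULL_BUFFER;

    unsigned char sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (unsigned char)(sum + image[i]);  /* wraps modulo 256 */

    return (sum == expected_sum) ? ERR_NONE : ERR_BAD_CHECKSUM;
}
```

The caller receives a single integer and must test it after every call. The integer identifies the kind of failure but carries no further detail, such as which byte of the image was corrupt.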
Many routines in Unix and similar operating systems combine a single-flag version of the return-value mechanism with an error code stored in a known location. This error code, or ERRNO, can be read by a calling program to obtain some stored information about an error.
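The single-flag-plus-error-code convention can be illustrated with the standard C `errno` facility. In this sketch, `parse_long` is a hypothetical wrapper; `strtol`, `errno`, and `ERANGE` are standard C, and `EINVAL` is assumed to be available, as it is on POSIX systems.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical wrapper: returns 0 on success or -1 on failure.
 * On failure, errno holds the reason (ERANGE or EINVAL). */
int parse_long(const char *text, long *out)
{
    char *end;

    errno = 0;                      /* clear any stale error code */
    long value = strtol(text, &end, 10);

    if (errno == ERANGE)
        return -1;                  /* strtol already stored the code */

    if (end == text || *end != '\0') {
        errno = EINVAL;             /* no digits, or trailing characters */
        return -1;
    }

    *out = value;
    return 0;
}
```

The return value says only that something failed; the caller must then read `errno` before any later call overwrites it.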
Both the return-value mechanism and its ERRNO enhancement are capable of passing only limited information about an error.
In a large multicellular computer system, it is desirable to provide the ability to flexibly handle multiple errors. It is desirable to pass sufficient information about an error to other firmware components to permit proper diagnosis and, if possible, automatic recovery from the error. It is also desirable to automatically handle and recover from many potential errors so that at least partial system operation may continue.