Many failure modes are possible in existing multi-nodal or cellular architecture computer systems, and some of these failure modes are not well supported by existing boot or initial program load (IPL) firmware. In such a system, each system cell, or node, boots at a firmware level within the cell. The firmware of each cell then starts communicating with the firmware of the other cells, with the goal of forming one system in which, from the server OS's point of view, the cells are transparent and the system is presented externally as a single computer. This joining of cells is commonly referred to as rendezvous. Due to some sort of failure, such as a machine check abort (MCA), one or more cells may not make the rendezvous. In existing systems, each such cell reboots and is unavailable to the system. In other words, in existing multi-nodal systems, a cell that does not make rendezvous is left out of the system. As a result, if a particular cell has a resource that the system OS requires, and that cell fails to make rendezvous, the boot of the entire multi-nodal system may fail. Such a required resource may be the operating system disk drive, the console universal asynchronous receiver/transmitter (UART) connector, the local area network (LAN) system boot card, or the like.
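The rendezvous behavior described above can be illustrated with a minimal sketch. It assumes a simplified model in which each cell either is healthy or has failed before rendezvous; the names `Cell`, `rendezvous`, `system_can_boot`, and `REQUIRED_RESOURCES` are hypothetical and do not correspond to any actual firmware interface.

```python
from dataclasses import dataclass, field

# Hypothetical resource names, modeled on the examples in the text
# (OS disk drive, console UART connector, LAN system boot card).
REQUIRED_RESOURCES = {"os_disk", "console_uart", "lan_boot_card"}


@dataclass
class Cell:
    cell_id: int
    resources: set = field(default_factory=set)
    healthy: bool = True  # False models a failure (e.g. an MCA) before rendezvous


def rendezvous(cells):
    """Return the cells that joined the system; failed cells are left out."""
    return [cell for cell in cells if cell.healthy]


def system_can_boot(cells):
    """The system boots only if the joined cells hold every required resource."""
    available = set().union(*(cell.resources for cell in rendezvous(cells)))
    return REQUIRED_RESOURCES <= available
```

In this model, losing a cell that holds no required resource still leaves a bootable system, whereas losing the cell that holds, for example, the LAN boot card causes `system_can_boot` to return `False` for the whole system.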
Existing firmware user interfaces have been designed to be accessed under normal boot conditions and/or from a system-wide perspective, but these interfaces cannot be invoked at the cell level during cell or system boot. Typically, for multi-nodal or cellular architecture server-class computers, when an error state arises during system start-up or boot, an available interactive interface with the system, known as the console, is invoked. Firmware specialist engineers or developers are often involved in the diagnosis of boot-firmware-related problems. However, a firmware specialist or developer is typically not able to gain access to the firmware via this console. In existing multi-nodal computers, firmware runs at a very low level in each node, and the console does not allow selective access into the firmware of the individual cells of the system.
In existing firmware of multi-nodal systems that have an interpreter, it is possible to periodically poll, in the inner loop of the interpreter, for a certain escape sequence at the console. However, such an interpreter-based scheme requires that firmware code be rebuilt during every boot, calling for an inordinate amount of advance planning to place break points that invoke the console for a cell of a multi-nodal computer system.
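The polling approach described above can be sketched as follows. This is an illustrative model only: the `Console` class, the `run_interpreter` function, the two-character escape sequence, and the polling interval are all assumptions made for the example, not details of any existing firmware.

```python
from collections import deque

ESCAPE_SEQUENCE = "\x1b\x1b"  # hypothetical: two ESC characters


class Console:
    """Stand-in for console input: a queue of characters already typed."""

    def __init__(self, pending=""):
        self._pending = deque(pending)

    def read_available(self):
        """Return and consume every character currently waiting."""
        chars = "".join(self._pending)
        self._pending.clear()
        return chars


def run_interpreter(words, console, poll_every=4):
    """Execute words in order, polling the console every poll_every words.

    Returns (executed_words, interrupted).
    """
    executed = []
    buffer = ""
    keep = len(ESCAPE_SEQUENCE) - 1
    for index, word in enumerate(words):
        if index % poll_every == 0:
            buffer += console.read_available()
            if ESCAPE_SEQUENCE in buffer:
                return executed, True  # hand control to the console
            # Keep only a tail that could still begin an escape sequence.
            buffer = buffer[-keep:] if keep else ""
        executed.append(word)  # stands in for executing one interpreter word
    return executed, False
```

The sketch makes the limitation visible: the escape sequence is noticed only at the polling points inside the loop, so keystrokes that arrive after the last poll are never seen.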
Existing interrupt systems call for entering commands at compilation (i.e., anticipating where interrupts are needed) and then hitting the interrupt keys at the right time to catch the interrupt during boot-up. For example, a particular implementation of an existing firmware console invocation scheme in a Forth-system-based Precision Architecture Reduced Instruction Set Computer (PA-RISC) Open Boot PROM (OBP) firmware system requires storing commands in the nonvolatile random access memory run command (NVRAMRC) file, stored in system nonvolatile random access memory (NVRAM). Forth is a computer programming language used in the OBP system and in the IEEE P1275 standard for open firmware. These NVRAMRC-stored commands conditionally execute early in bootstrap, prior to activating boot device drivers. This general-purpose NVRAMRC file allows insertion of machine-code break points anywhere in the firmware, up to a fixed table limit. Upon encountering a break point, the processor state is saved and the firmware provides an interactive debugger prompt at the system console that has full access to debugger commands as well as all normal Forth system commands. These break points appear to remain persistent across reboots because the commands that insert them into the program text are re-executed on every boot that uses NVRAMRC.
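The apparent persistence of NVRAMRC break points can be modeled with a short sketch. It assumes a simplified firmware object whose break point table is cleared on every boot and then repopulated from stored commands; the `Firmware` class, its methods, and the table limit of 8 are hypothetical values chosen for illustration, not those of the OBP firmware.

```python
BREAKPOINT_TABLE_LIMIT = 8  # hypothetical fixed table size


class Firmware:
    """Toy model: stored NVRAMRC commands survive reboots, break points do not."""

    def __init__(self, nvramrc, use_nvramrc=True):
        self.nvramrc = list(nvramrc)    # addresses to break at; models NVRAM contents
        self.use_nvramrc = use_nvramrc  # models the conditional execution of NVRAMRC
        self.breakpoints = []           # volatile table, rebuilt on each boot

    def set_breakpoint(self, address):
        if len(self.breakpoints) >= BREAKPOINT_TABLE_LIMIT:
            raise RuntimeError("breakpoint table full")
        self.breakpoints.append(address)

    def boot(self):
        """Reload the program text, then re-run the stored commands if enabled."""
        self.breakpoints = []  # previously inserted break points are gone
        if self.use_nvramrc:
            for address in self.nvramrc:
                self.set_breakpoint(address)
        return list(self.breakpoints)
```

Calling `boot()` repeatedly yields the same break points each time only because the stored commands run again; with `use_nvramrc` set to `False`, a boot comes up with an empty table.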
Existing firmware, for example the firmware of multi-node PA-RISC or Scalable Parallel Processor (SPP) based computers, might have interruptible locations polled during bootstrap. These Forth-based capabilities are hidden from end users in the product, but are useful to firmware and kernel developers in the development labs. However, these interruptible locations are not flexible or changeable by a user and are only active momentarily, during a boot sequence. If a user does not interrupt at the right time using an interrupt key, or keystroke combination, the boot firmware flow is never interrupted. Since each node in an SPP has its own NVRAMRC, break points may be designated in any node or cell, not just the “monarch” node.
However, existing implementations of firmware consoles have not allowed for robustness in flow-of-control interruption. Such existing implementations have had the capability to interrupt boot of a cell of a multi-nodal computer system, but not at multiple points on individual cells. These implementations also do not have the ability to set multiple break points that remain persistent across boots through the setting of an NVRAM variable. Existing solutions do not enable setting break points through firmware NVRAM, nor do existing solutions enable a nearly infinite number of potential break points through an interrupt keystroke.