The present invention relates to the field of multiprocessing. More specifically, it pertains to a method and apparatus for resetting a multiprocessor system and for ensuring that the system periodically demonstrates its proper functionality. The invention also extends to a novel processing element for use in a multiprocessing system and to a computer readable storage medium including a program element implementing an operating system that can verify the functionality of the processing element.
Within the ever-evolving world of computer systems, a particular change has arisen with respect to the design of better and faster systems. Originally, systems were implemented in a uni-processor environment, whereby a single Central Processing Unit (CPU), hereafter referred to as processor, was responsible for all computer performance, including computations and IO. Unfortunately, uni-processor designs have built-in bottlenecks, where the address and data buses restrict data transfer to a one-at-a-time trickle of traffic, and the system program counter forces instructions to be executed in strict sequence. Rather than designing better, faster uni-processor machines which will never fully overcome the bottleneck limitation, a different computer system design was realized in order to effect real improvements in computer performance, specifically the multiprocessor system.
The multiprocessing environment involves the use of more than one processor, also referred to as Processing Element (PE), where these processors share resources, such as IO channels, control units, files and devices. Within a particular computer system, these processors may be in a single machine sharing a single bus or connected by other topologies (e.g. crossbar, grid, ring), or they might be in several machines using message-passing across a network. An important capability of the multiprocessor operating system is its ability to withstand equipment failures in individual processors and to continue operation. Although there are different basic operating system organizations for multiprocessor systems, one example is symmetric multiprocessing, where all of the processors are functionally equivalent and can perform IO and computation. In this case, the operating system manages a pool of identical PEs, any one of which may be used to control any IO device or reference any storage unit. Note that the same process may be run at different times by any of the PEs.
One of the roles of an operating system in a multiprocessor system is to provide the ability for application code to be given some CPU time in a regular fashion. A running instance of application code is known as a process, and the operating system ensures that each process is provided with CPU time in accordance with the needs of the process, as well as the requirements of the multiprocessor system.
In the case of a running Digital Multiplexing Switch (DMS), the system can contain thousands of processes in various states of readiness. It is always possible that some interaction of processes, or some software or hardware fault, will disable the ability of the system to run processes. Such a situation is commonly referred to as Insanity. A mechanism known as Sanity testing is provided so that a running switch can monitor itself to ensure that it has not entered a state where processes are not being run in a fair way. In other words, Sanity tests the system to ensure that it continues to perform the minimal necessary functions which allow the switch to perform useful work.
Robust real time systems require a “heartbeat” mechanism which demonstrates that the system is functioning properly. Failure to demonstrate proper functionality must result in recovery actions that do not depend upon the proper functioning of the system. Additionally, the ability to reset the system for maintenance purposes must exist at all times. For a symmetric multiprocessor system, for example a switch, these capabilities must be provided in a completely distributed, robust manner, ensuring not only a system-wide sanity, but also a per PE sanity, in order to detect and correct the major sanity dangers in the running switch. Examples of these sanity dangers include:
A process has run for longer than the maximum allowable amount of time without allowing a context switch;
A PE has developed a hardware fault which does not allow software to run on that PE;
A PE has developed a hardware fault which does not allow the operating system to receive system critical interrupts (e.g. timer interrupts);
The operating system queues are corrupted so that no PE can run a process;
The software load has been corrupted such that no process can run on any PE;
The software load cannot be restarted on any PE.
In existing uniprocessor systems, two complete instances of hardware and software exist. The two system instances execute in lockstep, whereby both system instances perform the exact same task at the exact same time. Loss of lockstep indicates the detection of a hardware error. Maintenance software is invoked to diagnose, isolate and recover from the fault. Each system instance has one timer which has to be cleared periodically to demonstrate that the real time nature of the system is functional. Manual reset is provided by a duplicated sub-system that has access to two lines of reset, one per system instance. When one system instance needs to be reset, the other system instance will take over as master for the uniprocessor system.
A multiprocessor, shared memory system as disclosed in co-pending U.S. patent application Ser. No. 08/774548, entitled “Shared Memory Control Algorithm for Mutual Exclusion and Rollback”, by Brian Baker and Terry Newell, and incorporated herein by reference, effects certain permanent system changes in “transactions”. In this system, multiple processors execute processes that may modify shared memory. Memory changes made by a process executing on a processor do not permanently affect the shared memory until the process successfully completes. During process execution, memory used by a process is “owned” by that process; read and write access by other processes is locked out. If a process does not successfully complete or attempts to access memory owned by another process, the process is aborted and memory affected by the process is “rolled back” to its previous state. Memory changes are only made permanent (or “committed”) upon successful process completion. In this context, “transactions” may be considered those intervals between initial system accesses that may ultimately permanently affect the system state, and the “committal” of the state changes to the system. This shared memory system is referred to as a transactional system.
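The ownership, rollback and commit behaviour described above can be illustrated with a minimal sketch. It is not the algorithm of the referenced application; the class `SharedMemory`, its methods and the integer process identifiers are hypothetical names chosen for illustration, and the model assumes a single thread of control standing in for several processors.

```python
class TransactionAborted(Exception):
    """Raised when a process touches memory owned by another process."""


class SharedMemory:
    """Toy transactional memory: changes stay private until commit (assumed model)."""

    def __init__(self, size):
        self.cells = [0] * size  # committed (permanent) state
        self.owner = {}          # address -> id of the owning process
        self.pending = {}        # process id -> {address: uncommitted value}

    def _claim(self, pid, addr):
        # Take ownership if the address is free; otherwise report the holder.
        holder = self.owner.setdefault(addr, pid)
        if holder != pid:
            # Conflict: the requesting process is aborted and rolled back.
            self.rollback(pid)
            raise TransactionAborted(f"process {pid} hit memory owned by {holder}")

    def read(self, pid, addr):
        self._claim(pid, addr)
        return self.pending.get(pid, {}).get(addr, self.cells[addr])

    def write(self, pid, addr, value):
        self._claim(pid, addr)
        self.pending.setdefault(pid, {})[addr] = value

    def commit(self, pid):
        # Successful completion: make the process's changes permanent.
        for addr, value in self.pending.pop(pid, {}).items():
            self.cells[addr] = value
        self._release(pid)

    def rollback(self, pid):
        # Discard uncommitted changes; committed state was never touched.
        self.pending.pop(pid, None)
        self._release(pid)

    def _release(self, pid):
        self.owner = {a: p for a, p in self.owner.items() if p != pid}


mem = SharedMemory(4)
mem.write(1, 0, 42)          # process 1 owns address 0
aborted = False
try:
    mem.write(2, 1, 7)       # process 2 owns address 1
    mem.read(2, 0)           # conflict with process 1: process 2 is aborted
except TransactionAborted:
    aborted = True
mem.commit(1)                # process 1 completes; its change becomes permanent
```

In this sketch the write by process 2 never reaches the committed state, because the abort discards its pending changes before any commit takes place.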
Further, a multiprocessor, shared memory computing system is disclosed in co-pending U.S. patent application Ser. No. 08/997,776, entitled “Computing System having Fault Containment”, by Barry Wood et al. and assigned to Northern Telecom Limited, the contents of which are also herein incorporated by reference. The multiprocessor system comprises a plurality of processing element modules, input/output processor modules and shared memory modules interconnected with the processing elements and input/output processors. The modules are interconnected by point to multi-point communication links. Shared memory is updated and read by exchanging frames forming memory access transactions over these links.
Unfortunately, in the case of these novel multiprocessor, shared memory computing systems, the above described fault recovery and manual reset solution does not work, specifically due to the fact that the reset lines act as single points of failure for the entire system.
The background information provided above clearly shows that there exists a need in the industry to provide an improved method for ensuring the proper functionality of a multiprocessor, shared memory computing system.
The present invention is directed to fault recovery in a multiprocessor computing system. Such systems typically comprise a plurality of Processing Element (PE) modules, Input/Output Processor (IOP) modules and shared memory modules interconnected with the processing elements and input/output processors.
In a specific example, the present invention permits three types of sanity detection. These are PE sanity, System sanity, and Scheduler sanity. PE sanity detection includes the ability to detect when a PE has become “locked up” due to hardware or software errors. System sanity detection includes the ability to detect conditions whereby processes are no longer able to work their way through the operating system time queues, onto the ready queue, and finally onto some processor. Scheduler sanity detection includes the ability to detect damage to scheduler data structures, as well as the detection of insane scheduler interrupt code.
In summary, the invention provides a novel PE that features a watchdog timer and a sanity timer. In operation both timers run until a certain limit count is reached. When this event occurs, the timer expires and issues a reset command that causes the PE module to be reset and taken out of service. In order to keep the PE running both timers must be cleared (set to an initial count value) before the limit count is reached.
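The behaviour of the two timers can be pictured with a small software model. The sketch below is an assumption-laden illustration rather than the hardware of the invention: the names `LimitTimer` and `ProcessingElement`, the tick-driven counting and the limit values are all hypothetical.

```python
class LimitTimer:
    """Counts up from an initial value; expires when the limit count is reached."""

    def __init__(self, name, limit, initial=0):
        self.name = name
        self.limit = limit
        self.initial = initial
        self.count = initial

    def clear(self):
        # Clearing sets the timer back to its initial count value.
        self.count = self.initial

    def tick(self):
        """Advance one count; return True if the limit count has been reached."""
        self.count += 1
        return self.count >= self.limit


class ProcessingElement:
    """A PE with a watchdog timer and a sanity timer (limits are made-up values)."""

    def __init__(self, watchdog_limit, sanity_limit):
        self.watchdog = LimitTimer("watchdog", watchdog_limit)
        self.sanity = LimitTimer("sanity", sanity_limit)
        self.in_service = True
        self.reset_cause = None

    def tick(self):
        if not self.in_service:
            return
        for timer in (self.watchdog, self.sanity):
            if timer.tick():
                # Expiry issues a reset: the PE is taken out of service.
                self.in_service = False
                self.reset_cause = timer.name
                return


pe = ProcessingElement(watchdog_limit=3, sanity_limit=10)

# Both timers are cleared in time, so the PE keeps running.
for t in range(20):
    pe.tick()
    if t % 2 == 1:
        pe.watchdog.clear()
        pe.sanity.clear()
healthy = pe.in_service

# Only the sanity timer is cleared from now on; the watchdog expires.
for _ in range(5):
    pe.tick()
    pe.sanity.clear()
```

The second loop shows why both timers must be cleared: keeping one of them serviced is not enough to keep the PE in service.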
The watchdog timer is controlled by the scheduler of the operating system. The purpose of the scheduler is to periodically assign PE processing time to different processes. The scheduler itself is a block of operating system code that is run by any one PE in order to effect task switching, in other words, switch from one process to another. The execution of this block of code causes the watchdog timer to be cleared. If the PE has locked-up, the scheduler will not be able to execute properly on the PE and the watchdog timer will reach the limit count that will cause the PE to be reset.
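A rough simulation makes the link between task switching and the watchdog timer concrete. In this sketch the function `run_pe`, the round-robin ready queue, the per-process burst lengths and the limit of five ticks are all assumptions made for illustration; the actual scheduler is not described at this level of detail.

```python
import math
from collections import deque


def run_pe(processes, total_ticks, watchdog_limit=5):
    """Simulate one PE. `processes` is a list of (name, burst) pairs, where burst
    is the number of ticks the process runs before allowing a context switch."""
    ready = deque(processes)
    watchdog = 0
    schedule = []
    current, remaining = None, 0
    for tick in range(total_ticks):
        if remaining == 0:
            # The scheduler code runs here: it switches to the next process
            # and, as a side effect of executing, clears the watchdog timer.
            current, burst = ready[0]
            ready.rotate(-1)
            remaining = burst
            watchdog = 0
            schedule.append(current)
        remaining -= 1
        watchdog += 1
        if watchdog >= watchdog_limit:
            # The scheduler did not run in time: the PE is reset.
            return {"reset": True, "tick": tick, "running": current,
                    "schedule": schedule}
    return {"reset": False, "tick": None, "running": current,
            "schedule": schedule}


# Every process yields before the limit count, so the watchdog never expires.
healthy = run_pe([("A", 2), ("B", 3)], total_ticks=20)

# "hog" never allows a context switch, so the scheduler cannot clear the timer.
stuck = run_pe([("A", 2), ("hog", math.inf)], total_ticks=20)
```

Because the clear is tied to execution of the scheduler itself, no separate "I am alive" call is needed from application processes in this model; any condition that prevents task switching on the PE is enough to let the timer expire.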
The sanity timer is controlled by a high priority system audit process, referred to as SYSMON. SYSMON is an operating system block of code whose execution is managed by the scheduler as any other utility process that may be run on the computer system at any given time. In a most preferred embodiment, each Field Replaceable Unit (FRU) of the computer system, such as PEs, IOPs and memory modules, includes a sanity timer. When SYSMON is run it causes the generation of an external clear signal for each FRU, where this external clear signal clears the FRU sanity timer. Thus, if SYSMON is not run, all of the sanity timers will expire at very close to the same time and reset their respective modules, resulting in a system-wide reset. SYSMON is useful for protecting against faults where the scheduler may be running properly but processes may not be able to execute in the appropriate manner.
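The system-wide effect of SYSMON can be sketched in the same spirit. The names `Fru`, `sysmon` and `simulate`, the module names, the common limit of ten ticks and the four-tick SYSMON period are hypothetical; the parameter `sysmon_runs_until` simply models a fault after which SYSMON is no longer scheduled.

```python
class Fru:
    """A field replaceable unit (PE, IOP or memory module) with a sanity timer."""

    def __init__(self, name, sanity_limit):
        self.name = name
        self.sanity_limit = sanity_limit
        self.count = 0
        self.reset_at = None  # tick at which the module reset itself, if any

    def external_clear(self):
        # The external clear signal generated when SYSMON runs.
        self.count = 0

    def tick(self, now):
        if self.reset_at is not None:
            return
        self.count += 1
        if self.count >= self.sanity_limit:
            self.reset_at = now


def sysmon(frus):
    """Audit process: running it clears the sanity timer of every FRU."""
    for fru in frus:
        fru.external_clear()


def simulate(frus, total_ticks, sysmon_period, sysmon_runs_until):
    """SYSMON runs every `sysmon_period` ticks until tick `sysmon_runs_until`."""
    for now in range(total_ticks):
        if now < sysmon_runs_until and now % sysmon_period == 0:
            sysmon(frus)
        for fru in frus:
            fru.tick(now)
    return {fru.name: fru.reset_at for fru in frus}


def make_frus():
    return [Fru(name, sanity_limit=10) for name in ("PE0", "PE1", "IOP0", "MEM0")]


# SYSMON is scheduled for the whole run: no module resets.
healthy = simulate(make_frus(), total_ticks=40, sysmon_period=4,
                   sysmon_runs_until=40)

# SYSMON stops being scheduled after tick 12: every sanity timer expires.
starved = simulate(make_frus(), total_ticks=40, sysmon_period=4,
                   sysmon_runs_until=12)
```

With identical limits, as assumed here, all modules in the starved run reset on the same tick, which corresponds to the system-wide reset described above.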
The invention also provides a computer readable storage medium including a program element for execution by a multiprocessor computing system, the program element implementing a scheduler capable of clearing the watchdog timer of a PE and a process capable of clearing the sanity timer of a PE.