This invention relates generally to processor systems, and more particularly to error handling in processor systems.
Computer systems may have a number of layers of software, and there may be several processors in one system that are linked together by these layers. Further, a computer system may be made of a number of other computer systems linked together over a network, or a computer system may be one processor with a number of layers.
Thus, computer systems utilize layering of software. Generally, a layer of software is responsible for handling a limited set of events or for providing a certain level of abstraction. A layer of software is a set of instructions that are executed on a processor. A layer may control the hardware components of a system and provide higher level functionality to another layer, and a layer may handle networking functions at the lowest level. An example of such a layer is firmware. Firmware can be designed to interface with a certain type of processor.
Layers are arranged hierarchically in computer systems, with one layer on top of another. Lower level layers, such as firmware, provide lower levels of abstraction. Higher level layers, such as operating systems, provide higher levels of abstraction. For example, to access data, a lower level layer may have to signal the read head on a hard drive and specify which platter to read the data from, whereas a higher level layer may simply send a command to read a file to a lower level layer.
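The hard drive example above can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the described system: the class names DiskFirmware and FileLayer, the platter and sector addressing, and the file table are all hypothetical and chosen solely to show a higher layer delegating physical detail to a lower layer.

```python
class DiskFirmware:
    """Hypothetical lower layer: addresses data by platter and sector."""

    def __init__(self, platters):
        # platters maps a platter number to {sector number: bytes}
        self._platters = platters

    def read_sector(self, platter, sector):
        return self._platters[platter][sector]


class FileLayer:
    """Hypothetical higher layer: callers name a file, not a location."""

    def __init__(self, firmware, file_table):
        self._firmware = firmware
        # file_table maps a file name to an ordered list of (platter, sector)
        self._file_table = file_table

    def read_file(self, name):
        # The physical layout is resolved here and delegated downward.
        return b"".join(
            self._firmware.read_sector(platter, sector)
            for platter, sector in self._file_table[name]
        )


firmware = DiskFirmware({0: {0: b"hello "}, 1: {3: b"world"}})
files = FileLayer(firmware, {"greeting.txt": [(0, 0), (1, 3)]})
data = files.read_file("greeting.txt")
```

In this sketch, code written against FileLayer never mentions platters or sectors, which mirrors how a higher level layer can remain unchanged when the lower level layer or the hardware beneath it changes.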
Firmware is one type of lower layer in processor systems. Firmware refers to processor routines that are stored in non-volatile memory structures such as read only memories (ROMs), flash memories, and the like. These memory structures preserve the code stored in them even when power is shut off. One of the principal uses of firmware is to provide the routines that control a computer system when it is powered up from a shut down state, before volatile memory structures have been tested and configured. The process by which a computer is brought to its operating state from a powered down or powered off state is referred to as bootstrapping. Firmware routines may also be used to reinitialize or reconfigure the computer system following various hardware events and to handle certain platform events like system interrupts.
Firmware is typically written in assembly language. This is a low level computer language that provides direct access to processor hardware and is closely tied to the processor architecture. The processor architecture is reflected in the rest of the platform, in part because of the assembly level firmware that is used to initialize, configure, and service platform level resources. For example, platform resources may transfer data through specified registers and/or memory locations defined by the Instruction Set Architecture (ISA), and platform level interrupts may be handled by referring to specified processor registers. Thus, initialization and configuration of platform level resources are tied to the ISA of the underlying processor.
Operating systems (OS) are another layer of software. Operating systems are a higher layer than firmware. Operating systems interact with firmware to provide an environment in which applications can be run. Some examples of operating systems are DOS, Microsoft Windows, Microsoft Windows NT and Unix. By utilizing firmware, an OS can be designed to run on many different processing systems without rewriting the OS for each variation in platform. As an example, Microsoft Windows NT can run on single processor systems and some dual processor systems without recompiling or rewriting the OS. Operating systems can be designed to run on a variety of architectures. An Intel Architecture 64 bit operating system (IA-64 OS) is an operating system written using IA-64 code that runs all IA-64 applications (both IA-64 and IA-32 code). Two flavors of IA-64 OS are possible: a 32-bit IA-64 OS that uses 32 bits for its pointer variables, and a 64-bit IA-64 OS that uses 64 bits for its pointer variables. Operating systems such as those described allow applications to be written without regard for the underlying architecture.
By using layers of software, upper layers such as the OS and user applications in a multiprocessor system can interact with lower layers such as firmware as if the system were a single processor system. Layering permits software to be developed for a system without regard to the hardware making up the system, including the number of processors in that system.
In computer systems, different layers are responsible for detecting and handling different errors. Some layers may detect an error and notify a higher or lower layer of the error. Other layers may both detect and handle the error.
In single processor systems, all layers execute on the same processor. If an error occurs, that processor handles the error by invoking the appropriate error handling hardware or routines. The error handling components or routines are part of the firmware or operating system.
In multiprocessor systems, sublayers or components of the firmware and operating system execute on different processors, and each processor may be executing separate firmware or firmware sublayers. An error in one processor may therefore be detected only by the firmware that processor is executing, while the other processors continue executing without knowledge of the error. Continued execution by the other processors may propagate the error and cause further damage, such as corrupted data. Errors are also more difficult to handle in a multiprocessor system because the layers may not be able to communicate effectively.
Multiprocessor systems may have to reboot or shut down in response to errors because of a lack of proper error handling, even though the same errors may be handled in single processor based systems without shutting down.
For the reasons stated above, and for other reasons stated below which will become apparent to those skilled in the art upon reading and understanding the present specification, there is a need in the art for a microprocessor system which allows the use of multiple processors and appropriately handles system hardware and software errors.
The present invention provides systems and methods for error handling on multiprocessor systems.
In accordance with the present invention, a system comprises a non-volatile memory and a plurality of processors. The non-volatile memory stores an error handling routine. On detecting an error, each processor of the plurality of processors accesses the error handling routine and, for certain errors, signals the remaining processors of the plurality of processors to enter a rendezvous state.
A method comprises detecting an error. A rendezvous state is entered for correcting the error. The error is corrected and normal operation is resumed.
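The sequence of detecting an error, entering a rendezvous state, correcting the error, and resuming normal operation can be illustrated with a minimal Python sketch in which threads stand in for processors. This is an assumption-laden illustration rather than the disclosed firmware: the event rendezvous_requested, the two barriers, the shared dictionary, and the log messages are all hypothetical names, and a real implementation would signal processors through platform mechanisms rather than thread primitives.

```python
import threading

NUM_PROCESSORS = 4  # hypothetical processor count

rendezvous_requested = threading.Event()             # signal to the remaining processors
enter_barrier = threading.Barrier(NUM_PROCESSORS)    # all processors have stopped normal work
exit_barrier = threading.Barrier(NUM_PROCESSORS)     # correction finished, safe to resume

shared = {"value": "corrupt"}  # stands in for state damaged by the error
log = []
log_lock = threading.Lock()


def record(message):
    with log_lock:
        log.append(message)


def processor(cpu_id, detects_error):
    if detects_error:
        record(f"cpu{cpu_id}: error detected")
        rendezvous_requested.set()  # ask every processor to rendezvous

    # Each processor waits for the request, then enters the rendezvous state.
    rendezvous_requested.wait(timeout=5)
    enter_barrier.wait(timeout=5)

    if detects_error:
        # Only the detecting processor corrects the error; the others
        # are held at the exit barrier and cannot propagate it.
        shared["value"] = "ok"
        record(f"cpu{cpu_id}: error corrected")

    exit_barrier.wait(timeout=5)
    record(f"cpu{cpu_id}: resumed")


threads = [
    threading.Thread(target=processor, args=(cpu_id, cpu_id == 0))
    for cpu_id in range(NUM_PROCESSORS)
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

Two barriers are used in this sketch so that no processor resumes before the correction is complete: the first confirms that every processor has stopped normal work, and the second releases them only after the detecting processor has finished.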
Other embodiments of systems and methods for error handling are disclosed.