1. Technical Field
The present invention relates generally to computer systems, and more specifically to a computer system having a central processing system and a service processor. In particular, the present invention allows a service processor and central processor to cooperate in fault recovery via registers within the central processing system accessible through a test port interface and interrupts provided to the central processing system.
2. Description of the Related Art
Modern computer systems have grown sufficiently complex that secondary service processors are used to provide initialization of the computer systems, component synchronization and, in some cases, startup assistance to components that do not completely self-initialize. In addition, data values and instructions are pre-loaded, and out-of-order execution is supported, making synchronization and reliability of the processing cores critical to proper operation. When an error occurs, re-synchronizing the contents and coherence state of all of the caches in a computer system can be a complex tracing problem. In addition, other errors may occur in system components for which an error may be detected by an operating system running on a main processor, but a recovery mechanism is only available to the service processor. Likewise, the service processor may be able to detect an error, but the operating system may need information to either attempt recovery or participate in a recovery mechanism engaged by the service processor. For example, the service processor may be able to reset a cache memory controller while a main processor may not, but the contents of the cache must be flushed by the operating system so that the system memory image is not corrupted.
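The cache controller example above can be illustrated with a minimal sketch. It is a toy software model only: the names `CacheController`, `os_flush`, and `service_processor_reset` are hypothetical, and the ordering shown (flush by the operating system first, reset by the service processor second) is an assumption inferred from the requirement that the system memory image not be corrupted.

```python
class CacheController:
    """Toy model of a cache controller (hypothetical, for illustration only)."""

    def __init__(self):
        # address -> modified value that has not yet been written to memory
        self.dirty_lines = {0x100: 7}
        self.faulted = True


def os_flush(cache, memory):
    """Operating system step: write modified lines back to system memory."""
    memory.update(cache.dirty_lines)
    cache.dirty_lines.clear()


def service_processor_reset(cache):
    """Service processor step: reset the controller once the cache is clean."""
    if cache.dirty_lines:
        # Resetting now would discard modified data and corrupt the memory image.
        raise RuntimeError("cache must be flushed before reset")
    cache.faulted = False


def recover(cache, memory):
    """Cooperative recovery: the flush precedes the reset (assumed ordering)."""
    os_flush(cache, memory)
    service_processor_reset(cache)
```

Neither step is sufficient alone in this model: the reset refuses to run on a cache holding modified data, and the flush does not clear the fault.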
As the speed of processors increases, the use of dynamic circuits and asynchronously timed interconnects forces modern processing system designs toward fault-tolerant operation. In addition, processing systems must be designed to handle certain fault rates, as opposed to past processing systems in which a single fault usually required halting execution of a processor to wait for the correction of the fault. Fault tolerance in the past has been directed at handling software faults that occur due to the difficulty of handling all combinations of execution that might occur on one or more processors in a particular sequence of instructions. The new trends in circuit designs increase the need for tolerance of hardware faults, which have been corrected in the past by a hardware reset.
The need for fault-tolerant designs comes in part from increasing consumer demand for both reliability and higher processing speeds. One way to increase the rate at which a circuit can evaluate the next state in a computational engine is to permit an increase in the error rate for that evaluation.
A single-processor system can easily reset the processor core. For a multiprocessor system, core-resetting is not a simple operation, as the interdependencies of memory values based on cache storage raise the potential to corrupt computations being performed on the entire machine. In addition, core-resetting typically requires shutdown and subsequent restart of the operating system.
In light of the foregoing, it would be desirable to provide a method and apparatus for fault recovery in a multiprocessing system.
A data-processing system includes a service processor and a main processor communicating via an operating system and an interface register within the main processor that can be accessed through a test port interface.
The data-processing system also includes at least one memory, a test port for coupling the main processor to the service processor, and an interface register within the main processor coupled to the test port for exchanging information between the operating system and the service processor. An interrupt connection from the interface register to the main processor execution units provides an indication to the operating system so that information written by the service processor via the test port may be provided to the operating system without polling. Additionally, an attention indication to the service processor is triggered by the operating system writing information to the interface register, such that the service processor may retrieve the information.
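The two signaling paths described above can be sketched as a small software model. This is a simplified illustration, not the hardware itself: the class `InterfaceRegister`, its method names, and the callbacks standing in for the interrupt connection and the attention indication are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class InterfaceRegister:
    """Hypothetical model of the interface register within the main processor."""

    # Stands in for the interrupt connection to the main processor execution units.
    on_interrupt: Callable[[int], None]
    # Stands in for the attention indication to the service processor.
    on_attention: Callable[[], None]
    value: int = 0

    def write_from_test_port(self, value: int) -> None:
        # Service processor side: a write via the test port interrupts the
        # operating system, so the operating system does not need to poll.
        self.value = value
        self.on_interrupt(value)

    def write_from_os(self, value: int) -> None:
        # Operating system side: a write raises attention so that the
        # service processor knows there is information to retrieve.
        self.value = value
        self.on_attention()

    def read_from_test_port(self) -> int:
        # Service processor side: retrieve the information via the test port.
        return self.value


def demo():
    os_inbox = []    # values delivered to the operating system by interrupt
    attention = []   # attention indications seen by the service processor
    reg = InterfaceRegister(
        on_interrupt=os_inbox.append,
        on_attention=lambda: attention.append(True),
    )
    reg.write_from_test_port(0x1)   # service processor -> operating system
    reg.write_from_os(0x2)          # operating system -> service processor
    sp_read = reg.read_from_test_port() if attention else None
    return os_inbox, sp_read
```

In this model each direction has its own notification, so neither side needs to poll the register for new information.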
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.