1. Technical Field
The present invention relates generally to the field of computer systems and, more specifically, to a system, method, and computer program product for executing a reliable warm reboot of a partition that includes multiple processors in a logically partitioned system.
2. Description of Related Art
A logical partitioning option (LPAR) within a data processing system (platform) allows multiple copies of a single operating system (OS) or multiple heterogeneous operating systems to be simultaneously run on a single data processing system hardware platform. A partition, within which an operating system image runs, may be assigned a non-overlapping subset of the platform's hardware resources. In some implementations, a percentage of system resources is assigned such that system resources are essentially time-sliced across partitions. These platform allocable resources include one or more architecturally distinct processors with their interrupt management area, regions of system memory, and input/output (I/O) adapter bus slots. The partition's resources are represented by its own resource list, which is typically created and maintained by the system's underlying firmware and made available to the OS image.
Each distinct OS or image of an OS running within the platform is protected from the others such that software errors on one logical partition cannot affect the correct operation of any of the other partitions. At a given time, this protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms for ensuring that the various images cannot control any resources that have not been allocated to them. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, at a given time, each image of the OS, or each different OS, directly controls a distinct set of allocable resources within the platform.
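The disjoint-allocation property described above can be illustrated with a minimal sketch. The `Partition` class, the `find_overlaps` function, and the sample resource identifiers are hypothetical names introduced only for illustration; they are not taken from any actual platform firmware, which would maintain the resource lists itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partition:
    """Hypothetical resource list for one logical partition."""
    name: str
    processors: frozenset
    memory_regions: frozenset
    io_slots: frozenset

    def resources(self):
        # Tag each resource with its kind so processor 3 and I/O slot 3
        # are treated as different resources.
        return (
            {("cpu", p) for p in self.processors}
            | {("mem", m) for m in self.memory_regions}
            | {("slot", s) for s in self.io_slots}
        )


def find_overlaps(partitions):
    """Return {resource: [partition names]} for resources claimed more than once."""
    owners = {}
    for part in partitions:
        for res in part.resources():
            owners.setdefault(res, []).append(part.name)
    return {res: names for res, names in owners.items() if len(names) > 1}


# Illustrative allocations (assumed values, not from the source).
lpar_a = Partition("A", frozenset({0, 1}), frozenset({"0x0000-0x3FFF"}), frozenset({1, 2}))
lpar_b = Partition("B", frozenset({2, 3}), frozenset({"0x4000-0x7FFF"}), frozenset({3}))
lpar_c = Partition("C", frozenset({3}), frozenset({"0x8000-0xBFFF"}), frozenset({4}))

if __name__ == "__main__":
    print(find_overlaps([lpar_a, lpar_b]))          # valid configuration
    print(find_overlaps([lpar_a, lpar_b, lpar_c]))  # processor 3 claimed twice
```

In this sketch, an empty result means the allocations are disjoint; any entry in the result identifies a resource that more than one partition would control, which is exactly the condition the platform must prevent.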
The ability to reboot one of multiple partitions in a logically partitioned system is an important requirement in such a system. This requirement stems from the fact that partitions are supposed to act and behave like independent systems. An independent computer system may be restarted using either a cold reboot or a warm reboot.
A cold reboot is defined as restarting the computer system by cycling the power to the computer system off and then back on. When a cold reboot is executed, the various hardware components in the system are reset to a particular, defined state. When a processor is reset, the processor loses all history of what it had been doing prior to being reset. It does not continue to transmit I/O requests, and does not anticipate the receipt of any particular I/O response. When an I/O adapter is reset, it also does not continue to transmit I/O responses, and does not anticipate the receipt of any particular I/O request.
A warm reboot is defined as restarting the computer system without cycling the power off and then back on. In a cold boot, system components are tested prior to initialization to ensure that the hardware is functioning properly before control can be passed to the OS. In a warm boot scenario, since the system is assumed to have been operating prior to the reboot request, testing of certain system components can be skipped, thereby speeding up the boot. In an LPAR environment, a cold boot is not an option, since cycling the power to the system off and then back on impacts not only the partition being rebooted but all other partitions as well.
Typically, a warm reboot is executed from the operating system level. During a warm reboot of a partition that includes multiple processors, I/O activity in the partition being rebooted may continue. Processors in the partition that are not the processor that initially received the reboot request may be transmitting data to an I/O adapter when the reboot request occurs. In addition, the I/O adapters may be transmitting data back to the processors.
It is not practical, however, to reboot a partition in a logically partitioned system using the same cold reboot method used in independent systems. When an independent system is rebooted using a cold, or hard, reboot, the power of the system is cycled off and then back on. When a reboot of an independent system is executed, in most cases it is treated in the same way as a cold reboot. Thus, when an operating system initiates a reboot, the power is cycled off and then back on for the system. This approach is not practical for rebooting only one of the multiple partitions of a logically partitioned system. Power to the logically partitioned hardware cannot be cycled for just one partition. Cycling the power would affect all partitions.
When a reboot request is issued from the OS to reboot a partition, it is sent to one of the processors in the partition. This processor can control the processes/tasks running on it, so, prior to passing the reboot request to firmware, it is able to cease all I/O activity to and from itself. If the partition has only one processor, there is thus a mechanism to stop all I/O activity prior to the start of a partition reboot. In the case where a partition consists of multiple processors, the other processors have no knowledge of the reboot request until the information is conveyed to them by the processor that received the reboot request. Since there is no way to send a simultaneous request to all processors in a partition, during the time that it takes the “receiving” processor to inform the other processors in the partition of the pending reboot request, they may have already initiated I/O transaction(s). These pending I/O transactions cause problems when the partition is being rebooted.
Executing a warm reboot in just one partition of a logically partitioned system can cause unreliable results when the partition includes multiple processors. In the prior art, when a warm reboot occurs in a partition that includes multiple processors, one processor will receive the request to reboot. That processor will then tell the other processors to stop processing in preparation for a reboot. A problem occurs when one or more of these other processors has one or more outstanding I/O requests as the reboot is initiated. When the reboot occurs, system firmware is in control. As it proceeds to reboot the partition, an I/O adapter may respond to an I/O request that one of the processors issued prior to the reboot request. However, the processor that originally transmitted the request is no longer executing the task which produced the request. The firmware in effect receives an unsolicited I/O interrupt. Because the firmware is unable to determine whether the I/O response is the result of an I/O problem or of a previously issued request, the reboot fails.
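The failure sequence described above can be made concrete with a small discrete-time sketch. Everything in it is an assumption made for illustration: the function name `simulate_warm_reboot`, the tick-based timing model, the rule that each not-yet-notified processor issues exactly one I/O request on the tick before it is notified, and the rule that firmware takes control once the last processor has been notified. It models the prior-art problem only, not any actual firmware behavior.

```python
def simulate_warm_reboot(num_processors, notify_delay, io_latency):
    """Toy model of the prior-art warm reboot race (hypothetical timing model).

    Processor 0 receives the reboot request at tick 0 and quiesces its own
    I/O. It then notifies processor k at tick k * notify_delay. Each other
    processor is assumed to issue one I/O request on the tick just before it
    is notified; the adapter responds io_latency ticks after the request.
    """
    pending = []  # (tick at which the adapter responds, issuing processor)
    for cpu in range(1, num_processors):
        notified_at = cpu * notify_delay
        issued_at = notified_at - 1  # still unaware of the reboot request
        if issued_at >= 0:
            pending.append((issued_at + io_latency, cpu))

    # Assumption: firmware takes control once every processor is notified.
    firmware_takes_over = (num_processors - 1) * notify_delay

    # A response arriving at or after that tick has no task waiting for it,
    # so firmware sees it as an unsolicited I/O interrupt.
    unsolicited = [(tick, cpu) for tick, cpu in pending
                   if tick >= firmware_takes_over]
    return {"reboot_ok": not unsolicited, "unsolicited": unsolicited}


if __name__ == "__main__":
    print(simulate_warm_reboot(num_processors=1, notify_delay=2, io_latency=5))
    print(simulate_warm_reboot(num_processors=2, notify_delay=2, io_latency=5))
```

Under these assumptions, a single-processor partition never has an outstanding request when firmware takes control, while a multiprocessor partition fails whenever an adapter response arrives after the handoff, which is the situation the background describes.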
Therefore, a need exists for a method, system, and product for executing a reliable warm reboot of a partition in a logically partitioned system where the partition includes multiple processors.