1. Technical Field
The present invention relates generally to the field of computer systems and, more specifically, to a data processing system, method, and product in a logically partitioned system for preserving trace data after a partition crash.
2. Description of Related Art
A logical partitioning option (LPAR) within a data processing system (platform) allows multiple copies of a single operating system (OS), or multiple heterogeneous operating systems, to run simultaneously on a single data processing system hardware platform. Each partition, within which an operating system image runs, is assigned a non-overlapping subset of the platform's hardware resources. These allocable platform resources include one or more architecturally distinct processors with their interrupt management areas, regions of system memory, and input/output (I/O) adapter bus slots. A partition's resources are represented to its OS image by the partition's own Open Firmware device tree.
Each distinct OS, or image of an OS, running within the platform is protected from the others so that software errors in one logical partition cannot affect the correct operation of any other partition. This protection is provided by allocating a disjoint set of platform resources to be directly managed by each OS image and by providing mechanisms that ensure no image can control any resource that has not been allocated to it. Furthermore, software errors in the control of an operating system's allocated resources are prevented from affecting the resources of any other image. Thus, each image of the OS (or each different OS) directly controls a distinct set of allocable resources within the platform.
Many logically partitioned systems make use of a hypervisor. A hypervisor is a layer of privileged software between the hardware and the logical partitions that manages and enforces partition protection boundaries. The hypervisor is also referred to as partition management firmware. The hypervisor is responsible for configuring, servicing, and running multiple logical systems on the same physical hardware. It is typically responsible for allocating resources to a partition, installing an operating system in a partition, starting and stopping the operating system in a partition, dumping the main storage of a partition, providing communication between partitions, and providing other functions. To implement these functions, a hypervisor must also implement its own low-level operations, such as main storage management, synchronization primitives, I/O facilities, heap management, and other functions.
Typically, the hypervisor includes a trace buffer and a trace facility routine that executes within the hypervisor and writes trace data into the trace buffer. This single trace buffer is used to record trace data for all partitions in the logically partitioned system. Because the trace buffer is of limited size, older trace data is continually overwritten by new trace data.
When an error occurs within the logically partitioned system, an exception handler routine writes trace data related to the error to the trace facility. This trace data may be critical when evaluating the cause and/or effect of the error. Because a single trace buffer records the trace data for every partition, even a small delay in retrieving the trace data currently in the buffer will result in that data being lost, since it is constantly being overwritten. Thus, the data related to the error will be overwritten if it is not retrieved quickly after the error occurs.
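The overwrite problem described above can be illustrated with a minimal C sketch of a single fixed-size trace buffer shared by all partitions. The structure names, the eight-entry capacity, and the functions trace_write and trace_still_present are hypothetical, chosen only to show how an error record is lost once enough later writes from other partitions wrap around the buffer; the sketch is not an actual hypervisor trace facility.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical sizes and layout, chosen for illustration only;
 * this is not an actual hypervisor trace facility API. */
#define TRACE_BUF_ENTRIES 8
#define TRACE_MSG_LEN 32

/* One trace record, tagged with the partition that produced it. */
struct trace_entry {
    int partition_id;
    unsigned long seq;             /* global sequence number of this write */
    char msg[TRACE_MSG_LEN];
};

/* A single buffer shared by every partition in the system. */
struct trace_buffer {
    struct trace_entry entries[TRACE_BUF_ENTRIES];
    unsigned long next_seq;        /* total number of writes so far */
};

/* Append a record. Once the buffer is full, each new write lands in the
 * slot holding the oldest record and overwrites it. */
void trace_write(struct trace_buffer *tb, int partition_id, const char *msg)
{
    assert(tb != NULL && msg != NULL);
    struct trace_entry *e = &tb->entries[tb->next_seq % TRACE_BUF_ENTRIES];
    e->partition_id = partition_id;
    e->seq = tb->next_seq++;
    strncpy(e->msg, msg, TRACE_MSG_LEN - 1);
    e->msg[TRACE_MSG_LEN - 1] = '\0';
}

/* Return 1 if the record written with sequence number seq has not yet
 * been overwritten, 0 otherwise. */
int trace_still_present(const struct trace_buffer *tb, unsigned long seq)
{
    return seq < tb->next_seq && tb->next_seq - seq <= TRACE_BUF_ENTRIES;
}
```

For example, if partition 1 records a fatal exception and partitions 2 and 3 then perform eight more routine writes before anyone retrieves the buffer contents, trace_still_present for the error record returns 0: the shared buffer has wrapped and the error data is gone.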
Therefore, a need exists for a method, system, and product in a logically partitioned system for preserving trace data after a partition crash.