1. Technical Field
The present invention relates to managing partitioned systems. More particularly, the present invention relates to a system and method for reporting platform errors that are detected by the platform and reported to more than one partition within a computer system.
2. Description of the Related Art
Logical partitioning is the ability to make a single multiprocessing system run as if it were two or more independent systems. Each logical partition represents a division of resources in the system and operates as an independent logical system. Each partition is logical because the division of resources may be physical or virtual. An example of logical partitions is the partitioning of a multiprocessor computer system into multiple independent servers, each with its own processors, main storage, and I/O devices. One of multiple different operating systems, such as AIX, LINUX, or others, can be running in each partition.
In a Logically Partitioned (LPAR) multiprocessing system, there is a class of errors (Local) that are reported only to the operating system of the assigned or owning partition. Failures of I/O adapters that are assigned to only a single partition's operating system are an example. There is also another class of errors (Global) that are reported to each partition's operating system because they could potentially affect each partition's operation. Examples of this type are power supply, fan, memory, and processor failures.
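The distinction between Local and Global errors can be illustrated with a minimal sketch. The names `PlatformError`, `owner`, and `partitions_to_notify` are hypothetical and do not appear in this description; the sketch assumes only that a Local error carries the identity of its owning partition and that a Global error does not.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PlatformError:
    """Hypothetical record for an error detected by the platform."""
    component: str
    # Owning partition for a Local error; None marks a Global error.
    owner: Optional[str] = None


def partitions_to_notify(error: PlatformError, partitions: List[str]) -> List[str]:
    """Return the partitions whose operating systems receive the error."""
    if error.owner is not None:
        # Local error: reported only to the owning partition.
        return [p for p in partitions if p == error.owner]
    # Global error: reported to every partition.
    return list(partitions)
```

Under these assumptions, a failed I/O adapter owned by one partition is routed to that partition alone, while a fan failure is routed to all partitions, which is the case that produces duplicate reports.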
Logical partitioning is in common use today because it provides its users with flexibility to change the number of logical partitions in use and the amount of physical system resources assigned to each partition, in some cases while the entire system continues to operate. Logical partitioning is also used because certain applications or work environments may require a particular operating system.
For example, in a home-based business, a particular business application may be written for IBM's AIX® operating system, while another home application may be written for a Microsoft “Windows” operating system (such as Windows 98® or Windows 2000®). Rather than having separate computer systems for the various operating systems and applications, logical partitions allow the different applications and operating systems to be executed on the same physical machine. All of the operating systems can be loaded on one or more nonvolatile storage devices, such as hard disk drives (HDD), accessible by the computer system.
In some system environments, diagnostics are executed on the computer system periodically to determine whether the computer system requires maintenance. Services are provided to automatically receive reports from computer systems detailing the maintenance required. The diagnostic software is often included with the operating systems. Because each of the operating systems uses the same underlying hardware, the diagnostics for each operating system in a logically partitioned system are likely to detect and report the same error. In an automated service environment, having the same error reported multiple times may cause confusion and inefficiencies when servicing the systems. For example, if the AIX operating system detected that a firmware card within the computer was failing, it may send a report to one service organization to install a replacement card in the system. At the same time, another operating system loaded in the machine may report the same problem, causing either the same service organization or a different service organization to take action to replace the defective card.
What is needed, therefore, is a way of efficiently noting whether a hardware error has already been reported to one of the operating systems installed on a logically partitioned system.
It has been discovered that a flag can be used to detect when a hardware error has already been reported, thereby preventing duplicate servicing of the same hardware component. Computer system hardware and firmware cards have multiple components for providing a particular function, such as video display or communications, to the user. One of these components is a firmware error buffer, in which information identifying errors that have been detected in hardware is stored. In addition to the error identifiers, an Already Reported Flag (ARF) is included with each error to indicate whether the error has been reported to at least one operating system.
When an error is first reported, the ARF is set to “no” (i.e., “0”). After the first operating system requests error information and receives the error identifier, the ARF is set to “yes” (i.e., “1”), indicating that the corresponding error has been provided to one of the operating systems. Subsequently, when another operating system requests error information and retrieves the errors stored in the error buffer, the ARF will be used to indicate that the particular error has already been reported to one of the operating systems.
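The ARF life cycle described above can be sketched as follows. This is an illustrative model only, not the firmware implementation: the class and method names (`ErrorEntry`, `FirmwareErrorBuffer`, `log_error`, `retrieve_errors`) are hypothetical, and the sketch assumes that each retrieval returns the flag value as it stood before that retrieval and then marks the entry as reported.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ErrorEntry:
    """One slot in the (hypothetical) firmware error buffer."""
    error_id: str
    already_reported: int = 0  # ARF: 0 means "no", 1 means "yes"


@dataclass
class FirmwareErrorBuffer:
    entries: List[ErrorEntry] = field(default_factory=list)

    def log_error(self, error_id: str) -> None:
        """Record a newly detected error with its ARF set to 0 ("no")."""
        self.entries.append(ErrorEntry(error_id))

    def retrieve_errors(self) -> List[Tuple[str, int]]:
        """Return (error identifier, ARF) pairs for a requesting operating system.

        The ARF value returned is the one in effect before this request;
        afterwards every returned entry has its ARF set to 1 ("yes").
        """
        snapshot = []
        for entry in self.entries:
            snapshot.append((entry.error_id, entry.already_reported))
            entry.already_reported = 1
        return snapshot
```

In this model the first operating system to call `retrieve_errors` sees an ARF of 0 for a given error, and every later caller sees 1 for that same error.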
When the operating system retrieves the errors using diagnostics, it will create a report of detected errors in order to take corrective action to repair or maintain the computer system. For example, the errors with the ARF set to “no” can be highlighted to inform the user or service organization that these errors are newly reported. On the other hand, the report may note which errors have previously been reported so that a service or individual does not replace a component more than once.
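One way an operating system's diagnostics might use the ARF when building such a report is sketched below. The function name `build_report`, the input shape (a list of error identifier and ARF pairs), and the report wording are assumptions made for illustration; the description does not prescribe a report format.

```python
from typing import List, Tuple


def build_report(errors: List[Tuple[str, int]]) -> List[str]:
    """Format retrieved errors, distinguishing new ones from repeats.

    `errors` holds (error identifier, ARF) pairs, where an ARF of 0 means
    the error has not yet been reported to any operating system.
    """
    lines = []
    for error_id, arf in errors:
        if arf == 0:
            # Newly reported: highlight so service action is taken once.
            lines.append(f"NEW: {error_id}")
        else:
            # Already provided to another operating system: note only.
            lines.append(f"PREVIOUSLY REPORTED: {error_id}")
    return lines
```

A service organization receiving reports from several partitions could then act only on the entries marked as new and treat the others as informational.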
The foregoing is a summary and thus contains, by necessity, simplifications, generalizations, and omissions of detail; consequently, those skilled in the art will appreciate that the summary is illustrative only and is not intended to be in any way limiting. Other aspects, inventive features, and advantages of the present invention, as defined solely by the claims, will become apparent in the non-limiting detailed description set forth below.