1. Technical Field
The present invention relates generally to the field of computer architecture and, more specifically, to methods and systems for managing machine check interrupts during runtime.
2. Description of Related Art
As computers become more sophisticated, diagnostic and repair processes have become more complicated and require more time to complete. A service technician may "chase" errors through lengthy diagnostic procedures in an attempt to locate one or more components that may be causing the errors within the computer. Diagnostic procedures generally specify several possible solutions to an error or problem in order to guide a technician to a determination and subsequent resolution of the problem. However, diagnostic procedures generally point to a component that is a likely candidate for the error, and if the component is determined to be reliable, the problem may remain unresolved until the next error occurs. In addition to paying for new components, a business must also pay for the recurring labor costs of the service technician and lost productivity of the user of the error-prone computer.
Most computing systems use some sort of surveillance to help detect system problems during operation of the computing system. Surveillance is a communication system between the operating system, e.g., Advanced Interactive Executive (AIX), and a support system, e.g., a service processor. With typical surveillance, both the operating system and the support system send "heartbeat" messages to each other on a periodic basis. If either does not receive the heartbeat message from the other within a given period of time, it assumes that the other component has failed. As a result, the failure will be logged in a corresponding error log indicating that a repair action is necessary. However, in some instances the first error found during a machine check is not necessarily the actual cause of the machine check.
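The heartbeat exchange described above can be illustrated with a minimal sketch. It is not taken from any actual AIX or service processor implementation: the class `HeartbeatMonitor`, its fields, the 5-second timeout, and the log message format are all hypothetical, and timestamps are passed in explicitly so that the example is deterministic.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HeartbeatMonitor:
    """Tracks the last heartbeat received from one peer (hypothetical helper)."""
    peer_name: str
    timeout_s: float            # assumed surveillance window, in seconds
    last_seen: float = 0.0
    error_log: List[str] = field(default_factory=list)

    def record_heartbeat(self, now: float) -> None:
        # Called whenever a heartbeat message arrives from the peer.
        self.last_seen = now

    def check(self, now: float) -> bool:
        """Return True if the peer is presumed alive; otherwise log a failure."""
        silent_for = now - self.last_seen
        if silent_for > self.timeout_s:
            self.error_log.append(
                f"{self.peer_name}: no heartbeat for {silent_for:.1f}s, "
                "repair action necessary"
            )
            return False
        return True


# Each side watches the other.
os_watch = HeartbeatMonitor("service processor", timeout_s=5.0)  # kept by the OS
sp_watch = HeartbeatMonitor("operating system", timeout_s=5.0)   # kept by the SP

os_watch.record_heartbeat(now=0.0)
sp_watch.record_heartbeat(now=0.0)
os_watch.record_heartbeat(now=4.0)   # the service processor keeps sending
# The operating system sends nothing after t = 0.

service_processor_alive = os_watch.check(now=8.0)
operating_system_alive = sp_watch.check(now=8.0)
```

In this sketch the service processor is still presumed alive at t = 8 because its last heartbeat arrived within the window, while the silent operating system is logged as failed. Note that the log entry records only that a heartbeat was missed, not why, which is the limitation the paragraph above points out.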
Therefore, a method and system to prioritize multiple errors reported from a PCI bus and order the errors in a systematic list would be desirable.
The present invention provides a method for prioritizing bus errors for a computing system. A subsystem test is executed on a first subsystem from a plurality of subsystems on a bus system, wherein the subsystem test is specific to the first subsystem. An output is received in response to executing the subsystem test. In response to the output indicating an error on the first subsystem, a severity level is assessed based on the error. A subsystem test is then executed on each remaining subsystem from the plurality of subsystems on the bus system, wherein each subsystem test is specific to the subsystem on which it is executed. An output is received in response to executing each subsystem test. In response to the output indicating an error on any of the remaining subsystems, a severity level is assessed based on the error.
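The sequence summarized above (run a subsystem-specific test, receive its output, and assess a severity level when the output indicates an error) can be sketched as follows. This is an illustrative sketch only, not the claimed implementation: the subsystem names, the error names, the `Severity` levels, and the `SEVERITY_BY_ERROR` table are all hypothetical, and the final sort by severity is an assumption drawn from the stated goal of ordering the errors in a systematic list.

```python
from enum import IntEnum
from typing import Callable, Dict, List, Optional, Tuple


class Severity(IntEnum):
    """Hypothetical severity scale; higher values are more severe."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical mapping from an error name to a severity level.
SEVERITY_BY_ERROR: Dict[str, Severity] = {
    "target_abort": Severity.LOW,
    "parity_error": Severity.MEDIUM,
    "system_error": Severity.HIGH,
}

# A subsystem test returns an error name, or None if no error was found.
SubsystemTest = Callable[[], Optional[str]]
Finding = Tuple[str, str, Severity]


def prioritize_bus_errors(tests: Dict[str, SubsystemTest]) -> List[Finding]:
    """Run each subsystem-specific test and rank the reported errors."""
    findings: List[Finding] = []
    for subsystem, run_test in tests.items():
        output = run_test()                      # receive the test output
        if output is not None:                   # output indicates an error
            # Unknown errors default to LOW in this sketch.
            severity = SEVERITY_BY_ERROR.get(output, Severity.LOW)
            findings.append((subsystem, output, severity))
    # Assumed ordering step: most severe first; ties keep test order.
    findings.sort(key=lambda finding: finding[2], reverse=True)
    return findings


# Example with three hypothetical subsystems on one bus.
ranked = prioritize_bus_errors({
    "host_bridge": lambda: "target_abort",
    "pci_bridge": lambda: None,
    "io_adapter": lambda: "system_error",
})
```

Here `ranked` lists the `io_adapter` error ahead of the `host_bridge` error even though the latter was found first, and `pci_bridge` does not appear because its test reported no error.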