This invention relates generally to computer systems, and more particularly to identifying, capturing, isolating and diagnosing errors in computer system operation.
As is known in the art, a computer system can take the form of a workstation, server, personal computer, network appliance or, broadly speaking, other such general-purpose digital processing device. A computer system generally includes at least one central processing unit (CPU) that is used to execute computer instructions to perform various programming functions. The CPU communicates with other devices in the computer system through an interconnection subsystem, commonly called a bus. A system bus interconnects the CPU with main memory and can also connect, directly or indirectly, other devices of the computer system to the CPU, such as chip sets, graphic adapters, memory devices, and input/output (“I/O”) devices, such as keyboards, monitors, scanners and printers.
In terms of performance, computer systems have in recent years achieved dramatically higher clock speeds with lower operating voltages. Increased clock speeds, usually measured in megahertz (MHz), can allow computer applications to run faster and data to be transferred faster between devices. Lower operating voltages can advantageously reduce power consumption, which is important, for example, in miniaturization of integrated circuits and, in mobile computing, for extending battery operating times. Unfortunately, higher clock speeds can make accurate reception of bus signals more difficult, and lower operating voltages can make signals more susceptible to errors due to lower signal-to-noise ratios and resulting signal distortion.
Transient and other non-predictable errors in the signals within the computer system can arise from other causes as well, and often have a deleterious impact on computer system performance. Such errors can arise, for example, from manufacturers' defects in devices connected in or to the computer system, as well as degradation over time of such devices. Errors can also arise due to non-compatibility of add-on components of the computer system, such as I/O devices and adapter cards, which are integrated into the computer system by customers, e.g., through “plug and play” operation. Where such devices malfunction, or simply exhibit operating parameters unanticipated by the original computer manufacturer, errors can arise. Such errors can result in lost or corrupted data, and, in extreme cases, such errors can cause system crashes.
Conventionally, the way to capture and isolate such errors has been through re-running the computer application during which the errors arose, with the devices instrumented to identify the errors and provide error-related information to an external logic analyzer. An object of this approach is to identify the specific device that initially caused an error, i.e., that was responsible for the first occurrence of the error, also known as “first failure”. One difficulty with this approach lies in differentiating the first failure from other effects of the errors as they propagate through downstream devices of the computer system. Another drawback of this approach is that the instrumentation added to the devices for monitoring operation can affect the system, and even temporarily hide or modify a failure condition. Additional drawbacks include labor, downtime, and other costs related to the attachment of hardware instrumentation and the use of the external logic analyzer. It would be desirable to provide a technique for identifying, capturing, isolating, and diagnosing errors arising in computer systems that overcomes at least some of the difficulties of conventional approaches.
In accordance with the principles of the invention, in a failure management system, information regarding the operating conditions of a computer system is stored in a storage, which is dedicated to the failure management system. The storage is updated with the current operating conditions either periodically or upon the occurrence of predetermined events. When a first failure identification mechanism identifies a failure in the computer system, a capture mechanism interrupts the updating of the storage, leaving information regarding operating conditions which contributed to the failure in the storage. This latter information can then be read out to aid in diagnosis of the failure. Since the operating condition information is stored in a dedicated storage, the information is not modified by events that take place after the failure is identified.
More specifically, the computer system ordinarily holds state and other operating information in a set of storage devices, such as, for example, state registers. The dedicated storage device can be a shadow register or other shadow storage device for holding a separate dedicated copy of at least a portion of the operating information so that it is readily available in case a failure is detected. During operation, an updating mechanism continually transfers the information in the state registers to the shadow register until a first failure is detected. For example, this transfer can be carried out periodically or when the information in the state registers changes. When a failure is detected, a capture mechanism controls the updating mechanism to cease transferring information from the state registers to the shadow register. The shadow register can then output its contents, e.g., for analysis, preferably under computer program control.
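By way of illustration only, the updating and capture behavior described above can be modeled in software. The following is a minimal sketch, not the disclosed hardware: the class name `ShadowedRegisters`, its methods, and the register names `pc` and `status` are hypothetical, and the sketch assumes the variant in which the shadow copy is refreshed whenever a state register changes.

```python
class ShadowedRegisters:
    """Toy model of state registers with a dedicated shadow copy (hypothetical names)."""

    def __init__(self, names):
        self.state = {name: 0 for name in names}  # live state registers
        self.shadow = dict(self.state)            # dedicated shadow storage
        self.frozen = False                       # set by the capture mechanism

    def write(self, name, value):
        """Normal operation: change a state register, then mirror the change."""
        self.state[name] = value
        self._update_shadow()

    def _update_shadow(self):
        """Updating mechanism: copy state to shadow unless capture has occurred."""
        if not self.frozen:
            self.shadow = dict(self.state)

    def signal_first_failure(self):
        """Capture mechanism: cease all further transfers to the shadow storage."""
        self.frozen = True

    def read_shadow(self):
        """Output the preserved contents, e.g., for analysis."""
        return dict(self.shadow)


regs = ShadowedRegisters(["pc", "status"])
regs.write("pc", 0x100)
regs.write("status", 1)
regs.signal_first_failure()   # a first failure is detected here
regs.write("pc", 0x104)       # later activity changes only the live state
snapshot = regs.read_shadow()
```

In this sketch, `snapshot` retains the values held at the moment of the failure, while `regs.state` continues to change, which mirrors the point that events after the failure do not modify the dedicated copy.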
The first failure management system can be implemented in a computer system. Conventional computer systems have a set of registers or other storage components for holding state information regarding execution of computer programs, and error flags of one or more bits indicative of error conditions. Computer systems can also be equipped with other storage components for holding other system information, such as, e.g., temperature within the computer's housing, which may be useful to diagnose system operating errors. The first failure management system can include error logic responsive to the error flags from the storage components for generating a first failure indicating signal, which can be provided as an error notification signal output. The first failure management system can also include a shadow register chain, history queue or other shadow storage locations. The shadow storage locations receive a copy of at least a portion of the operating and error information from the storage components, and store that copy so that it is available in case of an error condition.
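As a rough illustration of error logic of this kind, the sketch below latches the first component whose error flag bits become nonzero and thereafter holds the first failure indicating signal asserted. The class `ErrorLogic`, the component names `bus` and `mem`, and the tie-break rule (the first nonzero entry in the sample wins) are assumptions made for the example and are not taken from the description.

```python
class ErrorLogic:
    """Toy error logic: latches the first component to raise an error flag."""

    def __init__(self):
        self.first_failure = None  # component latched as the first failure, if any

    def sample(self, error_flags):
        """Examine one sample of error flags.

        error_flags maps a component name to its integer flag bits.
        Returns True while the first failure indicating signal is asserted.
        """
        if self.first_failure is None:
            for name, bits in error_flags.items():
                if bits != 0:
                    # Assumed tie-break: first nonzero entry in this sample.
                    self.first_failure = name
                    break
        return self.first_failure is not None


logic = ErrorLogic()
quiet = logic.sample({"bus": 0, "mem": 0})         # no error yet
raised = logic.sample({"bus": 0, "mem": 0b10})     # mem raises a flag first
still = logic.sample({"bus": 0b1, "mem": 0b10})    # a downstream error follows
```

Because the latch is written only once, the later `bus` flag does not displace `mem` as the recorded first failure, which is the distinction between a first failure and its downstream effects.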
The control signal to which the shadow storage devices are responsive can be a special clock signal, for example, which controls shifting of the shadow register or overwriting of the history queue, so as to continue updating of the contents of that register or queue for so long as no error is detected. Upon error detection, the capture mechanism discontinues the clock signal, freezing the contents of the shadow register until such time that the contents can be provided as an output from the operating information capture mechanism.
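The gated-clock behavior can likewise be sketched in software, here for the history queue variant. This is a simplified model under stated assumptions: the class `HistoryQueue`, its `depth` parameter, and the integer samples are hypothetical, and a bounded `collections.deque` stands in for a queue in which the oldest entry is overwritten.

```python
from collections import deque


class HistoryQueue:
    """Toy history queue that advances only while its clock is enabled."""

    def __init__(self, depth):
        self.entries = deque(maxlen=depth)  # oldest entry is dropped when full
        self.clock_enabled = True

    def clock_tick(self, sample):
        """One clock edge: record the sample only if the clock is still running."""
        if self.clock_enabled:
            self.entries.append(sample)

    def stop_clock(self):
        """Capture mechanism: discontinue the clock, freezing the contents."""
        self.clock_enabled = False


queue = HistoryQueue(depth=3)
for cycle in range(5):
    queue.clock_tick(cycle)   # normal operation overwrites old history
queue.stop_clock()            # error detected
queue.clock_tick(99)          # ignored: the clock has been discontinued
frozen = list(queue.entries)
```

After the clock is stopped, `frozen` holds the last three samples recorded before the error, and the later sample is not stored.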
Responsive to a control signal indicative of a first failure error, a scan controller can extract that information from the shadow storage locations and provide it as an output from the operating information capture mechanism. The scan controller can be implemented, e.g., as a service processor. A service processor is a processor that can scan the operating information in the shadow register chain and either provide that information as an output or execute an error-analysis program.
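For illustration, the scan-out step might be modeled as below. The sketch assumes the shadow register chain is a list of bits that leave serially from the end of the list, and that a field layout is known to the scan controller; the function names `scan_out`, `decode`, and `service`, the field names, and the field widths are all hypothetical.

```python
def scan_out(chain_bits):
    """Shift bits out of the chain one at a time, output end first."""
    bits = list(chain_bits)
    shifted = []
    while bits:
        shifted.append(bits.pop())  # the bit nearest the scan output leaves first
    return shifted


def decode(bits, layout):
    """Group scanned bits into named fields.

    layout is a list of (field_name, width) pairs in scan-out order;
    within a field the first bit scanned out is the most significant.
    """
    fields = {}
    pos = 0
    for name, width in layout:
        value = 0
        for bit in bits[pos:pos + width]:
            value = (value << 1) | bit
        fields[name] = value
        pos += width
    return fields


def service(first_failure, chain_bits, layout):
    """Scan controller: extract the chain only when a first failure is signaled."""
    if not first_failure:
        return None
    return decode(scan_out(chain_bits), layout)


LAYOUT = [("error_flags", 3), ("state", 4)]   # assumed field layout
chain = [0, 1, 1, 0, 1, 0, 1]                 # frozen shadow chain contents
idle = service(False, chain, LAYOUT)
report = service(True, chain, LAYOUT)
```

Here `report` is a dictionary of decoded fields that could be provided as an output or passed to an error-analysis program, and `idle` is `None` because no first failure was signaled.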
Accordingly, the invention can be used in identifying, capturing, isolating, notifying and diagnosing an error constituting the first failure in the system, and thus differentiating the first failure from other effects of that error as it may propagate through downstream devices of the computer system. The invention does not require the attachment of instrumentation or an external error analyzer because these components are preferably built into the system. Essentially, instrumentation implementing the invention can be formed directly on the same logic chip as the device that it is monitoring. Moreover, the invention can be used in automatically providing operating information, including the computer's state as of the error condition, with significantly less labor, downtime, costs and untoward effects associated with prior art attachment of hardware instrumentation and the use of an external logic analyzer.