1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of detecting which of a plurality of hardware devices in a computer system are failing, resulting in hanging of the computer system.
2. Description of Related Art
The basic structure of a conventional multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has several processing units, two of which 12a and 12b are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. The computer can also have more than two processing units.
A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ processor marketed by International Business Machines Corp. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as “on-board” when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit can include additional caches, such as cache 32, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 32 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 32 may be a chip having a storage capacity of 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 32 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 32. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels (L3, L4, etc.) of serially connected caches.
As computer systems have become more complex, it has contemporaneously become more difficult to determine the cause of computer malfunctions, in spite of extensive factory testing. Some malfunctions are more serious than others. For example, if an error occurs when a value is read from or written to the system memory device, a parity checking technique with built-in error control is often able to automatically correct the error, and the computer may continue operation with practically no noticeable interruption. More serious errors may generate interrupt signals which can temporarily delay computer processing. These interrupts can require various components to be reset, or may call interrupt handlers, monitoring routines or debugging software in order to deal with, and possibly determine the cause of, the problem.
In the most serious cases, a hardware failure can cause a computer component to halt operation, a fault condition referred to as a “hang.” When the component hangs, the entire computer system must usually be reset, that is, the power turned off and then back on again. This situation is not only inconvenient to users, but can further result in grievous loss of data, or crucial loss of control for an operation-critical system. These failures may arise either due to a soft error (a random, transient condition caused by, e.g., stray radiation or electrostatic discharge), or due to a hard error (a permanent condition, e.g., a defective transistor or interconnect line). One common cause of errors is a soft error resulting from alpha radiation emitted by the lead in the solder (C4) bumps used to form wire bonds with circuit leads.
It is accordingly important to be able to determine the true cause of a system failure (or as close as possible to the true cause) in order to address the problem and carry out appropriate repairs or replacement, as well as implement new engineering solutions for later manufacturing. However, in modern day systems having greater depth, when a computer access must go through several layers of devices to be serviced, it is often difficult or impossible to determine which component has caused the primary problem.
Consider, for example, a simple read operation. Referring to FIG. 1, a processor core such as 22 loads an instruction to retrieve (read) a particular data value (operand data) for further processing. In a problem-free system, when the processor executes the read operation, it passes the request down to data cache 26. If data cache 26 does not hold a valid copy of the requested value, then the request is passed to the L2 cache 32. If the value is also not present at L2 cache 32, then the request is passed down in a similar manner to lower levels of the memory hierarchy (if additional cache levels are present), until it is received by system memory 16. The value may not be in system memory, if it has temporarily been placed on a permanent storage device (hard disk drive, or HDD), e.g., in a “virtual memory” configuration. In such a case, the value must further be retrieved from the I/O device 14. Once the value is located, it is passed back up the memory hierarchy and loaded into processor core 22.
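The read path described above can be illustrated with a short software model. This is only a sketch for exposition, not a description of any actual hardware: the function name `read_value`, the use of dictionaries to stand in for each access layer, and the layer labels are all hypothetical, and the fill-on-return behavior is a simplifying assumption.

```python
def read_value(address, layers):
    """Walk an ordered list of (name, store) access layers, nearest first.

    Each store is a plain dict standing in for a cache, system memory, or
    an I/O device (hypothetical model, not real hardware behavior).
    Returns the value and the name of the layer that serviced the request.
    """
    for depth, (name, store) in enumerate(layers):
        if address in store:
            value = store[address]
            # Pass the value back up the hierarchy: every layer above the
            # one that hit receives a copy on the way to the processor core.
            for _, upper_store in layers[:depth]:
                upper_store[address] = value
            return value, name
    # No layer holds the address. In this model that is simply an error;
    # in hardware, a layer that never answers is what produces a hang.
    raise KeyError(address)


# Hypothetical hierarchy mirroring FIG. 1: data cache 26, L2 cache 32,
# system memory 16, and I/O device 14.
hierarchy = [
    ("data cache 26", {}),
    ("L2 cache 32", {}),
    ("memory 16", {0x100: 42}),
    ("I/O device 14", {0x200: 7}),
]
result = read_value(0x100, hierarchy)
```

In this model, a request that misses in both caches is serviced by memory 16, after which the two caches hold a copy of the value. The more layers a request must traverse, the more places a failure can occur before the value returns to the core.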
If any level in this access chain fails, then the entire system may hang. Under these circumstances, it is often unclear which component has actually caused the problem. It is sometimes necessary to have field diagnostics performed to determine the cause, which can be very expensive. Alternatively, several components might have to be replaced if the single failing component cannot be specifically identified. It would, therefore, be desirable to provide an improved method of indicating which component has caused a computer system to halt operation. It would be further advantageous if the method could allow a more accurate diagnostic call, or simplify debugging of the hang.
It is therefore one object of the present invention to provide an improved computer system.
It is another object of the present invention to provide an improved method of diagnosing operational problems in a computer system.
It is yet another object of the present invention to provide such a method which detects a primary component causing the computer system to hang.
The foregoing objects are achieved in a method of detecting a hang in a computer system, wherein the computer system includes a processing unit and a memory subsystem providing one or more access layers, generally comprising the steps of generating a plurality of hang strobe signals (including at least a first hang strobe signal for the processing unit, and a second hang strobe signal for the memory subsystem), detecting that a hang has occurred in the computer system using the hang strobe signals, and determining whether the hang occurred in the processing unit or in the memory subsystem. The intervals of the hang strobe signals may be programmably set or tuned to adjust the detection mechanism for the given system configuration. The hang strobe signals may have different intervals, and preferably the first hang strobe signal has an interval that is longer than an interval of the second hang strobe signal. More than two strobe signals may be provided, e.g., the generating step may further generate a third hang strobe signal for the memory subsystem, wherein the second and third hang strobe signals are applied to different access layers of the memory subsystem. The detecting step may be accomplished in part by calculating a number of hang pulses that have issued during pendency of a processor instruction, and then selectively comparing the number to either a first hang limit value associated with the processing unit, or a second hang limit value associated with the memory subsystem. This selection may be based on a signal indicating whether any requests are still pending in the memory subsystem. In one embodiment, the determining step also uses the signal indicating whether any requests are still pending in the memory subsystem. The hang limit values can also be programmably set.
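The pulse-counting comparison summarized above can be sketched in software as follows. This is a minimal illustrative model under stated assumptions, not the claimed circuit: the class name `HangDetector`, its fields and methods, the use of a single pulse counter, the "greater than or equal" comparison, and the rule that a hang is attributed to the memory subsystem whenever a request is still pending there are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HangDetector:
    """Counts hang pulses issued while one processor instruction is pending.

    Both limits are ordinary fields so that they can be programmably set,
    as the summary describes. All names here are hypothetical.
    """
    core_hang_limit: int     # first hang limit (processing unit)
    memory_hang_limit: int   # second hang limit (memory subsystem)
    pulses: int = 0

    def instruction_completed(self) -> None:
        # Forward progress was made, so the pulse count starts over.
        self.pulses = 0

    def on_hang_pulse(self, memory_request_pending: bool) -> Optional[str]:
        """Register one hang pulse; return the suspected unit, or None."""
        self.pulses += 1
        # Select which limit to compare against using the signal that
        # indicates whether any request is still pending in the memory
        # subsystem.
        if memory_request_pending:
            limit, suspect = self.memory_hang_limit, "memory subsystem"
        else:
            limit, suspect = self.core_hang_limit, "processing unit"
        # Assumption: reaching the limit (>=) counts as a hang.
        return suspect if self.pulses >= limit else None


detector = HangDetector(core_hang_limit=3, memory_hang_limit=5)
core_reports = [detector.on_hang_pulse(False) for _ in range(3)]
detector.instruction_completed()
memory_reports = [detector.on_hang_pulse(True) for _ in range(5)]
```

With the example limits, the third pulse with no memory request pending reports the processing unit, while five pulses with a request pending are needed before the memory subsystem is reported. Tuning the two limits (and, in hardware, the strobe intervals) adjusts how long each part of the system is allowed to stall before a hang is declared.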
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.