Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of isolating a fault in a complex computer system having a number of serially-connected (FSI chained) devices.
Description of the Related Art
As computer systems become increasingly complex, with more interconnected devices, the chance of errors arising within the system grows, and it becomes more difficult to diagnose the cause of these errors. Hardware-based operating errors can result in a period of downtime in which the computer is unavailable for use. For multi-user computers or clustered computing environments, such as mainframe computers, midrange computers, supercomputers, and network servers, the inability to use a particular computer may have a significant impact on the productivity of a large number of users, particularly if an error impacts mission-critical applications (e.g., the processing of bank transactions). Multi-user computers are typically used around the clock, and as a result, it is critically important that these computers be accessible as much as possible.
Hardware concurrent maintenance is often utilized to address the problems associated with computer downtime. Hardware concurrent maintenance is a process of performing maintenance on computer hardware while the computer is still running, thereby resulting in minimal impact to user accessibility. Conventional hardware concurrent maintenance typically requires that maintenance personnel physically remove one or more field replaceable units (FRUs) from a computer system. FRUs may be packaged in a very complex fashion and/or require special tools to enable removal without causing hardware damage.
Server systems generally have many FRUs. FIG. 1 depicts one example of a conventional server system 10. In this particular example, the server is controlled by a hardware management console (HMC) 12. HMC 12 is a dedicated workstation that provides a graphical user interface for configuring, operating, and performing basic system tasks for the server, including tasks related to the management of the physical server components and tasks related to virtualization features such as the logical partition configuration or dynamic reconfiguration of resources. HMC 12 communicates with a system controller 14a via an Ethernet connection to an Ethernet controller integrated into the system controller FSP chip. System controller 14a provides system initialization and node management, including error reporting. Inter-device communications may be implemented in server system 10 using a flexible service processor (FSP) located at the system controller. A flexible service processor is similar to a service processor, and may include, for example, a PowerPC™ processor having engines to drive communications interfaces. A redundant system controller 14b is provided, with a point-to-point FSI link between the FSP chips in the two system controllers. A plurality of server nodes 16a-16d carry out the main functions of the server, and may comprise a variety of interconnected devices, including multiple processors (primary and support), system memory and cache memories, fabric repeaters, sensors, etc.
FIG. 1 shows how an FSP can have a downstream fanout to other components via a serial link referred to as an FRU support interface (FSI), which is used to reach the endpoint controls (similar interconnections from the FSP in redundant system controller 14b are not shown for simplicity). In this example, the endpoints are common FRU access macros (CFAMs), which may be integrated into the microprocessors or other devices such as input/output (I/O) application-specific integrated circuits (ASICs). CFAMs have a standardized interconnect design, and provide FRU support for a variety of control interfaces such as JTAG, UART, I2C (IIC), GPIO, etc. A CFAM can have multiple FSI slaves with a hardware arbiter, allowing multiple FSI masters (on the support processors, etc.) to access the downstream components. The components may be interconnected via multiple CFAMs acting as hubs or links. Hub links are high-function links used specifically between processors. Accordingly, instead of an engine in the FSP directly controlling a device, multiple engines linked serially can pass control data to the device (FSI chaining).
In the case of a hardware failure within server system 10, code running on one of the system controllers generates an error log that lists one or more components suspected of being defective (the FRU callout list). A service call is then made to replace the hardware associated with those FRUs. A typical FRU callout list includes any FRU having hardware associated with the failure, and may therefore include FRUs that are not actually defective. Typically, a platform-specific, hard-coded look-up list is used to generate the FRU callout list. This approach is static: the list is fixed in advance for each platform rather than derived from the particular failure. For example, an error's callout may include all associated hardware along a path from a source (e.g., a service processor) to a destination (e.g., a thermal sensor or dual in-line memory module (DIMM)). The FRU callout list would have a minimum of one element, with the upper bound determined by the number of hardware FRU boundaries crossed by the interface wiring between the source and the destination.
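By way of illustration only, the static path-based callout described above can be sketched as follows. The hop names, FRU identifiers, and the function `callout_for_path` are hypothetical and do not correspond to any particular platform; the sketch merely shows how a hard-coded table yields a list whose length is bounded by the number of FRU boundaries crossed.

```python
from typing import Dict, List

# Hypothetical hard-coded table: each hop on the interface path is mapped to
# the FRU that physically contains it. All names are illustrative assumptions.
HOP_TO_FRU: Dict[str, str] = {
    "service_processor": "FRU_SYSTEM_CONTROLLER",
    "fsi_link_0": "FRU_SYSTEM_CONTROLLER",
    "cfam_hub": "FRU_NODE_PLANAR",
    "i2c_engine": "FRU_NODE_PLANAR",
    "dimm_thermal_sensor": "FRU_DIMM_3",
}


def callout_for_path(path: List[str]) -> List[str]:
    """Return every distinct FRU touched by the path, in source-to-destination order."""
    callout: List[str] = []
    for hop in path:
        fru = HOP_TO_FRU[hop]
        if fru not in callout:  # one entry per FRU, however many hops it holds
            callout.append(fru)
    return callout


# Path from a service processor to a DIMM thermal sensor.
PATH = ["service_processor", "fsi_link_0", "cfam_hub", "i2c_engine", "dimm_thermal_sensor"]
CALLOUT = callout_for_path(PATH)
```

In this sketch the path crosses two FRU boundaries, so the callout holds three FRUs regardless of which one actually failed; a path contained entirely within one FRU yields the one-element minimum.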
Another method of generating FRU callout lists is to have the error monitoring application take the industry-standard device driver error number (“errno”) and attempt to isolate the hardware failure algorithmically. Often this is done by accessing the associated hardware (via different methods such as boundary scan or scom) to read hardware registers, states, or status in order to determine a logical reason for the failure. Applications may also try to isolate the failure by communicating with other devices before and after a hub, or on another hub, in a deterministic attempt to limit the FRUs on the callout list.
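The probing approach described above may be sketched, under simplifying assumptions, as follows. The function `narrow_callout`, the `probe` callback, and the device and FRU names are hypothetical; an actual application would read hardware registers or status rather than consult a set of responding devices, and this sketch is not presented as any particular vendor's algorithm.

```python
from typing import Callable, Dict, List, Optional


def narrow_callout(
    chain: List[str],
    fru_of: Dict[str, str],
    probe: Callable[[str], bool],
) -> List[str]:
    """Probe devices from source to destination and stop at the first one that
    does not respond. The suspects are the FRU of the last responding device
    and the FRU of the first failing device, since the fault lies at or
    between them. An empty list means every device responded."""
    last_good: Optional[str] = None
    for device in chain:
        if probe(device):
            last_good = device
            continue
        suspects: List[str] = []
        for candidate in (last_good, device):
            if candidate is not None and fru_of[candidate] not in suspects:
                suspects.append(fru_of[candidate])
        return suspects
    return []


# Illustrative chain: an FSP reaches a sensor through two chained CFAMs.
CHAIN = ["fsp", "hub_cfam", "downstream_cfam", "sensor"]
FRU_OF = {
    "fsp": "FRU_SYSTEM_CONTROLLER",
    "hub_cfam": "FRU_NODE_A",
    "downstream_cfam": "FRU_NODE_B",
    "sensor": "FRU_NODE_B",
}
RESPONDING = {"fsp", "hub_cfam"}  # assumed probe results for this example
SUSPECTS = narrow_callout(CHAIN, FRU_OF, lambda d: d in RESPONDING)
```

Here the hub responds but the device behind it does not, so the callout is limited to the two FRUs on either side of the failing link and the system controller is excluded.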