Many different types of computing systems have attained widespread use around the world. These computing systems include personal computers, servers, mainframes, workstations, and a wide variety of stand-alone and embedded computing devices. Sprawling client-server systems exist, with applications and information spread across many PC networks, mainframes and minicomputers. In a distributed system connected by networks, a user may access many application programs, databases, network systems, operating systems and mainframe applications. Computers provide individuals and businesses with a host of software applications, including word processing, spreadsheet, web-publishing, database, and accounting applications. Further, networks enable high-speed communication between people in diverse locations by way of e-mail, websites, instant messaging, and web-conferencing.
At the heart of every computer, server, workstation and mainframe is at least one microprocessor. A common architecture for high-performance microprocessors is the reduced instruction set computer (RISC) architecture, characterized by a small, simplified set of frequently used instructions for rapid execution. Thus, in a RISC architecture, a complex instruction comprises a small set of simple instructions that are executed in steps very rapidly. These steps are performed in execution units adapted to execute specific simple instructions. In a superscalar architecture, these execution units typically comprise load/store units, integer Arithmetic/Logic Units, floating point Arithmetic/Logic Units, and Graphical Logic Units that operate in parallel. In a processor architecture, an operating system controls operation of the processor and components peripheral to the processor. Executable application programs are stored on a computer's hard drive. The computer's processor causes application programs to run in response to user inputs.
In multi-processor systems, a service processor (SP) serves a central electronics complex (CEC) which contains multiple processors. The SP comprises firmware for operation of the processors in the CEC. More particularly, the SP has boot firmware and host firmware. The boot firmware runs on the SP. It boots the SP during Initial Program Load (IPL); causes the host firmware to load in the CEC; and then continues to run to monitor the hardware and to correct errors if errors occur. The host firmware runs on processors in the CEC and serves customer software applications. The host firmware is downloaded into RAM in the CEC and starts to run once the boot firmware completes the boot process.
The service processor also comprises a JTAG (Joint Test Action Group) engine. The JTAG engine is a device that provides a means to transfer data between its buffer and a designated chip in the CEC. JTAG is the usual name for the IEEE (Institute of Electrical and Electronics Engineers) 1149.1 standard, entitled Standard Test Access Port and Boundary-Scan Architecture, which defines test access ports used for testing printed circuit boards using boundary scan. JTAG was standardized in 1990 as IEEE Std. 1149.1-1990. In 1994, a supplement that contains a description of the boundary scan description language (BSDL) was added. Since then, this standard has been adopted by electronics companies all over the world. While designed for printed circuit boards, JTAG is primarily used for testing sub-blocks of integrated circuits, and is also useful as a mechanism for debugging embedded systems, providing a convenient “back door” into the system. When JTAG is used as a debugging tool, an in-circuit emulator, which in turn uses JTAG as the transport mechanism, enables a programmer to access an on-chip debug module that is integrated into a CPU (Central Processing Unit). The debug module enables the programmer to debug the software of an embedded system.
Thus, a JTAG engine is a device that provides a means to transfer data to or from a designated chip in the CEC of a multiprocessor system. Suppose, for example, one desires to transfer data to a chip in the CEC. The firmware running on the SP will define the data and send it to a buffer in the JTAG engine. The JTAG engine will shift this data into the chip. The reverse is true for transferring data from a chip.
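The transfer just described can be illustrated with a minimal Python sketch. It is a simplified model under stated assumptions, not an implementation of the IEEE 1149.1 protocol or of any particular service processor: the names Chip, JtagEngine, write_to_chip and read_from_chip are hypothetical, and the chip is reduced to a single fixed-length scan register.

```python
from collections import deque


class Chip:
    """Hypothetical chip, modeled as one fixed-length scan register."""

    def __init__(self, length):
        # maxlen makes the deque behave like a shift register:
        # pushing a bit in at one end drops the bit at the other end.
        self.scan_register = deque([0] * length, maxlen=length)

    def shift(self, bit_in):
        """Shift one bit in and return the bit that is shifted out."""
        bit_out = self.scan_register[-1]
        self.scan_register.appendleft(bit_in)
        return bit_out


class JtagEngine:
    """Hypothetical engine that moves data between its buffer and a chip."""

    def __init__(self):
        self.buffer = []

    def write_to_chip(self, chip, bits):
        # Firmware on the SP defines the data and places it in the buffer.
        self.buffer = list(bits)
        # The engine then shifts the buffer into the chip, one bit at a time.
        for bit in self.buffer:
            chip.shift(bit)

    def read_from_chip(self, chip):
        # The reverse direction: shift the register contents out into the
        # buffer. In this simplified model the read is destructive, because
        # zeros are shifted in behind the data.
        length = len(chip.scan_register)
        self.buffer = [chip.shift(0) for _ in range(length)]
        return self.buffer


engine = JtagEngine()
chip = Chip(length=4)
engine.write_to_chip(chip, [1, 0, 1, 1])
data_read_back = engine.read_from_chip(chip)
```

In this model the first bit shifted in is also the first bit shifted out, so data_read_back holds the bits in the order in which they were written.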
Today, computer systems with high availability requirements use various error detection logic methods to ensure customer data integrity. When an error occurs in the system, it is reported to the Service Processor by way of an interrupt for further error analysis and fault isolation, so that the correct hardware part replacement can be determined. For a critical system error, the Service Processor extracts additional hardware state data by way of a “dump” process, and then reboots the system as part of the overall system recovery. For a non-critical system error or event, the Service Processor performs analysis and assists in error recovery where applicable. The communication between the Service Processor and the system hardware is via a “Service Bus” (or JTAG) and a Scan engine. When a hardware error occurs in either the JTAG service bus or the Scan engine, the Service Processor loses the ability to analyze and determine the criticality of the real system error or event that is reported to it. To ensure maximum customer data integrity, the Service Processor treats the JTAG service bus or Scan engine error as a system critical error: it extracts additional hardware state data by way of a “dump” process, and then reboots the system to clear/reset all errors.
The drawback of the current design is that the system hardware error or event reported to the Service Processor is not necessarily a critical error; it may instead be a recoverable error or other non-critical event. When the Service Processor loses the ability to access the hardware to determine the reason for an interrupt, the Service Processor assumes the worst case, thus sacrificing system availability in order to maximize customer data integrity.
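The handling policy described above, and its drawback, can be sketched in a few lines of Python. This is an illustrative model only: Severity, ServiceBusError, handle_interrupt and the action strings are hypothetical names chosen for the sketch, and the hardware access is reduced to a callable supplied by the caller.

```python
from enum import Enum


class Severity(Enum):
    CRITICAL = "critical"
    RECOVERABLE = "recoverable"
    EVENT = "event"


class ServiceBusError(Exception):
    """Hypothetical error: the JTAG service bus or scan engine failed."""


def handle_interrupt(read_error_state):
    """Return the actions the Service Processor takes for one interrupt.

    read_error_state is a hypothetical callable that reads the hardware
    over the service bus and returns a Severity, or raises
    ServiceBusError when the bus or the scan engine fails.
    """
    try:
        severity = read_error_state()
    except ServiceBusError:
        # The SP cannot reach the hardware, so it cannot tell what was
        # reported. It assumes the worst case.
        severity = Severity.CRITICAL

    if severity is Severity.CRITICAL:
        return ["dump", "reboot"]
    return ["analyze", "assist_recovery"]


def broken_bus():
    # Models a reported error that can no longer be read from the hardware.
    raise ServiceBusError("scan engine did not respond")


actions_when_readable = handle_interrupt(lambda: Severity.RECOVERABLE)
actions_when_bus_fails = handle_interrupt(broken_bus)
```

A recoverable error that the SP can read leads to analysis and assisted recovery, whereas any error reported while the bus or scan engine is failing ends in a dump and reboot, regardless of its actual criticality.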
What is needed is a recovery method to overcome an intermittent or transient JTAG service bus or scan engine error, so that the SP can continue analyzing the actual error and determine the correct error criticality.
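To make the stated need concrete, the following Python sketch shows one generic way a transient access failure might be ridden out: a bounded retry with an optional reset between attempts. It is offered only as an illustration of the general idea and is not presented as the method of this disclosure. TransientBusError, read_with_retry, max_attempts and reset_bus are all hypothetical names.

```python
class TransientBusError(Exception):
    """Hypothetical error: a service bus or scan engine access failed."""


def read_with_retry(read_error_state, max_attempts=3, reset_bus=None):
    """Try to read the error state, retrying after transient failures.

    read_error_state: hypothetical callable that returns the error state
        or raises TransientBusError.
    reset_bus: optional hypothetical callable invoked between attempts,
        for example to reset the scan engine.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    last_error = None
    for attempt in range(max_attempts):
        try:
            return read_error_state()
        except TransientBusError as err:
            last_error = err
            # Reset only if another attempt will follow.
            if reset_bus is not None and attempt < max_attempts - 1:
                reset_bus()
    # Every attempt failed: the fault is not transient, so the caller
    # falls back to its worst-case handling.
    raise last_error


attempts = {"count": 0}


def flaky_read():
    # Fails on the first two accesses, then succeeds.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TransientBusError("intermittent scan failure")
    return "recoverable"


state = read_with_retry(flaky_read)
```

If the read succeeds on a later attempt, the SP obtains the actual error state and can judge its criticality; if every attempt fails, the error propagates and the caller can still apply the conservative dump-and-reboot handling.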