1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for processor fault isolation.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
One of the areas that has seen considerable advancement is multiprocessing, using more than one processor in a single computer. In such a system, detecting faults in a single processor is a challenge. When a processor fails, it may not respond to in-band interrogation, that is, to normal processor commands presented through a main bus or a front side bus. One method of fault isolation therefore is to interrogate a processor out-of-band, through a JTAG port on the processor, for example. ‘JTAG’ is an acronym for Joint Test Action Group and is the name usually used to refer to the IEEE 1149.1 standard entitled Standard Test Access Port and Boundary-Scan Architecture. JTAG is a standard for test access ports used for testing printed circuit boards and components (including computer processors) using boundary scan. Boundary scan is a method for testing interconnects (thin wire lines) on printed circuit boards or sub-blocks inside of an integrated circuit. The boundary scan standard referred to as JTAG has been so widely adopted by electronic device companies all over the world that today ‘boundary scan’ and ‘JTAG’ are practically synonyms. In this specification, however, ‘boundary scan’ and ‘JTAG’ are not treated as synonyms. ‘Boundary scan’ as the term is used here refers to boundary scan operations generally, while ‘JTAG’ is used to refer to boundary scans according to the JTAG standard. That is, in this specification, JTAG is treated as an example of one kind of boundary scan, admittedly a widely used example, but nevertheless, just one example. The term ‘boundary scan’ includes not only JTAG, but also any kind of boundary scan that may occur to those of skill in the art.
The boundary scan architecture provides a means to test interconnects and clusters of logic, memories, and other circuit elements without using physical test probes. It adds one or more so called ‘test cells’ connected to each pin of a device that can selectively override the functionality of that pin. These cells can be programmed through a JTAG scan chain to drive a signal onto a pin and across an individual trace on the board. The cell at the destination of the board trace can then be programmed to read the value at the pin, verifying the board trace properly connects the two pins. If the trace is shorted to another signal or if the trace has been cut, the correct signal value will not show up at the destination pin, and the board will be known to have a fault.
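For further explanation, the interconnect test described above can be sketched as a small simulation. The sketch is illustrative only and is not part of any standard: the function names, the net names, and the fault model (an open trace reads as 0 at the destination, and shorted traces behave as a wired-OR) are assumptions chosen for the example. Each net is driven high in turn while all other nets are driven low, and any destination cell that reads a value different from the driven value marks its net as faulty.

```python
from typing import Dict, Iterable, List, Set, Tuple


def receive(driven: Dict[str, int],
            opens: Set[str],
            shorts: List[Tuple[str, str]]) -> Dict[str, int]:
    """Return the values the destination test cells would read.

    Assumed fault model (hypothetical): shorted nets act as a wired-OR,
    and a cut (open) trace reads as 0 at the destination pin.
    """
    seen = dict(driven)
    for a, b in shorts:
        merged = driven[a] | driven[b]
        seen[a] = seen[b] = merged
    for net in opens:
        seen[net] = 0
    return seen


def find_faulty_nets(nets: List[str],
                     opens: Iterable[str] = (),
                     shorts: Iterable[Tuple[str, str]] = ()) -> List[str]:
    """Drive a walking-one pattern and report nets that read back wrong."""
    faulty: Set[str] = set()
    for active in nets:
        # Source cells drive 1 on one net and 0 on every other net.
        driven = {n: int(n == active) for n in nets}
        seen = receive(driven, set(opens), list(shorts))
        faulty |= {n for n in nets if seen[n] != driven[n]}
    return sorted(faulty)


# Net C is cut and nets A and B are shorted together; D is healthy.
print(find_faulty_nets(["A", "B", "C", "D"], opens=["C"], shorts=[("A", "B")]))
```

In this sketch the healthy net D is never reported, while the cut net and both shorted nets are, which corresponds to the board being known to have a fault on those traces.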
When performing boundary scan inside integrated circuits, boundary scan latch cells, sometimes called ‘test cells’ or ‘latch cells’ or just ‘latches,’ are added between logical design blocks in order to be able to control them in the same manner as if they were physically independent circuits. For normal operation, the added boundary scan latch cells are set so that they have no effect on the circuit, and are therefore effectively invisible. Then when the circuit is set into a test mode, the latches enable a data stream to be passed from one latch to the next, serially, in a so-called ‘scan chain.’ As the cells can be used to force data into the board, they can set up test conditions. The relevant states can then be fed back into an external test system by clocking the data word back serially so that it can be analyzed.
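For further explanation, the serial behavior of a scan chain can be modeled as a simple shift register. The following is a minimal sketch under stated assumptions: the class name ScanChain and its methods are hypothetical, and the model ignores the capture and update stages of a real boundary scan cell. It shows only that a data word is clocked in one bit at a time and can later be clocked back out serially for analysis.

```python
from typing import List


class ScanChain:
    """Toy model of latches connected serially in a scan chain."""

    def __init__(self, length: int) -> None:
        self.latches: List[int] = [0] * length

    def shift(self, bit_in: int) -> int:
        """One clock: bit_in enters the first latch, the last latch's bit leaves."""
        bit_out = self.latches[-1]
        self.latches = [bit_in] + self.latches[:-1]
        return bit_out

    def scan(self, bits_in: List[int]) -> List[int]:
        """Shift a sequence of bits in and collect the bits shifted out."""
        return [self.shift(b) for b in bits_in]


chain = ScanChain(4)
chain.scan([1, 0, 1, 1])          # load a test pattern into the latches
print(chain.latches)              # [1, 1, 0, 1]: first bit in sits in the last latch
print(chain.scan([0, 0, 0, 0]))   # [1, 0, 1, 1]: pattern clocked back out in order
```

The bits leave the chain in the same order in which they entered, which is why an external test system can reconstruct the data word from the serial output.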
By adopting this technique, it is possible for a test system to gain test access to a board or to internal logic in an integrated circuit such as a computer processor or computer memory module. As most of today's boards are very densely populated with components and tracks, it is very difficult for test systems to access the relevant areas of the board to enable them to test the board. Moreover, most of the internal logic within an integrated circuit is not externally connected through pins or pads so that an external test system can access them at all. Boundary scan makes these things possible.
During product development, a JTAG port is normally connected to an external test system, such as, for example, AMD's Hardware Debug Tool or an American Arium, to read processor registers and control processor operations for test. In this configuration, all processors installed in the computer under test are in a single JTAG chain. In order to communicate with a specific processor, all others in the chain need to be placed in BYPASS mode to allow JTAG commands to pass through them. While this method is fine for code development, it poses challenges when using the processor's JTAG port for fault isolation purposes. One problem is that if a single processor's fault is catastrophic enough to render its JTAG port inoperable, the chain is broken. This would prevent communication with other processors in the chain which may still be viable and may hold clues to what went wrong.
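For further explanation, the single-chain arrangement and its weakness can be sketched as follows. This is a simplified model, not an implementation of the JTAG state machine: the names Processor and read_processor are hypothetical, a processor in BYPASS mode is modeled as a one-bit register, a failed JTAG port is modeled as driving nothing onto its output, and any failed port is treated as breaking the whole chain, as described above.

```python
from typing import List, Optional


class Processor:
    """Toy model of one processor's JTAG port in a serial chain."""

    def __init__(self, name: str, captured: List[int], alive: bool = True) -> None:
        self.name = name
        self.captured = list(captured)   # bits held in the data register
        self.alive = alive
        self.reg: List[int] = [0]

    def select(self, bypass: bool) -> None:
        # BYPASS places a single-bit register between the port's input and output.
        self.reg = [0] if bypass else list(self.captured)

    def clock(self, tdi: int) -> Optional[int]:
        if not self.alive:
            return None                  # an inoperable port passes nothing on
        tdo = self.reg[-1]
        self.reg = [tdi] + self.reg[:-1]
        return tdo


def read_processor(chain: List[Processor], target: int) -> Optional[List[int]]:
    """Read one processor's register with all others in BYPASS.

    Returns None if any port in the chain is inoperable.
    """
    for i, cpu in enumerate(chain):
        cpu.select(bypass=(i != target))
    downstream = len(chain) - 1 - target          # each adds one bit of delay
    out: List[int] = []
    for _ in range(downstream + len(chain[target].captured)):
        bit: Optional[int] = 0                    # chain input held low
        for cpu in chain:
            bit = cpu.clock(bit)
            if bit is None:
                return None                       # chain is broken
        out.append(bit)
    return out[downstream:][::-1]                 # drop delay bits, restore order


chain = [Processor("cpu0", [1, 0, 1, 1]),
         Processor("cpu1", [0, 1, 1, 0]),
         Processor("cpu2", [1, 1, 0, 0])]
print(read_processor(chain, 0))   # [1, 0, 1, 1]
chain[1].alive = False            # cpu1 suffers a catastrophic fault
print(read_processor(chain, 0))   # None: cpu0 is healthy but unreachable
```

In the sketch, a fault in one processor makes every processor in the chain unreachable, including those that are still viable.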
Another problem is that placing a processor in BYPASS mode typically is an operation effected out-of-band through a microcontroller, such as for example a Baseboard Management Controller (‘BMC’). A BMC is a specialized microcontroller embedded on the motherboard of many computers, especially servers. The BMC is the intelligence in the Intelligent Platform Management Interface (‘IPMI’) architecture. The BMC manages the interface between system management software and platform hardware.
Different types of sensors built into the computer system report to the BMC on parameters such as temperature, cooling fan speeds, power mode, operating system status, processor operations, and so on. The BMC monitors the sensors and can send alerts to a system administrator via the network if any of the parameters do not stay within preset limits, indicating a potential failure of the system. The administrator can also remotely communicate with the BMC to take some corrective action such as resetting or power cycling the system to get a hung operating system running again. These abilities save on the total cost of ownership of a system.
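For further explanation, the limit checking performed by a management controller can be sketched in a few lines. The sketch is illustrative only: the function name check_sensors, the sensor names, and the limit values are assumptions made for the example and do not come from the IPMI specification or from any particular BMC firmware.

```python
from typing import Dict, List, Tuple

Limits = Dict[str, Tuple[float, float]]   # sensor name -> (low, high)


def check_sensors(readings: Dict[str, float], limits: Limits) -> List[str]:
    """Return one alert message per reading that falls outside its preset limits."""
    alerts: List[str] = []
    for name, value in readings.items():
        low, high = limits[name]
        if not low <= value <= high:
            alerts.append(f"{name}={value} outside [{low}, {high}]")
    return alerts


# Hypothetical sensor names and limits, for illustration only.
limits = {"cpu_temp_c": (5.0, 85.0), "fan_rpm": (1000.0, 12000.0)}
readings = {"cpu_temp_c": 95.0, "fan_rpm": 4200.0}
print(check_sensors(readings, limits))
```

A real controller would forward such alerts to a system administrator over the network rather than print them.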
Physical interfaces to the BMC include System Management Buses (‘SMBs’), an RS-232 bus, address and data lines, and an Intelligent Platform Management Bus (‘IPMB’) that enables the BMC to accept IPMI request messages from other management controllers in the system. The BMC communicates with a BMC management utility (‘BMU’) on a remote client using IPMI protocols. The BMU is usually a command line interface (‘CLI’) application. The IPMI specification defines a set of common interfaces to computer hardware and firmware which system administrators can utilize to monitor system health and manage the system.
IPMI operates independently of the operating system and allows administrators to manage a system remotely even in the absence of the operating system or the system management software, or even if the monitored system has not powered on. IPMI can also function when the operating system has started, and offers enhanced features when used with the system management software. IPMI enables sending out alerts via a direct serial connection, a local area network (‘LAN’) or a serial over LAN (‘SOL’) connection to a remote client. System administrators can then use IPMI messaging to query platform status, to review hardware logs, or to issue other requests from a remote console through the same connections. The standard also defines an alerting mechanism for the system to send a simple network management protocol (‘SNMP’) platform event trap (‘PET’).
System management microcontrollers such as BMCs are small embedded devices that contain a small processor, a small quantity of memory in which is stored a microcontroller control program, and one or more I/O ports. The mechanics of putting a processor in BYPASS mode represent a large operational burden on such an embedded microcontroller. It is a benefit therefore if the microcontroller program code required for such a microcontroller to interrogate processors and isolate faults is kept as simple and streamlined as possible.
In addition, when an external device is using the JTAG chain for debug/development purposes, processors are normally included in or excluded from the chain via on-board mechanical switches or jumpers. This requires manual chain configuration or reconfiguration if the number or position of processors in the system is changed.