1. Technical Field
The present application relates generally to firmware defects. More specifically, the present application is directed to a computer-implemented method and data processing system for preventing firmware defects from disturbing logic clocks, thereby improving system reliability.
2. Description of Related Art
Most data processing systems use mechanisms for detecting, and possibly diagnosing, errors, as well as mechanisms for recovering from an error. These two functions are usually distinct, requiring different hardware and software mechanisms.
The Reliability, Availability, and Serviceability (RAS) concept, as implemented in hardware and software, is directed to preventing or recognizing system failures (Reliability), the ability to keep the system functioning in the event of failure (Availability), and the ability to fix the failure in a non-disruptive way (Serviceability). RAS may be addressed at various stages of a system's life: during new product development, to diagnose design bugs; in manufacturing, to identify bad parts during the system build; and during operation, to catch errors while the system is running. RAS may also be directed to various types of failures, including system-level design oversights, logical errors, hard failures such as hardware faults, and soft errors such as data errors in memory or after data transfer due to external noise or circuit failure.
In some cases it is only necessary to recognize that an error has occurred. In others it is necessary to diagnose the error, that is, to specifically identify its source. Finally, in some cases it is desirable to remove or correct the error.
In large, scalable computer systems, high availability depends on the ability to detect and isolate failures. Once isolated, the failing component is fenced from the rest of the system. In order to determine the root cause and appropriate recovery or repair actions, data must be collected from the failing component while it is still in the failed state, without affecting the steady-state operation of the remaining functioning components in the machine.
First Failure Data Capture (FFDC) data may be analyzed in real time by problem analysis firmware, or transmitted to a remote support location and analyzed by a product support analyst. In designs which use Level Sensitive Scan Design (LSSD) latches, FFDC normally requires stopping the logic clocks to only the failed component and scanning out the state of the latches from only the failed component.
Stopping clocks and scanning out the latch values from a failed component while the rest of the machine continues running requires separation of the clocking boundaries and scan chains. Fine granularity in scan domains is desirable to reduce the payload for many test or initialization functions. In large, scalable, multi-node computer systems, the number of clocking boundaries and scan chains across all the chips in the system can be very large, numbering in the thousands. The control for these clocking and scan chain boundaries is often distributed across the chips in the system, because controlling them all independently from a single chip or controller would require a large number of I/O connections. System control firmware is then required to manage the distributed clocking and scan controls. The complexity of the system control firmware leaves it prone to defects, just like any other complicated software or firmware application.
If the clocks are inadvertently stopped to a component in the part of the machine which is still running, that part of the machine, or even the entire machine, will fail. Likewise, if a scan chain is accessed while its logic clocks are still running, the corresponding component or the entire machine will fail. Because a defect in the system control firmware could cause the clocks to be stopped incorrectly or a scan chain to be accessed incorrectly, it is desirable to have a method that prevents such firmware defects from disturbing components which are still running in the machine.
Other known solutions to this problem take one of two forms. The first uses a dedicated clock chip with hardware state machines that control the stopping and starting of each clock domain and provide scan clocks to a targeted chip or scan chain only if its logic clocks are turned off. The second relies on firmware to explicitly validate checkstop status before stopping clocks, and clock-stop status before scanning a chain.
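The firmware-validation approach described above can be sketched as follows. This is a minimal illustration only, not code from any actual system control firmware: the Component model, the UnsafeOperation exception, and the function names stop_clocks and scan_out are all hypothetical.

```python
from dataclasses import dataclass, field


class UnsafeOperation(Exception):
    """Raised when a clock or scan request would disturb running logic."""


@dataclass
class Component:
    # Hypothetical model of one clock domain together with its scan chain.
    name: str
    checkstopped: bool = False    # hardware has flagged this component as failed
    clocks_running: bool = True   # logic clocks are still active
    latches: list = field(default_factory=list)


def stop_clocks(component: Component) -> None:
    # Validate checkstop status before stopping clocks.
    if not component.checkstopped:
        raise UnsafeOperation(
            f"{component.name}: not checkstopped, refusing to stop clocks"
        )
    component.clocks_running = False


def scan_out(component: Component) -> list:
    # Validate clock-stop status before scanning the chain.
    if component.clocks_running:
        raise UnsafeOperation(
            f"{component.name}: logic clocks running, refusing to scan"
        )
    return list(component.latches)


if __name__ == "__main__":
    failed = Component("node0", checkstopped=True, latches=[1, 0, 1])
    stop_clocks(failed)
    print(scan_out(failed))
```

In a sketch like this, the checks help only when every caller goes through these two functions; any code path that manipulates the clock or scan controls directly bypasses them.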
There are multiple disadvantages of the known method of using a dedicated clock chip. First, it is an additional part number in the chipset that makes up the machine, which adds cost and increases the footprint of the computing building blocks. Second, it requires many connections between the clock chip and all the clock domains and scan chains across all the chips, which again adds packaging cost and introduces more possible points of failure.
If the scan clocks are driven independently to each scan chain, chips with multiple scan chains must internally wire multiple sets of scan clocks. Because it is desirable to scan at fast frequencies for chip-level testing, the scan clock wiring requires some amount of “balancing” in the design, so multiple sets of scan clocks greatly increase the design effort of the scan clock distribution. This problem could be alleviated by using a separate scan enable signal for each scan chain and gating the scan clocks locally in each chip for each chain, but a separate scan enable signal for each chain dramatically increases the already heavy connection requirements from the clock chip. Encoding the enable values may help reduce connections, but it also reduces flexibility in selecting multiple scan chains at the same time for efficient test and initialization sequencing.
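The trade-off between separate and encoded scan enables can be made concrete with a small calculation. The sketch below assumes a plain binary encoding in which one code is reserved for "no chain selected"; the text does not specify an encoding, so this choice and all function names are illustrative assumptions.

```python
def enable_lines_separate(num_chains: int) -> int:
    # One dedicated scan enable signal per chain.
    return num_chains


def enable_lines_encoded(num_chains: int) -> int:
    # Binary-encoded select: codes 0..num_chains, with 0 meaning "none".
    # The bit length of num_chains is the width needed to hold that range.
    return num_chains.bit_length()


def selectable_sets_separate(num_chains: int) -> int:
    # Any subset of chains can be enabled at the same time.
    return 2 ** num_chains


def selectable_sets_encoded(num_chains: int) -> int:
    # Only "none" or exactly one chain can be selected at a time.
    return num_chains + 1


if __name__ == "__main__":
    for n in (8, 64, 1024):
        print(
            n,
            enable_lines_separate(n),
            enable_lines_encoded(n),
            selectable_sets_encoded(n),
        )
```

Under these assumptions, eight chains need eight separate enable lines but only four encoded lines, while the encoded scheme can no longer enable several chains in one operation.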
Relying on firmware to explicitly validate checkstop status or clock state does not provide complete protection from firmware bugs. Adopting common coding practices can reduce the likelihood of bugs, but does not eliminate them. Moreover, when a firmware bug does cause the machine to fail, the failure often makes the hardware, rather than the firmware, appear to be at fault, which results in incorrect diagnostics and repair actions.