1. Field of the Invention
The field of the invention is data processing, or, more specifically, methods, systems, and products for handling fatal computer hardware errors.
2. Description of Related Art
The development of the EDVAC computer system of 1948 is often cited as the beginning of the computer era. Since that time, computer systems have evolved into extremely complicated devices. Today's computers are much more sophisticated than early systems such as the EDVAC. Computer systems typically include a combination of hardware and software components, application programs, operating systems, processors, buses, memory, input/output devices, and so on. As advances in semiconductor processing and computer architecture push the performance of the computer higher and higher, more sophisticated computer software has evolved to take advantage of the higher performance of the hardware, resulting in computer systems today that are much more powerful than just a few years ago.
One of the areas of computer technology that has seen considerable advancement is the handling of fatal computer hardware errors. A hardware error is a behavior related to a malfunction of a hardware component in a computer system. The hardware components, typically chips in a chipset, contain error detection mechanisms that can detect when a hardware error condition exists.
A chipset is a group of integrated circuits (‘chips’) that are designed to work together, often marketed as a single product. The manufacturer of a chipset can be, and often is, independent from the manufacturer of the motherboard. Examples of motherboard chipsets include NVIDIA's nForce chipset and VIA Technologies' KT800, both for AMD processors, or one of Intel's many chipsets. When discussing personal computers based on contemporary Intel Pentium-class systems, the term ‘chipset’ often refers to the two main bus adapters, the Northbridge and the Southbridge. In computer technology generally, the term ‘chipset’ is often used to refer to the specialized motherboard chips on a computer or expansion card. The term ‘chipset’ was also widely used in the 1980s and 1990s for the custom audio and graphics chips in home computers, game consoles, and arcade game hardware of the time. Examples include the Commodore Amiga's Original Chip Set and SEGA's System 16 chipset. In this paper, the term ‘chipset’ is used to refer to principal integrated circuit components of a computer, including processors, memory modules, and bus adapters.
Computer systems produced since the late 1980s often share commonly used chipsets, even across widely disparate computing specialties—for example, the NCR 53C9x, a low-cost chipset implementing a SCSI interface to storage devices and the like, could be found not only in Unix machines (such as the MIPS Magnum), but also in embedded devices and personal computers.
The story of modern servers is as much the story of specialized chipsets as it is the story of specialized processors and motherboards. The chipset tends to imply the motherboard; therefore, any two server boards with the same chipsets typically are functionally identical unless a vendor adds features to those provided by the chipset or removes support for certain chipset features. Vendors might, for example, add chips to support additional features, such as a second 10 Mbps Ethernet, 100 Mbps Fast Ethernet, or 1000 Mbps Gigabit Ethernet port.
The chipset typically contains the processor bus adapter, often referred to as the ‘front-side bus,’ memory controllers, I/O controllers, memory modules, and more. Memory controllers may be integrated into a bus adapter. The AMD Opteron processors for servers and workstations, for example, incorporate memory controllers; chipsets that support Opteron processors (or other chipsets that integrate memory controllers into a bus adapter), therefore, typically do not contain separate memory controller chips.
In a typical server, all the principal integrated circuits on the motherboard are contained within the chipset. In a typical computer, chips of a chipset implement connections between processors and everything else. In most cases, the processors cannot communicate with memory modules, adapter boards, peripheral devices, and so on, without going through other chips of a chipset.
Although server chipsets are designed to perform the same types of tasks as desktop chipsets, the feature set included in a typical server chipset emphasizes stability, whereas the feature set of a typical desktop chipset emphasizes performance. Server-specific chipset features such as support for error-correcting code (‘ECC’) memory, advanced error correction for memory, system management, and a lack of overclocking options demonstrate the emphasis on stability.
Components of a computer traditionally concerned with hardware errors include interrupt handler modules in firmware or in an operating system. Firmware is a computer system-level software module stored in non-volatile memory so as to be available to run promptly when power is first applied to the computer, before the operating system is booted. Firmware provides boot routines, hardware error handling routines, and certain low-level I/O routines. A very common example of firmware is the so-called Basic Input-Output System (‘BIOS’). In a traditional architecture for handling computer hardware errors, a chip, upon detecting an error in chip operations, signals an error by throwing an interrupt on a hardwired interrupt signal line to a programmable interrupt controller. The programmable interrupt controller then signals a processor of the interrupt, and the processor vectors the interrupt to an interrupt handling routine (called an ‘interrupt handler’) in BIOS or in the operating system.
Hardware errors that cause interrupts to be thrown may be correctable, either by the hardware itself, or by a BIOS or operating system error handler, or by a user-level application routine registered with the operating system as an exception handler. Hardware errors can be classified as either corrected errors or uncorrected errors. A corrected error is a hardware error condition that has been corrected by computer hardware or by computer firmware by the time the operating system is notified about the existence of the error condition. An uncorrected error is a hardware error condition that cannot be corrected by the hardware or by the firmware. Uncorrected errors are either fatal or non-fatal. A fatal hardware error is an uncorrected or uncontained error condition that is determined to be unrecoverable by the hardware. When a fatal uncorrected error occurs, the system is halted to prevent propagation of the error. A non-fatal hardware error is an uncorrected error condition from which the operating system can attempt recovery by trying to correct the error. Examples of hardware errors that can be fatal include divide-by-zero faults, bounds check faults, invalid opcode faults, memory segment overrun faults, invalid task state faults, stack faults, page faults, and others as known to those of skill in the art.
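For purposes of illustration only, the classification described above may be sketched as a small decision routine. The names ErrorClass, classify, and response are hypothetical and are not taken from any particular operating system or firmware; the 'log' response for corrected errors is likewise an assumption, since the text above specifies only the halt and the attempted recovery.

```python
from enum import Enum


class ErrorClass(Enum):
    """Hypothetical labels for the three classes described in the text."""
    CORRECTED = "corrected"
    UNCORRECTED_NON_FATAL = "uncorrected, non-fatal"
    UNCORRECTED_FATAL = "uncorrected, fatal"


def classify(corrected: bool, recoverable: bool) -> ErrorClass:
    """Classify an error condition.

    corrected:   hardware or firmware fixed the condition before the
                 operating system was notified.
    recoverable: the operating system can attempt recovery.
    """
    if corrected:
        return ErrorClass.CORRECTED
    if recoverable:
        return ErrorClass.UNCORRECTED_NON_FATAL
    return ErrorClass.UNCORRECTED_FATAL


def response(error_class: ErrorClass) -> str:
    """Return the action associated with each class."""
    if error_class is ErrorClass.CORRECTED:
        return "log"  # assumption: corrected errors are merely recorded
    if error_class is ErrorClass.UNCORRECTED_NON_FATAL:
        return "attempt recovery"
    return "halt"  # fatal errors halt the system to stop propagation
```

In this sketch an error that is neither corrected nor recoverable always maps to a halt, which mirrors the statement that a fatal uncorrected error halts the system to prevent propagation of the error.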
It is useful to distinguish the source of a hardware error report and the cause of the hardware error. A hardware error source is any chip that alerts the operating system to the presence of an error condition. Examples of hardware error sources include:
    Processor machine check exception (for example, MC#)
    Chipset error message signals (for example, SCI, SMI, SERR#, MCERR#)
    I/O bus error reporting (for example, PCI Express root port error interrupt)
    I/O device errors
A single hardware error source might handle aggregate error reporting for more than one type of hardware error condition. For example, a processor's machine check exception typically reports processor errors, cache and memory errors, and system bus errors. Note that the system management interrupt (‘SMI’) is usually handled by firmware; the operating system typically does not handle SMI.
A hardware error source is typically represented by the following:
    One or more hardware error status registers.
    One or more hardware error configuration or control registers.
    A signaling mechanism to alert the operating system to the existence of a hardware error condition.
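By way of example and not limitation, the three elements listed above may be modeled as a single record. The class name HardwareErrorSource, its fields, and the source name 'pcie_root_port' are hypothetical; real status and control registers have device-specific layouts that this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class HardwareErrorSource:
    """Hypothetical model of a hardware error source."""
    name: str
    # One or more error status registers, modeled as plain integers.
    status_registers: List[int] = field(default_factory=lambda: [0])
    # One or more configuration or control registers.
    control_registers: List[int] = field(default_factory=lambda: [0])
    # Signaling mechanism; None models a source with no explicit signal.
    signal: Optional[Callable[[str], None]] = None

    def record_error(self, register_index: int, bit: int) -> None:
        """Set a status bit and alert the operating system if possible."""
        self.status_registers[register_index] |= 1 << bit
        if self.signal is not None:
            self.signal(self.name)

    def has_error(self) -> bool:
        """Report whether any status bit is currently set."""
        return any(self.status_registers)


# Usage: the 'operating system' here is just a list collecting alerts.
alerts: List[str] = []
source = HardwareErrorSource("pcie_root_port", signal=alerts.append)
source.record_error(register_index=0, bit=3)
```

A source constructed without a signal callable still records status bits, which corresponds to the case in which the operating system must inspect the status registers itself.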
In some situations, there is not an explicit signaling mechanism and the operating system must poll the error status registers to test for an error condition. However, polling can only be used for corrected error conditions since uncorrected errors require immediate attention by the operating system.
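A minimal sketch of such polling follows, under the assumption that status registers can be read as integers and cleared by writing zero. The function name poll_corrected_errors is hypothetical, and actual hardware may use write-one-to-clear or other semantics.

```python
from typing import List, Tuple


def poll_corrected_errors(status_registers: List[int]) -> List[Tuple[int, int]]:
    """Scan status registers for set bits.

    Returns (register index, register value) for every register that
    reports an error, and clears each such register so that the next
    poll reports only new conditions.
    """
    found = []
    for index, value in enumerate(status_registers):
        if value != 0:
            found.append((index, value))
            status_registers[index] = 0  # assumption: write zero to clear
    return found


# Usage: the second register has bit 2 set before the first poll.
registers = [0, 0b0100, 0]
first_pass = poll_corrected_errors(registers)
second_pass = poll_corrected_errors(registers)
```

Because a condition is noticed only on the next pass of the poll, the delay between the error and its detection is bounded by the polling interval, which is why this approach suits only corrected errors.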
Traditionally, interrupts were provided to processors through sideband signals that were not mixed into in-band buses used for data moving and instruction fetching in the system. An ‘in-band bus’ is a bus that carries the computer program instructions and data for carrying out principal data processing on the computer—as distinguished from a ‘sideband bus’ that only carries instructions and data for service communications among peripheral service components of the computer and processors, memory, and other chips of a chipset. A more recent variation on this is that lower priority interrupts were handled as message interrupts, but very high priority interrupts such as SMI and NMI were still handled as sideband signals directly wired to a processor. As such, the highest priority interrupts (those involving unrecoverable errors) were guaranteed to be handled by the processor very quickly without data integrity exposures to the system.
Recently, many computer systems have begun using message interrupts for these high priority interrupts, mixing them in with the system's data and instruction buses, the in-band buses. This approach has some merits, but it opens data integrity holes: the high priority interrupt takes time to reach the processor, and potentially corrupted data continues to flow through the system while the enqueued message interrupt makes its way to the processor for service.
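The exposure described above may be illustrated with a toy model in which the in-band bus is a simple first-in, first-out queue. The strict FIFO ordering, the 'INTERRUPT' marker, and the function name transactions_before_interrupt are all assumptions made for illustration; real in-band buses have more elaborate ordering and priority rules.

```python
from collections import deque
from typing import Deque


def transactions_before_interrupt(queue: Deque[str]) -> int:
    """Count data transactions delivered before the interrupt message.

    Items are consumed from the front of the queue until the interrupt
    message is reached; each item consumed before it is a transaction
    that may carry corrupted data.
    """
    delivered = 0
    while queue:
        item = queue.popleft()
        if item == "INTERRUPT":
            return delivered
        delivered += 1
    return delivered


# Usage: three data transactions sit ahead of the interrupt message.
in_band = deque(["data", "data", "data", "INTERRUPT", "data"])
exposed = transactions_before_interrupt(in_band)
```

In this model a sideband signal would correspond to an exposure of zero, because the interrupt would not wait behind any queued transaction.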
One traditional way of handling fatal computer hardware errors in systems that use messaged interrupts on in-band buses is to allow some unrecoverable I/O errors, such as, for example, PCI SERR, PERR, and target aborts, to be handled by system software in the form of an SMI or NMI handler. Other unrecoverable errors lead to operating system ‘blue screens’ because they immediately halt the system by causing an NMI or machine check. In such systems, the machine is typically left in the failed state, that is, frozen with a blue screen, for failure analysis until the computer is manually rebooted.
Another traditional way of handling fatal computer hardware errors in computers that use HyperTransport in-band buses is to design a system so that the system goes into HyperTransport sync flood on some or all unrecoverable errors. Such systems attempt to use so-called ‘sticky bits’ in registers that are used for the identification of unrecoverable errors. Such systems reboot the system on detecting HyperTransport sync flood and depend on the BIOS being able to run successfully after the reboot to read the sticky bits and diagnose the problem. Such systems may require an additional reboot so they can take action on what they learn from the sticky bits to configure the system for reliable operation after the failure.
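The role of sticky bits in this approach may be sketched as follows. The class StickyRegister, the bit assignments in ERROR_NAMES, and the function diagnose_after_reboot are hypothetical; the sketch assumes only that sticky bits survive a reset while ordinary register state does not.

```python
from typing import Dict, List


class StickyRegister:
    """Hypothetical error register with sticky and ordinary state."""

    def __init__(self) -> None:
        self.sticky = 0    # assumption: preserved across a reset
        self.volatile = 0  # cleared by a reset

    def latch_error(self, bit: int) -> None:
        """Record an unrecoverable error in both kinds of state."""
        self.sticky |= 1 << bit
        self.volatile |= 1 << bit

    def reset(self) -> None:
        """Model a reboot: only the ordinary state is cleared."""
        self.volatile = 0


def diagnose_after_reboot(reg: StickyRegister, names: Dict[int, str]) -> List[str]:
    """Return the names of errors whose sticky bits are still set."""
    return [name for bit, name in sorted(names.items())
            if reg.sticky & (1 << bit)]


# Hypothetical bit assignments, for illustration only.
ERROR_NAMES = {0: "multi-bit memory error", 1: "link error"}

reg = StickyRegister()
reg.latch_error(1)  # an unrecoverable error occurs
reg.reset()         # the system reboots after sync flood
found = diagnose_after_reboot(reg, ERROR_NAMES)
```

The diagnosis step runs only after the reset, which reflects the dependence of this approach on the BIOS being able to run successfully after the reboot.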
Both of these traditional solutions require that the system processors be able to run on the file system either before or after the system is rebooted. There is always a risk on fatal hardware errors, however, that the system will not reboot at all after the failure. In addition, the second solution regarding HyperTransport buses often requires that the system be rebooted more than once after the failure. This is very confusing to users, who often interpret such additional reboots as the system ‘thrashing’ itself after a failure. In addition, some errors, such as multi-bit memory errors, link errors in HyperTransport buses, PCI errors, and the like, may be so severe as to cause a firmware or operating system interrupt handler to be unable to run at all. This means a fatal hardware error does not get logged and diagnosed in a detailed, meaningful fashion. All that is known is that the system crashed. Handling fatal hardware errors through firmware or operating system interrupt handlers therefore always bears at least some risk of data loss.