1. Field of the Invention
The present invention relates in general to bus error handling and in particular to handling bus errors during the boot process of a symmetric multiprocessor (SMP) system. Still more particularly, the present invention relates to handling of bus errors during the boot process without needing to correct the error before proceeding with the boot process.
2. Description of the Related Art
Since the early 1980s, the personal computer industry has grown by leaps and bounds. Consumers demand ever-greater operational speed from computer systems, and this demand is the driving force behind the rapid development and evolution of such systems. Initially, research and development focused on increasing the speed of the single processor used by early systems; more recently, substantial effort has gone into the utilization of multiple processors in a computer system to perform parallel processing, thereby increasing the speed of operations even further.
The use of multiprocessor systems clearly has increased the operational speed obtainable in computer systems, but the complexity they introduce has also created problems. Servers in particular may have hundreds of I/O devices (e.g., ISA devices, such as keyboards and pointing devices, and PCI devices, such as hard drives and Ethernet cards). PCI devices typically make up the majority of I/O devices in the system, and they reside in PCI slots. In addition, due to their often small size and ease of interchangeability, PCI devices are prone to damage and/or improper slot insertion, thereby rendering them non-functional or causing them to function improperly.
The PCI devices may be operating at any time, from start-up to shut-down of the server or other system in which they are installed. They may not be operating at all times, but when they are called upon for use, they must be functioning or the effectiveness of the system will be compromised. Conventionally, the PCI devices communicate via PCI adapters (also referred to as “I/O adapters” or “IOAs”). Multiple PCI adapters connect to a PCI host bridge via a PCI bus. Numerous load and store operations are communicated along the PCI bus, and errors that occur during the load and store operations need to be corrected for proper operation of the system.
To enhance system recoverability from errors that occur during load and store operations when the system is performing its normal functions (e.g., after the system has completed its boot process and is performing its intended functions), enhanced error handling (EEH) was developed by IBM (International Business Machines, Armonk, N.Y.). See U.S. Pat. No. 6,223,299 to Bossen et al., incorporated fully herein by reference. EEH resides in the PCI bridge chip(s) located within the server. Firmware provides a software interface to this hardware function. During normal operation, the EEH program continually monitors the PCI devices connected to the PCI bridge on which it resides and, if an error is detected during a load or store operation, it isolates the PCI slot in which the faulty PCI device is mounted and makes it appear to the rest of the system that the PCI slot is vacant. This assures that any attempts to perform load and store operations will not be directed to faulty PCI devices.
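By way of illustration only, the slot-isolation behavior described above can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not the EEH implementation: the names `Slot`, `Bridge`, and `VACANT_READ` are hypothetical, and returning an all-ones value for a load from an isolated or empty slot is an assumption made for the sketch.

```python
from dataclasses import dataclass, field

# Assumption for this sketch: a load from an empty (or isolated) slot
# returns an all-ones value.
VACANT_READ = 0xFFFFFFFF


@dataclass
class Slot:
    """Hypothetical model of one PCI slot behind a host bridge."""
    registers: dict = field(default_factory=dict)
    faulty: bool = False    # device causes a bus error when accessed
    isolated: bool = False  # set once the bridge has fenced off the slot


class Bridge:
    """Hypothetical bridge that isolates a slot after a load/store error."""

    def __init__(self, slots):
        self.slots = slots  # maps slot id -> Slot

    def load(self, slot_id, addr):
        slot = self.slots.get(slot_id)
        if slot is None or slot.isolated:
            return VACANT_READ       # slot is, or appears to be, vacant
        if slot.faulty:
            slot.isolated = True     # error detected: isolate the slot
            return VACANT_READ
        return slot.registers.get(addr, 0)

    def store(self, slot_id, addr, value):
        slot = self.slots.get(slot_id)
        if slot is None or slot.isolated:
            return False             # store is not directed to the device
        if slot.faulty:
            slot.isolated = True     # error detected: isolate the slot
            return False
        slot.registers[addr] = value
        return True
```

In this sketch, the first load or store directed to a faulty device marks its slot as isolated, and every later access to that slot is handled exactly as an access to an empty slot would be, while the remaining slots continue to operate.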
With the increase in system size and complexity, the time required to boot systems has also increased. Since these computer systems have become critical for business operation, their reliability and availability are increasingly important. For system boot (a.k.a. “cold boot”) it is therefore essential that all the components of the system are thoroughly tested to ensure their proper operation before loading/executing business applications. Accordingly, during a system boot, processes are performed that identify and initialize/configure each PCI adapter to assure proper operation. This added need to extensively test a computer system during the boot process adversely impacts boot time and makes it increasingly important to limit the number of boot operations that need to be performed.
When booting up a prior art computer system, a single faulty PCI adapter will cause the issuance of an error detect indication that will prevent the entire machine from proceeding further in the boot process. When a faulty PCI adapter prevents the booting of the machine, it must be determined which PCI adapter is defective before continuing. Since these systems often have several hundred PCI adapters installed, determining which one is faulty can be a difficult task; the error log must be examined and a determination must be made as to which of the many PCI adapters is the cause of the failure. Once the faulty PCI adapter is identified, the system must be powered down, the faulty PCI adapter removed and/or replaced, and then an attempt made to boot the machine again.
This repeated cycle of halting, checking, and rebooting when a faulty PCI adapter exists can cause great delays and significant inconvenience. Accordingly, it would be desirable to have a method by which faulty PCI adapters could be detected without preventing the booting of the machine.