Practically every type of computer and other programmable electronic device requires some form of initialization upon initial power-up. Such initialization, also referred to as a boot operation, typically involves the initialization of hardware and/or software to bring the device from a powered-off state to a normal operating state.
Whereas in some relatively simple devices the initialization process is so quick as to be indiscernible to a user, with more complex devices, the initialization process can take a substantial amount of time to complete. For some single-user computers, for example, a boot operation may take several minutes to complete. With more complex computers, such as servers and other multi-user computers, a boot operation can take even longer.
From the perspective of a computer, a boot operation typically involves the initialization and set up of various hardware devices such as processors, memory, and peripheral or input/output devices. A boot operation also typically includes initial software set up to load operating system software into the working storage of the computer. In addition, during a number of these initialization operations, various diagnostic operations may also be performed on the various hardware devices on the computer. For example, many integrated circuits, or chips, incorporate built-in self-test (BIST) capabilities to perform internal hardware diagnostics and report any errors that are detected in such circuits. In addition, diagnostic operations such as memory tests and error correction code (ECC) tests may be performed on memory devices to verify proper memory cell operation. Similar testing may be performed on interfaces, printed circuit boards, and other devices, e.g., using scan chain testing and other known techniques.
In addition to detecting failures, diagnostic operations performed during boot operations also typically enable the sources of such failures to be isolated to the problematic hardware devices, so that the computer can complete the boot operation and enter an operational state with those devices disabled or otherwise made inaccessible to the computer. As such, a computer may still be able to enter its operational state despite some hardware failures.
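By way of illustration only, the isolation behavior described above may be sketched as follows. The sketch assumes a hypothetical `Device` record with a `self_test` hook standing in for whatever diagnostic a real device exposes. The device names and test outcomes are invented for the example and do not reflect any particular firmware.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Device:
    name: str
    self_test: Callable[[], bool]  # hypothetical diagnostic hook; True means the test passed
    enabled: bool = True


def boot_with_isolation(devices: List[Device]) -> List[str]:
    """Run each device's diagnostic and disable any device that fails.

    Returns the names of the deconfigured devices. The boot continues
    with the remaining devices rather than aborting on the first failure.
    """
    failed = []
    for dev in devices:
        if not dev.self_test():
            dev.enabled = False  # make the device inaccessible to the system
            failed.append(dev.name)
    return failed


# Invented example inventory: one memory module fails its diagnostic.
devices = [
    Device("cpu0", lambda: True),
    Device("dimm3", lambda: False),
    Device("io_hub", lambda: True),
]

failed = boot_with_isolation(devices)
usable = [d.name for d in devices if d.enabled]
print("deconfigured:", failed)
print("available:", usable)
```

In this sketch the boot completes with `dimm3` disabled, which mirrors the idea that a computer may reach its operational state with reduced resources rather than failing outright.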
Performing diagnostic operations on hardware devices during a boot operation, however, comes with a performance penalty: each diagnostic operation adds to the time required to complete the boot operation.
Particularly in high availability environments, it is often desirable to minimize system down time, and as a consequence, minimize the amount of time required to boot or initialize a computer. For this reason, in many instances a decision is made to forgo many of the diagnostic operations that may be performed during a boot operation in favor of faster initialization.
In some high-performance computers, e.g., the iSeries and pSeries servers available from International Business Machines Corporation, a boot operation, commonly referred to as an initial program load (IPL) operation, may be performed in either a “fast” mode or a “slow” mode. In a fast boot operation, the primary focus is to get the computer to an operational state as quickly as possible. As a result, only minimal hardware diagnostics are run on the system, such as limited ECC checks, the writing of initial zero values to memory, and various BIST operations that are run by default by a number of integrated circuits at power-on. In a slow boot operation, on the other hand, full hardware diagnostics are run on every hardware device in the system. However, the full diagnostics performed during the slow boot operation may increase the overall boot time by 25 percent or more as compared to a fast boot operation.
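The difference between the two modes may be pictured, purely as a sketch, as a single list of boot steps in which each step is tagged as running in fast mode or only in slow mode. The step names loosely follow the examples given above. Every duration is an invented figure, chosen so that the slow boot comes out 25 percent longer than the fast boot, and the `BootMode`, `STEPS`, `plan`, and `boot_time` names are hypothetical.

```python
from enum import Enum


class BootMode(Enum):
    FAST = "fast"
    SLOW = "slow"


# (step name, duration in seconds, runs in fast mode?)
# All durations are invented for illustration only.
STEPS = [
    ("hardware and software initialization", 200, True),
    ("limited ECC checks", 15, True),
    ("zero-fill memory", 15, True),
    ("default power-on BIST", 10, True),
    ("full memory test", 35, False),
    ("interface and scan chain tests", 25, False),
]


def plan(mode: BootMode):
    """Return the (name, seconds) steps that run in the given mode."""
    return [
        (name, secs)
        for name, secs, in_fast in STEPS
        if in_fast or mode is BootMode.SLOW
    ]


def boot_time(mode: BootMode) -> int:
    """Total boot time in seconds for the given mode."""
    return sum(secs for _, secs in plan(mode))


fast = boot_time(BootMode.FAST)
slow = boot_time(BootMode.SLOW)
overhead_pct = 100 * (slow - fast) / fast
print(f"fast: {fast}s, slow: {slow}s, overhead: {overhead_pct:.0f}%")
```

With these assumed figures the fast boot takes 240 seconds and the slow boot 300 seconds, i.e., a 25 percent increase. Real systems will differ in both the steps performed and their cost.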
In many instances, the additional overhead of a slow boot operation is not deemed warranted, and as a result, the aforementioned computers are typically initialized using a fast boot operation whenever possible. In the event of a hardware failure during a fast boot operation, however, the failure will often be expressed in an unexpected manner, as any hardware diagnostics that might otherwise detect the failure in a particular device are typically not performed during the fast boot operation. As an example, an interface alignment procedure during a fast boot operation may fail because of a bad wire on an interface; however, because few diagnostics are run during the fast boot operation, the boot would simply fail unexpectedly. In addition, the defective part at issue in such a circumstance may or may not be identified depending on how the failure is expressed, i.e., based upon how the failure causes the unexpected result in the computer. In many instances, for example, the computer may simply lock up and become unresponsive.
To address this problem, it may be necessary to reboot, in “slow” mode, a computer that fails during a fast boot operation, so that full hardware diagnostics are run on every device in the system. Such a reboot may be performed manually, i.e., in response to user intervention, or may be automatically triggered as a result of a failure during a fast boot operation. In either case, in order to correctly identify and isolate a failure, the computer is typically required to be fully rebooted using the slow boot operation, thus increasing the time needed for the computer to initialize to an operational state.
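The cost of this fallback may be sketched as follows, under stated assumptions: the `fast_boot` and `slow_boot` functions are hypothetical stand-ins that simply return canned results, and the timings and the fault name are invented. The point illustrated is that the time spent on the failed fast attempt is added to the full slow boot time.

```python
from typing import NamedTuple, Optional, Tuple


class BootResult(NamedTuple):
    succeeded: bool
    seconds: int                   # time consumed by this attempt
    isolated_fault: Optional[str]  # part identified by diagnostics, if any


def fast_boot() -> BootResult:
    # Hypothetical: the boot locks up after 120 s and, with few
    # diagnostics run, no defective part is identified.
    return BootResult(succeeded=False, seconds=120, isolated_fault=None)


def slow_boot() -> BootResult:
    # Hypothetical: full diagnostics isolate the bad wire, the affected
    # interface is disabled, and the boot completes after 300 s.
    return BootResult(succeeded=True, seconds=300, isolated_fault="io_link wire 7")


def initialize_with_fallback() -> Tuple[BootResult, int]:
    """Try a fast boot first; on failure, fully reboot in slow mode.

    Returns the final attempt and the total elapsed seconds across attempts.
    """
    attempts = [fast_boot()]
    if not attempts[0].succeeded:
        attempts.append(slow_boot())
    total_seconds = sum(a.seconds for a in attempts)
    return attempts[-1], total_seconds


final, total_seconds = initialize_with_fallback()
print(final, total_seconds)
```

Under these assumptions the computer reaches its operational state after 420 seconds, which is longer than a slow boot alone would have taken.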
Another drawback to the use of fast and slow boot operations is the potential for generating misleading error logs. Error logs are typically generated in response to identified failures in a computer. In the event that a failure is detected during a fast boot operation, an error log may be generated for the failure. However, given that the failure may not have been detected by a diagnostic operation, the error log may be unable to accurately reflect the source of the failure. Moreover, if a failure occurs in a fast boot operation, and the system is then rebooted using the slow boot operation, the slow boot operation may create another error log related to the failure, which in the best case is an exact duplicate of the error log generated during the fast boot operation, and in the worst case identifies an entirely different source of the same failure. The presence of multiple error logs directed to the same failure can complicate diagnosis and repair of a computer by service personnel.
Therefore, a substantial need exists in the art for an improved manner of initializing a computer or other programmable electronic device, which provides faster initialization while ensuring appropriate detection and isolation of failures occurring during a boot operation.