As computers have become more sophisticated and particularly as they have been adapted to handle data from a wider range of input/output devices, the booting process has become increasingly complex. While mechanisms have been developed for minimising the effect of boot failure, circumstances can still arise which result in the system crashing or hanging, not least because the process necessarily involves key system operating procedures.
Such circumstances can arise, for example, when a new input/output device has been added to the system, requiring a new device driver to be loaded. If the device driver encounters hardware or operating conditions not anticipated by its designer, the booting process can fail, leading to a system crash and/or to a hang condition in which the computer executes a continuous loop.
One approach to the problem is described in U.S. Pat. No. 5,564,054. In this approach, a set of log in files which is not accessible to a user, and is therefore not susceptible to unexpected modification, is maintained to define a minimum system configuration. After a predetermined number of unsuccessful attempts to load a set of log in files supplied by a user, the system switches to a boot mode in which the log in files defining the minimum configuration are loaded.
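By way of illustration only, the fallback behaviour described above may be sketched as follows. The sketch is not taken from the cited patent: the threshold value, the names `BootState`, `select_config` and `record_boot_result`, and the two configuration labels are all assumptions introduced here.

```python
from dataclasses import dataclass

# Assumed value; the text says only "a predetermined number".
MAX_FAILED_ATTEMPTS = 3

# Hypothetical labels for the two sets of log in files.
USER_CONFIG = "user-supplied"
MINIMUM_CONFIG = "protected-minimum"


@dataclass
class BootState:
    """Counter that is assumed to persist across boot attempts."""
    failed_attempts: int = 0


def select_config(state: BootState) -> str:
    """Choose which set of log in files the next boot should load."""
    if state.failed_attempts >= MAX_FAILED_ATTEMPTS:
        return MINIMUM_CONFIG
    return USER_CONFIG


def record_boot_result(state: BootState, succeeded: bool) -> None:
    """Update the counter after a boot attempt completes or fails."""
    if succeeded:
        state.failed_attempts = 0
    else:
        state.failed_attempts += 1


if __name__ == "__main__":
    state = BootState()
    for attempt in range(1, 5):
        config = select_config(state)
        print(f"attempt {attempt}: loading {config} files")
        # Simulate the user-supplied files failing on every attempt.
        record_boot_result(state, succeeded=(config == MINIMUM_CONFIG))
```

In this sketch the user-supplied files are tried until the counter reaches the threshold, after which the protected minimum set is selected.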
Such an approach leaves the user with a basic minimum configuration which permits the cause of the failure to be investigated. However, a number of functions not responsible for the failure are also excluded from the minimum configuration. While this may be no more than inconvenient in an individual installation, the loss of such functions can have serious consequences in a system where the computer interacts with other computers, as in a network.
In a typical network, individual computers operate under the control of an operating system such as AIX (Trade Mark of IBM Corporation). In order to ensure that all functions are available in associated computers when called for, such an operating system employs a booting system which scans the adapter cards providing functions in the associated computer and loads the drivers appropriate to support those adapter cards. If all goes well, the loading proceeds without incident. However, if unforeseen circumstances are encountered, or if for some other reason the booting process fails, the result can be a crashed system, causing loss of all services from the affected machine. If the system is configured to re-boot automatically in the event of a crash, the re-boot can encounter the same load problem, resulting in a continuous loop of starting the boot-up, loading the (failing) driver, crashing, re-starting the boot-up, and so on. This endless looping condition also renders the computer unavailable to the rest of the network.
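The looping condition described above may be illustrated by the following simplified simulation. It is a sketch only and does not represent AIX or any actual boot code: the adapter names, the `DriverCrash` exception, the function names and the `max_reboots` cap (included solely so that the simulation terminates) are assumptions introduced here.

```python
class DriverCrash(Exception):
    """Stands in for a system crash caused by loading a driver."""


def load_driver(adapter: str, faulty: set) -> None:
    """Pretend to load the driver for one adapter card."""
    if adapter in faulty:
        raise DriverCrash(adapter)


def boot_once(adapters: list, faulty: set) -> list:
    """Scan the adapter cards in order and load a driver for each."""
    loaded = []
    for adapter in adapters:
        load_driver(adapter, faulty)
        loaded.append(adapter)
    return loaded


def boot_with_auto_reboot(adapters: list, faulty: set, max_reboots: int):
    """Re-boot after every crash, recording which driver caused it.

    A real machine would have no cap; max_reboots exists only so that
    the simulation stops.
    """
    crashes = []
    for _ in range(max_reboots):
        try:
            boot_once(adapters, faulty)
            return True, crashes
        except DriverCrash as exc:
            crashes.append(str(exc))
    return False, crashes


if __name__ == "__main__":
    adapters = ["scsi", "ethernet", "token-ring"]  # hypothetical cards
    ok, crashes = boot_with_auto_reboot(adapters, {"ethernet"}, max_reboots=4)
    print(ok, crashes)
```

Because nothing changes between one attempt and the next, every re-boot in the simulation fails at the same driver, and the adapters scanned after it are never reached.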
It is accordingly an object of the present invention to provide a system and method for boot failure recovery in a digital computer which addresses this problem.