The present invention generally relates to the data-processing field, and, more specifically, to the deployment of operating systems with detection of loop conditions.
Deployment of operating systems is a time-consuming activity, particularly in a large data-processing system comprising a high number of target machines (onto which the operating systems are to be deployed). A typical example is the deployment of operating systems onto target machines comprising bare-metal machines with no operating system installed thereon, or failed machines needing the operating system to be restored (commonly referred to as pristine machines).
The deployment of a specific operating system onto a generic target machine commonly involves booting the target machine over the network. Briefly, when the target machine is turned on, its firmware launches a network bootstrap loader that broadcasts a request for a network bootstrap program. The network bootstrap loader downloads the network bootstrap program from a source machine that has served the request. The network bootstrap program then downloads a deployment engine from the same source machine, which deployment engine, in turn, downloads the complete operating system.
The above-mentioned deployment process may also require multiple re-boots of the target machine. In this case, it is possible to upload an indication of a status of the deployment process onto the source machine; when the target machine re-boots, it retrieves the status of the deployment process (as reached before the re-boot) and then resumes the deployment process from that point (as described in U.S. Pat. No. 6,816,964, the entire disclosure of which is herein incorporated by reference).
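The resume-after-re-boot behaviour described above can be illustrated with a minimal sketch. It is not the method of the cited patent: the in-memory dictionary standing in for the status kept on the source machine, the function name `resume_deployment`, and the step names are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for the status uploaded onto the source machine:
# it maps a target-machine identifier to the number of steps already completed.
StatusStore = Dict[str, int]

# A step is (name, action, needs_reboot); all three fields are illustrative.
Step = Tuple[str, Callable[[], None], bool]


def resume_deployment(machine_id: str, steps: List[Step], store: StatusStore) -> List[str]:
    """Run one boot cycle and return the names of the steps executed in it."""
    start = store.get(machine_id, 0)  # retrieve the status reached before the re-boot
    executed: List[str] = []
    for index in range(start, len(steps)):
        name, action, needs_reboot = steps[index]
        action()
        executed.append(name)
        store[machine_id] = index + 1  # upload the status before any re-boot
        if needs_reboot:
            break  # the machine re-boots here; the next call resumes after this step
    return executed
```

Calling the function once per boot cycle with the same store executes only the steps that remain; once all steps are recorded as completed, a further call executes nothing.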
However, a failure may occur at different points of the deployment process. A technique for addressing this problem is to re-boot the target machine so as to repeat the whole deployment process. For example, this may be achieved by properly programming the network bootstrap program and the deployment engine; moreover, it is also possible to provide an additional processor on the target machine for detecting failures thereof (as described in GB-A-2446094, the entire disclosure of which is herein incorporated by reference).
Therefore, if the failure is due to a transient problem (for example, because of a network breakdown), the deployment process may succeed at its next attempt. However, if the failure is due to a persistent problem (for example, because of missing drivers in the operating system to be deployed), the target machine enters an infinite loop of re-boots.
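The difference between the two cases can be seen in a small simulation. This is only a sketch under stated assumptions: the function `count_reboots` and the parameter `simulation_cap` are hypothetical, and the cap exists solely so that the simulation terminates; the re-boot-on-failure technique described above has no such bound.

```python
from typing import Callable


def count_reboots(attempt_succeeds: Callable[[int], bool], simulation_cap: int) -> int:
    """Return the number of re-boots needed before the deployment succeeds.

    Returns -1 when the deployment is still failing after simulation_cap
    attempts. The cap is an artifact of this simulation only.
    """
    for attempt in range(simulation_cap):
        if attempt_succeeds(attempt):
            return attempt  # attempt 0 is the first run, so no re-boot was needed
    return -1


# Transient problem (assumed here to clear after two failed attempts).
def transient(attempt: int) -> bool:
    return attempt >= 2


# Persistent problem: every attempt fails, however many re-boots occur.
def persistent(attempt: int) -> bool:
    return False
```

With the transient problem the deployment succeeds after two re-boots, whereas with the persistent problem raising `simulation_cap` only lengthens the run without ever producing a success.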
In order to alleviate this problem, a time-out is generally defined for the deployment process. The time-out is selected with a length substantially longer than the time that is normally required to complete the deployment process (for example, 1 hour against 45 minutes, respectively); therefore, the deployment process is determined to be in error when it does not complete within this time-out.
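The time-out check can be sketched as follows. The 1-hour value comes from the example above; the function name `is_in_error`, its parameters, and the choice to treat the deployment as in error only once the elapsed time strictly exceeds the time-out are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Time-out taken from the example in the text (1 hour against a normal 45 minutes).
DEPLOYMENT_TIMEOUT = timedelta(hours=1)


def is_in_error(started_at: datetime, now: datetime, completed: bool,
                timeout: timedelta = DEPLOYMENT_TIMEOUT) -> bool:
    """Report an error when the deployment has not completed within the time-out."""
    if completed:
        return False
    # Assumption: the error is raised only once the time-out is strictly exceeded.
    return (now - started_at) > timeout
```

A check of this kind observes only the elapsed time: before the time-out expires it reports nothing, and after expiry it cannot distinguish a deployment that is looping from one that is merely running slowly.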
However, the above-described technique has several drawbacks. Indeed, any error can be detected only after the expiry of the entire time-out; this takes a relatively long time (to reduce the risk of detecting false errors), during which the target machine may have already executed many re-boots.
Moreover, this does not provide any indication at all about the cause of the error; therefore, the analysis of the error is quite difficult (especially when the target machine is remote and not directly accessible).
In any case, it is not possible to exclude the risk that the time-out is reached even when no actual error has occurred, such as when the deployment process is slower than expected (for example, because of high network traffic); this causes the detection of false errors (which would otherwise resolve automatically without any intervention).