The present invention is directed to systems and methods for out-of-band booting of a computer. More specifically, without limitation, the present invention relates to computer-based remote boot systems and methods for booting a server computer having a hot swap interface and a primary communication channel used to provide access to client computers, without requiring shutdown of the server computer or communication with the server computer via the primary communication channel, using a secondary communication channel connected to a boot management system.
A server normally boots from its usual source, which can be a local disk or other boot device (for example, flash ROM) attached to it, or a remote boot device reached through a primary communication channel such as the ‘usual’ network (LAN) connection. Failure of a server can cause problems ranging from minor inconvenience to catastrophic losses of time and money.
Ideally, a server would never fail. In practice, however, server failures do occur. High availability is commonly expressed in terms of the time during which the server provides service to its clients, and is measured by the number of nines (‘9’). This number approximates the percentage of time per year during which the server provides the service. The following industry-wide figures for the ‘nine factors’ have been reported (see, e.g., “Providing Open Architecture High Availability Solutions,” February, 2001, p. 13, http://www.haforum.org).
Number of ‘9’s        Downtime per year    Typical application
3 nines (99.9%)       ~9 hours             Desktops
4 nines (99.99%)      ~2 hours             Enterprise server
5 nines (99.999%)     ~5 minutes           Carrier class server
6 nines (99.9999%)    ~31 seconds          Carrier switch equipment
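The relationship between the number of nines and yearly downtime can be checked with simple arithmetic. The following is a minimal sketch, assuming a 365-day year and reading "n nines" as an availability of 1 - 10^(-n); the function name and constant are illustrative and do not come from the cited report, whose rounded figures may not match this arithmetic exactly.

```python
# Illustrative arithmetic only: assumes a 365-day year and that
# "n nines" means an availability of 1 - 10**(-n).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Return the permitted downtime per year, in seconds, for n nines."""
    unavailability = 10.0 ** (-nines)
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in range(3, 7):
        secs = downtime_seconds_per_year(n)
        print(f"{n} nines: {secs:.1f} seconds ({secs / 3600:.3f} hours) per year")
```

Each additional nine reduces the permitted downtime by a factor of ten, which is why a slow or unavailable boot path has a large effect on the achievable availability class.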
The main purpose of alternate boot strategy technology is to increase the availability and serviceability of a server. The fault management of the server may consist of the following components:
        1. Detection: the fault is detected properly
        2. Diagnosis: the root cause of the fault is determined
        3. Isolation: the fault is contained so that the rest of the system is not affected
        4. Recovery: the system is restarted for further operation
        5. Repair: the faulty component is removed
Among the above-mentioned components, detection, diagnosis and isolation can be better performed through a remote boot from an OS of choice with proper diagnostic capability. This is because the server may be experiencing a fatal problem that can only be detected by offline diagnostics. An offline diagnostic is one that runs while the system is not operating normally. Because normal operation is suspended, the usual boot process is not possible, and an out-of-band or virtual boot mechanism according to the present invention is more appropriate and advantageous.
Sometimes recovery and repair of the faulty components involve a graceful shutdown of the resident OS and replacement of one or more OS components. In such cases, the OS may not be functional enough to upgrade itself from a remote location. A virtual boot or out-of-band boot protocol provides a solution to this situation that is not possible with prior art approaches.
If the server's problems are caused by faulty behavior of its usual boot process, then an alternate boot path is mandatory to achieve the desired number of nines and to reduce downtime.
To guard against common server failures, a backup policy for the boot procedure is used. Two common techniques are providing an alternative local boot path and performing a remote boot using the server's primary communication channel with its clients.
As depicted in FIG. 1, a typical server computer 100 uses a local hard drive 120 as the source for the boot image used to boot the server. A typical alternative boot path could include use of a locally connected drive 130 loaded with removable media, such as a magnetic or optical disk containing a boot image, or use of a second hard (fixed magnetic media) drive or optical fixed media drive. The requirement that an administrator be physically present at the server to load and/or change the removable media limits the usability of this approach. The use of a local fixed drive requires a local copy of the boot image and may require local supervision by an administrator via input devices such as keyboard 140 and mouse 150 and output devices such as monitor 160.
Another alternative approach to booting a server 100, as depicted in FIG. 2, involves use of a boot image stored on a remote data storage 210 connected to the server's primary communication channel (e.g., Ethernet 230) with its clients 220, or to a secondary communication channel (e.g., secondary network 240). However, this mechanism does not allow a faulty component to be upgraded from multiple mass-storage images. For example, to rectify some problems, the OS must be upgraded, and this requires a series of images stored on multiple removable magnetic and/or optical disks. Because the boot image is a static image, the server cannot refer to other images. Also, in some cases, diagnosing a typical problem may require a series of tests to be executed from different mass-storage devices. A standard protocol for booting from a remote image does not allow this to occur.
In this method, a boot image is prepared and made accessible to the server 100 (through either the in-band channel 230 or the out-of-band channel 240) at any time from a centralized location 210. Several prior art protocols, such as PXE (Preboot Execution Environment), already support this. However, this mechanism cannot upgrade a faulty component from multiple mass-storage images. For example, to rectify some problems, the OS must be upgraded, and this requires a series of images stored on multiple CD-ROMs or floppy disks. The boot image is a static image, and the server cannot refer to other images because the standard prior art protocols do not define a way to do so. Also, in some cases, diagnosing a typical problem may require a series of tests to be executed from different mass-storage devices. A protocol for booting from a single remote image does not allow this to occur.
The out-of-band systems and methods according to the present invention avoid these limitations. A boot device is implemented at the server side. This device is presented as a ghost device or a virtual device to the server and its software components (such as the BIOS and OS). The device is presented early in the power-on process of the server, so the server can find it as a potential mass-storage device to boot from. The main advantage of such a mechanism is that, once the boot process starts, it can go on to reference an unlimited number of other mass-storage devices and images. As a result, the server can be repaired (or upgraded) easily.
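The idea of a virtual device that can move from one image to another after boot has started can be illustrated with a small model. The following is only a sketch under stated assumptions, not the implementation of the invention: the class VirtualBootDevice, its methods, the 512-byte block size, and the image names are all hypothetical, and in-memory byte strings stand in for images held by a remote boot management system.

```python
from dataclasses import dataclass, field
from typing import Dict, List

BLOCK_SIZE = 512  # assumed block size, chosen for illustration only


@dataclass
class VirtualBootDevice:
    """Hypothetical model of a virtual mass-storage device.

    The server sees a single device; the image behind it can be
    swapped while the boot or repair sequence is in progress.
    """
    images: Dict[str, bytes]  # image name -> contents (stand-in for remote storage)
    active: str               # name of the image currently presented
    history: List[str] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.active not in self.images:
            raise KeyError(f"unknown image: {self.active}")
        self.history.append(self.active)

    def read_block(self, lba: int) -> bytes:
        """Return one zero-padded block of the active image."""
        start = lba * BLOCK_SIZE
        block = self.images[self.active][start:start + BLOCK_SIZE]
        return block.ljust(BLOCK_SIZE, b"\x00")

    def switch_image(self, name: str) -> None:
        """Present a different image through the same virtual device."""
        if name not in self.images:
            raise KeyError(f"unknown image: {name}")
        self.active = name
        self.history.append(name)


# Hypothetical usage: boot a diagnostic image, then step through upgrade images.
device = VirtualBootDevice(
    images={"diag": b"DIAG", "upgrade-1": b"UPG1", "upgrade-2": b"UPG2"},
    active="diag",
)
first = device.read_block(0)
device.switch_image("upgrade-1")
device.switch_image("upgrade-2")
```

In this model the contrast with a static remote boot image is that the set of images is not fixed to the one selected at boot time; the same device identity persists while the backing image changes.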