1. Field of the Invention
This invention relates generally to improving operational efficiency in computer systems and, more particularly, to providing computers with the ability to automatically attempt to reboot after a failed boot attempt.
2. Background of the Related Art
This section is intended to introduce the reader to various aspects of art which may be related to various aspects of the present invention which are described and/or claimed below. This discussion is believed to be helpful in providing the reader with background information to facilitate a better understanding of the various aspects of the present invention. Accordingly, it should be understood that these statements are to be read in this light, and not as admissions of prior art.
Since the introduction of the first personal computer (“PC”) over 20 years ago, technological advances to make PCs more useful have continued at an amazing rate. Microprocessors that control PCs have become faster and faster, with clock speeds eclipsing one gigahertz (one billion cycles per second) and continuing well beyond.
Productivity has also increased tremendously because of the explosion in development of software applications. In the early days of the PC, people who could write their own programs were practically the only ones who could make productive use of their computers. Today, there are thousands and thousands of software applications ranging from games to word processors and from voice recognition to web browsers.
In addition to improvements in PC hardware and software generally, the technology for making computers more useful by allowing users to connect PCs together and share resources between them has also seen rapid growth in recent years. This technology is generally referred to as “networking.” In a networked computing environment, PCs belonging to many users are connected together so that they may communicate with each other. In this way, users can share access to each other's files and other resources, such as printers. Networked computing also allows users to share internet connections, resulting in significant cost savings. Networked computing has revolutionized the way in which business is conducted across the world.
Modern computer networks come in all shapes and sizes. At one end of the spectrum, a small business or home network may include a few client computers connected to a common server, which may provide a shared printer and/or a shared internet connection. On the other end of the spectrum, a global company's network environment may require interconnection of hundreds or even thousands of computers across large buildings, a campus environment or even between groups of computers in different cities and countries. Such a configuration would typically include a large number of servers, each connected to numerous client computers.
Further, the servers and clients in a larger network environment could be connected in any of a virtually unlimited number of topologies that may include local area networks (“LANs”), wide area networks (“WANs”) and metropolitan area networks (“MANs”). In these larger networks, a problem with any one server computer (for example, a failed hard drive, a failed network interface card or an operating system lock-up, to name just a few) has the potential to interrupt the work of a large number of workers who depend on network resources to get their jobs done efficiently. Needless to say, companies devote a lot of time and effort to keeping their networks operating trouble-free to maximize productivity.
An important aspect of efficiently managing a large computer network is to maximize the amount of analysis and repair that can be performed without intervention by the network management team that maintains the network. Operations that require manual intervention, such as manually rebooting a group of computers that have failed an initial boot operation and have become hung, are extremely labor intensive and time consuming.
An example of such a situation may arise when the software of a large number of geographically dispersed server computers is upgraded, requiring all of the upgraded server computers to reboot. Upon reboot, each server has a list of storage devices from which it will attempt to boot (that is, load an operating system (“OS”)) in a specified order. This list may be referred to as a “standard boot order” list. For example, the standard boot order list for a given server may be as follows: network drive, floppy drive, CD ROM drive, system hard drive. The standard boot order is configurable by the network management team and typically reflects the preferred boot order based on the topology and capabilities of the individual network. Many modern servers have the capability to boot over a network connection using an industry standard such as the Pre-boot eXecution Environment (“PXE” (pronounced “pixie”)) or the like. Other standards that support booting over a network connection include iSCSI and BOOTP/TFTP.
When a server is rebooted, its basic input-output system (“BIOS”), which is the low-level programming that initializes the computer, examines the standard boot order list and attempts to boot from the first device on the list. In so doing, the BIOS executes a portion of BIOS code commonly referred to as the interrupt 19 (or “INT 19”) handler. The INT 19 handler attempts to read the master boot record (“MBR”) of the first device in the server's standard boot order list. If the MBR is valid, the server will attempt to boot from that device. If the MBR is not valid or if the server is unable to boot from the specified device for other reasons, program flow returns to a section of the BIOS typically referred to as the interrupt 18 (or “INT 18”) handler, which signifies that a boot attempt has failed. Examples of reasons that may prevent a proper boot from a selected medium even if the MBR is valid are (1) no system files on the selected medium, (2) a corrupt partition table on the selected medium, or (3) no partition table on the selected medium.
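The text above does not state what makes an MBR valid. By way of illustration only, one conventional check on PC-compatible systems is that the 512-byte first sector ends with the boot signature bytes 0x55 and 0xAA. The following is a minimal sketch of that single check; the function name mbr_has_boot_signature and the sample sectors are hypothetical, and an actual BIOS may apply additional or different tests.

```python
SECTOR_SIZE = 512              # size of the first sector on a conventional PC disk
BOOT_SIGNATURE = b"\x55\xaa"   # signature stored in the last two bytes (offsets 510 and 511)


def mbr_has_boot_signature(sector: bytes) -> bool:
    """Return True if the sector is 512 bytes long and ends with 0x55 0xAA.

    Hypothetical helper: it models only the signature test, not the other
    conditions (system files, partition table) that can still prevent a boot.
    """
    return len(sector) == SECTOR_SIZE and sector[510:512] == BOOT_SIGNATURE


# Illustrative sectors: one zero-filled, one carrying the signature.
blank_sector = bytes(SECTOR_SIZE)
signed_sector = bytes(510) + BOOT_SIGNATURE

print(mbr_has_boot_signature(blank_sector))
print(mbr_has_boot_signature(signed_sector))
```

A sector that passes this test may still fail to boot for the reasons listed above, such as a missing or corrupt partition table.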
The BIOS continues execution by attempting to boot from the next device on the standard boot order list. Under control of the BIOS, the server continues to try to boot from each device in the standard boot order list until it finds a device that has a valid MBR and no other issues preventing a proper boot. If the BIOS reaches the end of the standard boot order list and no successful boot has occurred, the server hangs or ceases operation. A message such as “No System Disk” or the like is displayed and user intervention is required to retry booting the server.
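The conventional boot sequence described above can be summarized in a short simulation. This is a sketch under stated assumptions, not BIOS code: the BootDevice type, its fields and the function names try_boot and walk_boot_order are hypothetical, and the INT 19 and INT 18 handlers are modeled simply as a successful or failed return value.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class BootDevice:
    """Hypothetical model of one entry in the standard boot order list."""
    name: str
    mbr_valid: bool
    bootable: bool  # False models, e.g., missing system files or a bad partition table


def try_boot(device: BootDevice) -> bool:
    """Model one INT 19 attempt; False models a fall-through to INT 18."""
    return device.mbr_valid and device.bootable


def walk_boot_order(boot_order: Sequence[BootDevice]) -> Optional[str]:
    """Try each device in order and return the name of the first that boots.

    Returning None models the end of the list being reached with no
    successful boot, at which point the server hangs ("No System Disk").
    """
    for device in boot_order:
        if try_boot(device):
            return device.name
    return None


# Example boot order from the text; the validity flags are illustrative.
boot_order = [
    BootDevice("network drive", mbr_valid=False, bootable=False),
    BootDevice("floppy drive", mbr_valid=False, bootable=False),
    BootDevice("CD ROM drive", mbr_valid=True, bootable=False),
    BootDevice("system hard drive", mbr_valid=True, bootable=True),
]

print(walk_boot_order(boot_order))      # prints "system hard drive"
print(walk_boot_order(boot_order[:3]))  # prints "None": every device failed
```

The second call illustrates the hung state: once the list is exhausted, nothing in this sequence causes another attempt without user intervention.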
As is readily apparent, user intervention in the form of physically going from computer to computer and rebooting each hung server manually by providing a disk with a valid OS could be very time consuming. This is particularly true when servers are operating in a “headless” environment. Headless servers are servers that are not equipped with display monitors, mice and keyboards. Rebooting headless servers may involve connecting additional hardware to the server or accessing the server through a remote server management tool. In the meantime, users of the network may be idled and unable to access network resources.
Many conditions could cause a large number of computers to fail a mass reboot attempt, leaving many servers hung and in need of manual user intervention. If the server computers are trying to obtain their operating system (“OS”) over a network connection, the reboot attempt could fail because the network connection is not available when reboot is attempted. The initial boot attempt might also fail if the network server containing the OS is temporarily unavailable when the boot attempt is performed. In addition, individual server computers could fail to reboot because of problems independent of network conditions. For example, servers that are infrequently rebooted could suffer from an inordinately long hard drive spin-up time. Such servers may fail an initial boot attempt and become hung because their hard drive does not spin up fast enough. These causes may combine to leave a large number of server computers in an unusable state requiring physical user intervention after a mass boot attempt. A way to avoid having a large number of computers requiring user intervention after a failed boot attempt is desirable.