1. Technical Field
The present disclosure generally relates to information handling systems and, in particular, to server boot failure recovery within information handling systems.
2. Description of the Related Art
As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system (IHS) generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
IHSs include a category of systems called converged shared infrastructure systems. A converged infrastructure operates by grouping multiple information technology (IT) components, such as other IHSs, into a single computing package. Components of a converged infrastructure may include servers (which can include host compute nodes), data storage devices, networking equipment, and software for IT infrastructure management. Converged infrastructure provides centralized management of IT resources, system consolidation, increased resource-utilization rates, and lower operational costs.
Following the application of electrical power or a system reset, a server IHS implements a boot-up operation. Often referred to as the binding process, server boot-up involves the basic input/output system (BIOS) loading different vendor drivers and also mapping and managing the drivers and devices. To achieve system management on the supported vendor devices, the BIOS detects these devices and loads appropriate drivers required to facilitate system management functionality related to the respective device. During this process, customers may face one or both of the following issues: (i) there is no operational user control functionality on the device being plugged into the system once the devices leave the factory (or manufacturing facility); and (ii) the drivers may expose issues when certain use cases, such as configuration changes or firmware updates, are executed through a pre-boot application.
The system may enter a bad state (e.g., hang or crash) when there are issues with the drivers/devices. Possible reasons for the crash include the following: (1) an issue with a device UEFI driver; and (2) an issue with a pre-boot application accessing the driver. Recovery from these crash situations involves tedious troubleshooting, including identifying and/or understanding which device/driver is causing the system to go into the bad state.
Currently, the only workaround for recovering from an issue seen during the binding process involves removing the cards one by one until the faulty adapter is located. If the issue happens during a pre-boot application execution phase, such as during inventory collection, job execution, or launching a pre-boot user interface (UI), a customer may have to adopt one of the following recovery methods: (i) remove cards one by one until the faulty device is located; (ii) disable the slots one by one until the faulty device is located; or (iii) disable pre-boot applications. The above-mentioned recovery methods are tedious manual processes, which are not feasible solutions when these devices/drivers enter a bad state and are deployed in large data centers.
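For illustration only, the one-by-one elimination described above can be sketched as a short Python routine. The names find_faulty_slot and simulated_boot are hypothetical and do not correspond to any BIOS or vendor interface. The sketch assumes that exactly one device is faulty and that a boot attempt either succeeds or fails as a whole. It shows that the manual procedure may need one full boot attempt per slot, which is why it scales poorly across many servers.

```python
from typing import Callable, FrozenSet, Optional, Sequence, Tuple


def find_faulty_slot(
    slots: Sequence[str],
    boots_ok: Callable[[FrozenSet[str]], bool],
) -> Tuple[Optional[str], int]:
    """Disable one slot at a time and retry the boot (hypothetical helper).

    Returns the slot whose removal lets the system boot, together with
    the number of boot attempts that were needed to find it.
    """
    attempts = 0
    for slot in slots:
        attempts += 1
        # Boot with every slot enabled except the current candidate.
        enabled = frozenset(slots) - {slot}
        if boots_ok(enabled):
            return slot, attempts
    # No single slot explains the failure under the stated assumption.
    return None, attempts


def simulated_boot(enabled: FrozenSet[str]) -> bool:
    """Stand-in for a real boot attempt: fails while 'slot3' is enabled."""
    return "slot3" not in enabled


result = find_faulty_slot(["slot1", "slot2", "slot3", "slot4"], simulated_boot)
```

In this simulated case, the faulty device sits in the third of four slots, so three boot attempts are required before it is located. In the worst case, the number of attempts equals the number of populated slots, and each attempt in practice requires a manual intervention and a full reboot.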