1. Technical Field
The present disclosure generally relates to information handling systems (IHS) and, in particular, to failure detection and recovery within information handling systems.
2. Description of the Related Art
As the value and use of information continue to increase, individuals and businesses seek additional ways to process and store information. One option available to users is information handling systems. An information handling system (IHS) generally processes, compiles, stores, and/or communicates information or data for business, personal, or other purposes, thereby allowing users to take advantage of the value of the information. Because technology and information handling needs and requirements vary between different users or applications, information handling systems may also vary regarding what information is handled, how the information is handled, how much information is processed, stored, or communicated, and how quickly and efficiently the information may be processed, stored, or communicated. The variations in information handling systems allow for information handling systems to be general or configured for a specific user or specific use such as financial transaction processing, airline reservations, enterprise data storage, or global communications. In addition, information handling systems may include a variety of hardware and software components that may be configured to process, store, and communicate information and may include one or more computer systems, data storage systems, and networking systems.
An IHS typically includes a number of different hardware components that operate using a set of control/operating code, or firmware. The firmware is an integral part of these hardware components and can vary in complexity. As with any piece of software, failures can occur during firmware execution, often as a result of defects in the firmware. Firmware defects can be very serious, especially if the defect affects the operation of components that provide remote and/or network access to/from the IHS, or if the particular firmware handles critical system management tasks that keep the server workloads running flawlessly. The typical lifecycle of an issue detected in the field involves information technology (IT) support trying to recreate the exact problem, gathering all analytical information about the problem, reviewing any historical data associated with the involved components, and suggesting a set of steps to recover from the problem state. These steps can be performed by technical support alone or with help from an engineering development organization. Once a fix is identified, the fix is rolled into a future firmware release, and a workaround is applied in the interim.
There are several problems with the above-described process. First, the process is ad hoc. In addition, the IHS facing the issue is taken down (stopping the processing of active workloads) in order to try out solutions and apply a patch for the workaround. The patches may or may not persist across alternating current (AC) and direct current (DC) power cycles, exposing the IHS again to the same or another issue. Furthermore, none of the information learned while identifying a patch is archived for application to another system or customer facing a similar issue.