The present invention relates to power-up failures experienced in complex computing systems, and more particularly, to preventing unexpected power-up failures of individual hardware components of complex computing systems.
All computing systems are subject to failures and malfunctions from time to time. Some failures may be foreseeable and preventable, while others may be random, unexpected, and ultimately unresolvable. Moreover, some failures may be based on software, firmware, or some other hard-coded or soft-coded logic issue, while other failures may be hardware-based. Among these hardware-based failures, there is a subset that is detected only when a device is powered on, such as when a Power On Self Test (POST) is run on the device immediately after power-on. In a fully redundant system, these hardware-based failures may be dealt with without loss of data or access to data, as long as the failures on the redundant components performing the same task occur at different times.
However, when a system loses input power (e.g., a site-wide power outage), all devices lose power at the same time, so the devices cannot be powered down selectively, one at a time. Thereafter, as the system powers back up, components and devices within the system, including redundant devices, may detect failures during the power-up process. When such power-up failures are detected at the same time on redundant devices, the system is not able to fully come online in a timely manner. This inability to bring the system fully back online in a timely manner after a power loss is a significant problem for enterprise-level Information Technology (IT) data centers.
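By way of illustration only, the contrast between staggered and simultaneous power-up failures in a redundant pair may be sketched as follows. This is a simplified model under stated assumptions, not an implementation of any particular system: the names Component, latent_fault, staggered_restart, and simultaneous_restart are hypothetical, and a failed component is assumed to be repaired or replaced immediately in the staggered case.

```python
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical redundant hardware component."""
    name: str
    latent_fault: bool      # assumed fault that only a power-on test reveals
    failed: bool = False

    def power_cycle(self) -> None:
        # A latent fault surfaces only when the component is powered on again.
        self.failed = self.latent_fault


def group_available(group: list) -> bool:
    # A redundant group keeps serving while at least one member is healthy.
    return any(not c.failed for c in group)


def staggered_restart(group: list) -> bool:
    """Power-cycle one component at a time, repairing each before the next."""
    always_available = True
    for c in group:
        c.power_cycle()
        always_available = always_available and group_available(group)
        if c.failed:
            # Assumption: the failed part is repaired or replaced right away.
            c.latent_fault = False
            c.failed = False
    return always_available


def simultaneous_restart(group: list) -> bool:
    """Model a site-wide outage: every component powers up at the same time."""
    for c in group:
        c.power_cycle()
    return group_available(group)


def make_pair() -> list:
    # Both redundant components carry an undetected latent fault.
    return [Component("A", latent_fault=True), Component("B", latent_fault=True)]


if __name__ == "__main__":
    print("staggered restart keeps service up:", staggered_restart(make_pair()))
    print("simultaneous restart keeps service up:", simultaneous_restart(make_pair()))
```

In this model, the staggered case remains available because each failure is detected while the other redundant component is still running, whereas the simultaneous case exposes both latent faults at once and leaves no healthy component.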