The background description provided herein is for the purpose of generally presenting the context of the disclosure. Unless otherwise indicated herein, the materials described in this section are not prior art to the claims in this application and are not admitted to be prior art by inclusion in this section.
One of the biggest problems in managing today's datacenters is server outages and downtime. Any downtime in a datacenter is a costly, highly visible affair. The problem is amplified for large datacenter providers because of the sheer number of nodes they operate across their global datacenters. When an IT administrator is managing hundreds of thousands of nodes, maintaining the mission-critical status of that equipment and avoiding client outages are paramount.
One of the biggest issues that arises in large datacenters is client-reported outages. Under most circumstances, these outages are rooted in the need for a server to physically restart, which in turn causes all of the sessions hosted by the server to be disconnected. Today, server solutions mitigate some of this by performing the equivalent of a kernel soft reboot, in which the platform itself does not reset and only the sessions are temporarily restarted. However, some system events (e.g., firmware updates that require fabric resets for frequency changes) require a platform-level reset. When these occur, they wreak havoc in the datacenter, especially considering that many thousands of servers are deployed in any IT-managed datacenter. These occurrences happen much more frequently than one might think.
By definition, any server that is to maintain 99.999% (five 9's) uptime must not be unavailable for more than about 5¼ minutes a year (0.001% of the 525,600 minutes in a year is roughly 5.26 minutes). Most server solutions have put in place a mechanism called KSR (kernel soft reboot) to address what is otherwise a lengthy initialization time; this is very much akin to what Linux does with KEXEC, which avoids a full reset of the system. The server can bring a session down and back up without encumbering it with heavy initialization times. However, a full system reset cannot be avoided in some cases, and this is where boot times are often in excess of three minutes. At three minutes or more per boot, a second full boot would exceed the annual budget, so a server today can go through a full boot cycle only once a year and still hope to maintain its mission-critical status. This is an untenable situation for most datacenters.
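The downtime arithmetic above can be checked with a short calculation. The following is only an illustrative sketch: the function names are hypothetical, a 365-day year is assumed, and the three-minute boot time is taken as the lower bound stated above rather than a measured figure.

```python
# Illustrative sketch only; names and the 365-day year are assumptions.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_budget_minutes(availability: float) -> float:
    """Minutes per year a server may be unavailable at a given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def max_full_boots(availability: float, boot_minutes: float) -> int:
    """Whole number of full boot cycles that fit in the yearly downtime budget."""
    return int(downtime_budget_minutes(availability) // boot_minutes)


if __name__ == "__main__":
    budget = downtime_budget_minutes(0.99999)  # five 9's
    print(f"{budget:.2f} minutes per year")    # prints "5.26 minutes per year"
    print(max_full_boots(0.99999, 3.0))        # prints 1
```

Under these assumptions, a single three-minute boot consumes more than half of the annual five-nines budget, and any boot longer than about 2.6 minutes leaves no room for a second one.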