One of the major responsibilities of a system administrator in a datacenter is remote data recovery after disk drive and operating system failures. Current techniques for recovering data from failed disk drives can be manually intensive. In some cases, a data recovery operation includes one or more of visiting the datacenter, selecting recovery media, reconfiguring the hardware device to boot from the selected recovery media, and so on. The data recovery operation can become even more complicated in a heterogeneous datacenter having multiple operating systems, file systems, and vendor devices.
Another major responsibility of a system administrator in a datacenter is remote diagnostics of complex hardware component failures that may be isolated to a single field replaceable unit (FRU). For example, in hyperscale environments with tens of thousands of servers, reliability and availability are built into the application layer, making single or multiple node failures a non-concern from an application availability perspective. In normal scenarios, when a hardware component fails, diagnostic software may be run on the hardware component/device to detect any potential failures, and the hardware component may be either replaced or reimaged completely before the device is placed back in operation in a cluster. However, in non-hyperscale and mission-critical environments, it may be necessary to perform root cause analysis to determine the nature of the hardware component failure before initiating a failback operation. Due to the complexity of hardware component designs and the nature of hardware component failures, it may not be possible to accurately isolate a failure to a single FRU using the tools shipped with a hardware device. In such situations, remote diagnostic tools may have to be launched and run in an offline mode to determine the nature of the hardware component failure. For mission-critical operations, this can be very time-consuming and can significantly increase application downtime.