Information Technology organizations are frustrated with the cost of server downtime, the rising frequency of malicious attacks and programs, and the significant operational expenditures required for managing their systems. The complexity and cost of server management are increasing. Server management tasks include fault diagnosis, software upgrades and patching, backup and recovery, monitoring, and resource reallocation. Many architectural solutions to the server management problem have been devised with a variety of hardware and software components; however, complete delegation of management authority to the server has not yet been achieved. High levels of human intervention are still required for server management.
IBM™ has established a new paradigm in computer networks called “Autonomic Computing™”. In short, autonomic computing is an approach to “self-managed” computing systems with a minimum of human interference. A true, complete autonomic system requires that computer systems be reconstructed from software at the high end to hardware components in every computer. In addition to IBM, Microsoft™, HP™, Dell™ and Sun™ have begun their own autonomic computing initiatives. While “total” autonomic computing may be the ideal, its successful and complete implementation will take years. There are alternative approaches that will help organizations increase their systems' uptime while lowering their systems' Total Cost of Ownership (TCO).
The quickest way to begin reducing the TCO is to attack the area of “least value-add”, namely labor. The two most common existing approaches to reducing labor costs are adding hardware and adding software.
With regard to hardware-based approaches, so-called “watchdog” cards are inserted into host computers. A separate software process runs on the host and communicates heartbeats to the watchdog card across the bus. If no heartbeat is received for a pre-set amount of time, an on-board timer expires, causing the watchdog card to cycle the power of the host. Remote Management Cards (RMCs) are add-in server cards that allow system administrators (SYSADMINs) independent access to the server. The key problem with RMCs is that a SYSADMIN is required to intervene if the card detects a problem, thus negating any cost savings. Furthermore, RMCs are not autonomic. RMCs are able to scan vast amounts of information, but they are unable to make decisions on their own. RMCs are generally higher-power cards with a more powerful CPU (e.g., 200 MHz). This allows a remote system administrator to log into a host machine and redirect KVM (Keyboard, Video, Mouse) across the remote connection. This facilitates configuration and troubleshooting as though the administrator were sitting directly in front of the actual host. Host video and keyboard connectors are actually inserted into the remote management card ports. Most remote management cards also offer watchdog timer functionality.
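The heartbeat mechanism described above can be illustrated with a minimal simulation. This is only a sketch under stated assumptions: the class name `WatchdogTimer`, the 5-second timeout, and the `power_cycle` callback are hypothetical and are not taken from any particular watchdog card. Time is passed in explicitly so that the behavior is deterministic.

```python
class WatchdogTimer:
    """Simulated watchdog card (hypothetical): resets the host when heartbeats stop."""

    def __init__(self, timeout_s, power_cycle):
        self.timeout_s = timeout_s        # pre-set interval allowed between heartbeats
        self.power_cycle = power_cycle    # callback standing in for cycling host power
        self.last_heartbeat = 0.0

    def heartbeat(self, now):
        """Called by the host-side process across the bus while the host is healthy."""
        self.last_heartbeat = now

    def tick(self, now):
        """Called periodically by the on-board timer; returns True if the host was reset."""
        if now - self.last_heartbeat >= self.timeout_s:
            self.power_cycle()
            self.last_heartbeat = now     # restart the countdown after the reset
            return True
        return False


def simulate():
    """Host sends heartbeats at t = 1, 2, 3 seconds and then hangs."""
    resets = []
    wd = WatchdogTimer(timeout_s=5.0, power_cycle=lambda: resets.append("power-cycle"))
    for t in (1.0, 2.0, 3.0):
        wd.heartbeat(t)
        wd.tick(t)
    fired_at = None
    for t in (4.0, 5.0, 6.0, 7.0, 8.0, 9.0):   # no more heartbeats arrive
        if wd.tick(t) and fired_at is None:
            fired_at = t
    return fired_at, resets
```

In this sketch the card acts without any human involvement, but it can only apply one blunt remedy, a power cycle, regardless of what actually went wrong on the host.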
FIG. 1 is a schematic diagram of a current network administration system. A server system 1 includes a server chassis 2, a server 4, a Network Interface Card (NIC) 6 and a lights out card 10. Communications in the server chassis 2 are provided for by communications bus 16. The lights out card 10 and network administration system 12 are elements of a management system 8. The lights out card 10 provides local control capability, including power, and intercepts for the keyboard, video and mouse. The lights out card 10 further gathers data and invokes simple actions initiated by a remote user. Monitoring of the server 4 is performed by a Host Management Agent (HMA) 14. The observations made by the HMA 14 are sent to the Network Administration System 12. The system 12 receives telemetry data from and provides policy management for the server 4. It will be apparent to one skilled in the art that there may be more than one server and one or more peripheral cards in the server chassis 2. The system 12 further determines appropriate actions based on the collected information and provides information to a user for evaluation.
Host Monitoring Agents (HMAs) are software-oriented products that are relatively inexpensive and are typically based on industry standards. HMAs reside on a server's primary hardware and operate in conjunction with the host operating system. However, the major drawbacks of HMAs are that they impact host performance (HMAs take away CPU cycles from the server to run monitoring software), are susceptible to host faults (if the host crashes, the HMA crashes), and have narrow monitoring visibility (they do not monitor internal functions of operating systems and applications). This category of products features software agents that monitor the host system(s). Operational information collected is passed to a separate management station across the network, where it passes through several “filters”. Upon a successful match by a filter, a response action is initiated, being either a corrective action on the host (e.g., restarting a failed process) or notification (e.g., by email) of a human system administrator.
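The filter stage at the management station can be sketched as follows. This is an illustrative assumption rather than the behavior of any particular product: the `Filter` structure, the event fields (`state`, `disk_used_pct`), and the two example filters are hypothetical and chosen only to show a match leading to either a corrective action or a notification.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Filter:
    """Hypothetical filter: a predicate over an event plus the response it triggers."""
    name: str
    matches: Callable[[dict], bool]
    action: str   # "correct" (act on the host) or "notify" (alert an administrator)
    detail: str


# Example filters; names and thresholds are assumptions for illustration only.
FILTERS: List[Filter] = [
    Filter("failed-process", lambda e: e.get("state") == "failed",
           "correct", "restart failed process"),
    Filter("disk-nearly-full", lambda e: e.get("disk_used_pct", 0) >= 95,
           "notify", "email system administrator"),
]


def evaluate(event: dict, filters: List[Filter]) -> List[Tuple[str, str]]:
    """Pass one piece of operational information through every filter.

    Returns the response actions of all filters that matched, in filter order.
    """
    return [(f.action, f.detail) for f in filters if f.matches(event)]
```

For example, `evaluate({"process": "httpd", "state": "failed"}, FILTERS)` yields a corrective action, while an event that matches no filter yields an empty list and triggers no response. Because the agent supplying these events runs on the host itself, no events arrive at all once the host has crashed.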
Some technologies that have recently entered the market allow a server or desktop to create a protected partition on the host hard drive. In the event of hard drive corruption, the user can essentially replace the master partition with the copy in the protected partition.
There is a need for new hardware and software that address the issues left unresolved by these approaches, thereby increasing the availability and reducing the TCO of servers.