1. Technical Field
The present invention relates in general to clustered computer systems and in particular to management of clustered computer systems. Still more particularly, the present invention relates to a method and system for performing liveness checks within a cluster of thin servers or server appliances utilizing programmable network interface cards.
2. Description of the Related Art
There is a desire within the computer industry to provide more efficient ways to manage a farm or cluster of thin servers or server appliances being utilized for web-based application service as a single system while maintaining multiple execution images.
Currently, server systems that require management often contain service processor cards with a standard microprocessor, local memory, and a standard bus interface. These service processor cards typically have an interface to a specialized management bus located on the server system planar. State-of-the-art examples of service processor cards include the Wiseman card manufactured by International Business Machines for Netfinity and related products, integrated implementations of the same function, and the Emergency Management Processor (EMP) card manufactured by Intel Corporation. Another recent example is the implementation by Intel Corporation of the management bus, baseboard management controller and processor, and emergency management port on the system planar.
Current server systems with a service processor implementation (card or integrated on the planar) contain both a general network interface and the service processor, which, in turn, has a private network interface that is typically a serial one. The network interface card is an input/output (I/O) device, usually attached by an I/O bus to processor and memory, while the service processor is attached not only to the I/O bus but also to a bus such as the Inter-Integrated Circuit (I2C) bus that gives it access to hardware sensor information such as power levels, temperatures, fan speeds and so forth.
Typically, both the network interface card and the service processor are programmable, and since the service processor is connected to the external world through one or more communications interfaces, both have physical network connectors and network hardware and software implementations. Also, the network interface card has to be "awake" enough to respond to wake-on-LAN packets, and the service processor generally supports an interface for powering up the system. Thus, both require some form of always-active logic and partial power.
When large numbers of thin servers or server appliances are congregated together in a single rack or set of racks to provide critical services such as running web-based applications, they must be managed. In the MetaServer environment, one design goal is to minimize the memory and performance impact of management on the thin servers running the applications by off-loading as much as possible to a management server appliance, referred to herein as the MetaServer. One of the most important things that a MetaServer must monitor is the liveness of the servers in the MetaServer's environment.
Liveness, in the context of the MetaServer environment, refers to whether all of the elements or systems in the server farm are connected to the network and making forward progress in their processing or data transmission activities. The MetaServer system maintains a current representation of which systems are live and which are not. However, maintaining liveness information for a large number of systems can involve substantial overhead on both the MetaServer and the systems being managed (the targets), and requires the transmission of a large number of packets on the network, which consumes network bandwidth solely for the purposes of monitoring. The more often the MetaServer updates the liveness information about the systems in the server farm, the greater the overhead incurred.
The current state of the art in liveness detection has the operating system images in a cluster of thin servers or server appliances exchange messages called heartbeats. The process is often completed on a special management network although the application network may also be utilized. There have been a few, very specialized interconnection mechanisms that have implemented their own connectivity detection schemes, but these only monitor the interconnection mechanism itself and not the state of the systems attached to the interconnection mechanism.
The present invention recognizes that it would be desirable to provide a simple and low-overhead way of determining liveness within a MetaServer environment. A method and system that utilizes the facilities provided by programmable network interface cards to perform liveness detection without requiring additional hardware would be a welcome improvement. These and other benefits are provided by the present invention.
Disclosed are a method and system for determining liveness of target managed systems within a MetaServer environment utilizing the function provided by programmable network interfaces included within the targets and the management server appliance or MetaServer. The network interface, whether in card or integrated form, is provided with direct access to the associated target computer's processor and memory. Specifically, one or more memory locations on the target are allocated to store values that indicate the liveness of the system. The parameters selected are ones that increase when the computer system is making forward progress or operating. The MetaServer stores initial parameter values for the target when the target is started. After a pre-selected period, the network interface of the target accesses the associated memory locations and provides the current values to the MetaServer. The MetaServer stores the new values and compares them with the previously stored values. When the new values are larger than the previously stored values, the target is assumed to be live. Otherwise the target is assumed to have failed.
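The comparison step described above can be illustrated with a minimal sketch. It is not taken from the disclosure: the class name `LivenessTable`, its methods, and the representation of the parameter values as a tuple of integers are hypothetical. The sketch also assumes that "larger" means every monitored value has strictly increased; the text does not say how multiple values are to be combined.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Counters = Tuple[int, ...]


@dataclass
class LivenessTable:
    """Hypothetical MetaServer-side record of the last values seen per target."""

    last_seen: Dict[str, Counters] = field(default_factory=dict)

    def register(self, target: str, initial: Counters) -> None:
        # Store the initial parameter values when the target is started.
        self.last_seen[target] = initial

    def update(self, target: str, current: Counters) -> bool:
        """Store the new values and report whether the target appears live."""
        previous = self.last_seen[target]
        # Assumption: the target is live only if every value strictly increased.
        live = (
            bool(current)
            and len(current) == len(previous)
            and all(new > old for new, old in zip(current, previous))
        )
        # The new values replace the old ones whether or not the check passed.
        self.last_seen[target] = current
        return live
```

Under this assumption, a target whose values are unchanged since the previous report is treated as failed, as is one for which only some of the values advanced.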
One of two different trigger mechanisms for liveness checking may be implemented. The first, a push mechanism, requires that an internal timer be set at the target; the timer signals the network interface, at a particular frequency, when to transmit the parameter values stored in the memory to the MetaServer. The second, a pull mechanism, involves the MetaServer periodically transmitting a request packet to the network interface of the specific target. The target then responds with a reply packet containing the parameter values.
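The difference between the two trigger mechanisms can be sketched as follows. This is an in-process illustration only, with no real packets or timers: the names `TargetNic`, `MetaServer`, `on_timer`, `handle_request`, and `poll` are hypothetical, and a plain callable stands in for the network interface's direct access to the target's memory.

```python
from typing import Callable, List, Tuple

Counters = Tuple[int, ...]


class TargetNic:
    """Hypothetical model of a target's programmable network interface."""

    def __init__(self, name: str, read_memory: Callable[[], Counters]) -> None:
        self.name = name
        # Stand-in for direct access to the target's memory locations.
        self.read_memory = read_memory

    def on_timer(self, send: Callable[[str, Counters], None]) -> None:
        # Push: the target's internal timer fired, so transmit the values.
        send(self.name, self.read_memory())

    def handle_request(self) -> Tuple[str, Counters]:
        # Pull: answer a request packet with a reply carrying the values.
        return (self.name, self.read_memory())


class MetaServer:
    """Hypothetical management appliance that collects reported values."""

    def __init__(self) -> None:
        self.received: List[Tuple[str, Counters]] = []

    def receive(self, name: str, values: Counters) -> None:
        self.received.append((name, values))

    def poll(self, targets: List[TargetNic]) -> None:
        # Pull: request the current values from each target in turn.
        for nic in targets:
            self.receive(*nic.handle_request())
```

In the push case the target decides when to report and the MetaServer only listens; in the pull case the MetaServer controls the reporting rate at the cost of one request packet per check.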
All objects, features, and advantages of the present invention will become apparent in the following detailed written description.