1. Field of the Invention
The invention relates to network service groups, and more particularly, to a method and apparatus for monitoring the status of services performed by such groups.
2. Description of Related Art
In a client-server networking environment, a network service is typically provided by an application running on a server machine that processes requests sent from many client machines via the network. The primary challenge is to provide a scalable and reliable platform for a network service to process an increasing volume of client requests. U.S. patent application Ser. No. 08/763,289 entitled "Load Balancing and Failover of Network Services," to Swee Boon Lim, and filed Dec. 9, 1996, now U.S. Pat. No. 5,938,732, describes a system having a scalable and reliable architecture that uses a group of server machines, each running the same application, to cooperatively provide a network service. To a client machine, the group of server machines, hereafter referred to as a service group, appears as a single server machine.
In such applications, one server machine may provide multiple network services and may belong to multiple service groups. Each server machine that is a member of a service group has a service monitor that monitors the workload and determines the availability status of the service on the server machine. The service monitors of a service group communicate with each other through a network. One of the service monitors is designated as the service group leader, which periodically collects the workload and availability status of each member of the service group. The service group leader uses the information for load balancing and for recovery from service crashes.
There are three methods that a service monitor can use to obtain the workload and availability status of a service: the direct method of remote procedure calls (RPCs), the direct method of posting, and the indirect method of using operating system services. Under the first method, the RPC method, the service monitor implements the client side of the RPC, and the service being monitored implements the server side of the RPC. The service monitor periodically makes a remote procedure call to the service in order to obtain workload information. If the service monitor does not receive an RPC reply from the service and a retry after a certain interval also fails, the service monitor can safely assume that the service is dead.
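The polling-with-retry behavior described for the RPC method can be sketched as follows. This is a minimal illustration only: the function name poll_workload, the callable standing in for the RPC client stub, and the choice of exceptions that signal a missing reply are all assumptions, not part of any system described here.

```python
import time
from typing import Callable, Optional


def poll_workload(
    call: Callable[[], int],
    retry_delay: float,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[int]:
    """Ask the service for its workload; retry once after retry_delay seconds.

    `call` is a hypothetical stand-in for the RPC client stub. It is assumed
    to raise ConnectionError or TimeoutError when no reply is received.
    Returns the workload, or None if both attempts fail (service assumed dead).
    """
    for attempt in range(2):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == 0:
                sleep(retry_delay)  # wait before the single retry
    return None
```

A monitor built this way would treat a None result as the service being dead and report that to the service group leader.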
While the RPC approach is straightforward, programming RPCs and setting up the RPC client and server is a complex and expensive procedure, making other options, such as the posting method, attractive. Under the posting method, a service periodically and separately publishes its workload and availability status (for example, as the current time) using operating system services such as shared memory or IP multicast or broadcast. A service monitor periodically retrieves the published information from the service that it monitors, possibly at a different interval than that at which the service publishes, and determines the status of the service therefrom. The service monitor can thus detect that the service that it monitors is not available if the published time that represents service availability does not change after a certain interval. Again, however, a drawback of the posting method is the complex programming technique needed to program a service and its service monitor.
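The staleness test used by the posting method can be illustrated without any transport at all. The sketch below assumes the monitor has already retrieved the published value by some means (shared memory, multicast, and so on); the class name PostingMonitor and its parameters are hypothetical.

```python
class PostingMonitor:
    """Declare a service unavailable if its published value stops changing."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout      # seconds a value may stay unchanged
        self._last_value = None     # most recent distinct published value
        self._last_change = None    # monitor time at which it was first seen

    def observe(self, published_value, now: float) -> bool:
        """Record one sample taken at time `now`; return True if available."""
        if self._last_change is None or published_value != self._last_value:
            self._last_value = published_value
            self._last_change = now
            return True
        # Value unchanged: still available only while within the timeout.
        return (now - self._last_change) < self.timeout
```

Because the monitor only compares successive samples, its sampling interval does not have to match the interval at which the service publishes.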
Under the indirect method of using operating system services, a service monitor can take advantage of utilities provided by the operating system to determine, for example, whether a computing process exists, or to obtain the overall CPU (Central Processing Unit) utilization of the system, without directly communicating with the service it monitors. This is particularly useful since service monitors and applications running on a server machine are typically implemented as background computing processes in a server machine and thus lend themselves to such interaction. One operating system that provides such utilities is Solaris™ of Sun Microsystems.
As an example, a service monitor may rely on the CPU utilization of a server machine as a representation of the current workload of a service running on the server machine. There is a direct relationship between the CPU utilization of a system and the workload of a service running in the system. This is because, if the CPU utilization of a server machine is high, too few CPU cycles remain for the service on that machine to process client requests, which is equivalent to a high workload.
Similarly, a service monitor can periodically check the existence of the computing process that implements the service. If the computing process of the service does not exist, the service is not available.
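On a POSIX system, one common way to perform such an existence check is to send signal 0 to the process, which performs error checking without delivering a signal. The sketch below assumes the monitor already knows the service's process ID; the function name process_exists is hypothetical, and the code should not be used on Windows, where os.kill behaves differently.

```python
import os


def process_exists(pid: int) -> bool:
    """Return True if a process with this PID currently exists (POSIX only)."""
    if pid <= 0:
        # Non-positive PIDs address process groups, not a single process.
        raise ValueError("pid must be positive")
    try:
        os.kill(pid, 0)  # signal 0: existence and permission check only
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # process exists but is owned by another user
    return True
```

Note that a True result only shows that the process exists, not that it is able to process client requests.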
The indirect method is efficient since the cost of obtaining workload or availability status is merely the cost of using the system calls provided by the host operating system. In addition, the service monitor can obtain the workload and service availability information any time without waiting for the service to respond.
A drawback of the indirect method is that it cannot obtain workload information specific to a service. In addition, the indirect method cannot reliably determine that a service is available: it cannot distinguish a hung computing process that cannot process client requests from a normal computing process that can.
Thus, it is desirable for a service monitor to obtain workload information and availability status directly from the service that it monitors, since such an approach would yield more accurate results. Since direct methods conventionally involve making changes to the services so that workload information specific to the services can be communicated to the service monitors, it is desirable that this information be communicated without using complex programming techniques such as shared memory, remote procedure calls or network programming. Finally, it is desirable to reduce the cost of obtaining this information in order to maximize service throughput.
The invention overcomes the deficiencies of the prior art by providing a covert channel to allow communication between a service monitor and the service that it monitors without incurring excessive overhead for monitoring and updating or passing messages indicating certain information. In the preferred embodiment, the covert channel is a communication file whose size corresponds to the workload of the service being monitored, such that the service monitor can determine the workload of the service by merely examining the communication file size attribute. The communication file is also constantly updated by the service in order to provide a "heartbeat" to the service monitor indicating that the service is available.
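From the monitor's side, reading such a communication file reduces to a single stat call. The following is a minimal sketch under stated assumptions: the names ServiceStatus and read_status, the max_age_seconds threshold, and the choice to treat a missing file as "unavailable with zero workload" are all illustrative and not taken from the description above.

```python
import os
import time
from typing import NamedTuple, Optional


class ServiceStatus(NamedTuple):
    workload: int     # taken from the file size attribute
    available: bool   # True if the file was modified recently enough


def read_status(
    path: str, max_age_seconds: float, now: Optional[float] = None
) -> ServiceStatus:
    """Derive workload and availability from file attributes only."""
    try:
        st = os.stat(path)  # attributes only; file contents are never read
    except FileNotFoundError:
        return ServiceStatus(workload=0, available=False)  # assumed policy
    if now is None:
        now = time.time()
    heartbeat_ok = (now - st.st_mtime) <= max_age_seconds
    return ServiceStatus(workload=st.st_size, available=heartbeat_ok)
```

No message is exchanged and the service is never interrupted; the only cost to the monitor is the stat call itself.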
In a second embodiment in accordance with the invention, the "heartbeat" is provided by a second communication file for that service. A separate communication file is especially desirable in systems which do not provide a time stamp, or last modification date of a file, in which case the second communication file, by being continuously modified in size by the service, provides the indication that the service is available.
The service or process running on a server in a multi-server environment thus periodically updates information about the communication file to indicate the status and availability of the process. The file is typically a "holey" file, that is, one that occupies no file system memory. The running process updates the size of the file to indicate, for example, the workload of the running process, and the date-last-modified to indicate the availability of the running process. Any other running process and/or monitor, and even ones on other servers, need only examine the file attributes to determine the running process's status and availability. Thus, a covert channel is established between the running process and the monitoring process, bypassing all normal message processing overhead.
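The service side of this channel can be sketched in a few lines. The example assumes a POSIX-like system on which extending a file by truncation leaves an unallocated hole (the usual behavior on Unix file systems, though it is file-system dependent); the function name publish_workload and the mapping of workload directly to a byte count are illustrative assumptions.

```python
import os


def publish_workload(path: str, workload: int) -> None:
    """Set the communication file's size to `workload` and refresh its mtime.

    No data is written: the size is changed with ftruncate, so on most Unix
    file systems the file remains a hole that consumes no data blocks.
    """
    if workload < 0:
        raise ValueError("workload must be non-negative")
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.ftruncate(fd, workload)  # size attribute now encodes the workload
    finally:
        os.close(fd)
    # Explicitly refresh the modification time so the heartbeat advances
    # even when the workload, and hence the size, has not changed.
    os.utime(path, None)
```

A service would call this periodically from its main loop; any monitor that can stat the file, including one on another server that mounts the same file system, can then read the size and modification time.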