The present invention relates generally to computing systems, and more particularly to a system and method for monitoring the state and operability of components in distributed computing systems. The present invention indicates whether a component is operating correctly, and reliably distributes the state of all components among the elements of the system.
In any distributed computing system, it is desirable to monitor the state of the various components (e.g., to know which components are operating correctly and to detect which ones are not operable). It is further desirable to distribute the state of all components among the elements of the system.
In known prior art, “heartbeats,” sometimes referred to as “I'm alive” packets, are used to distribute the state of all components. Particularly, these types of packets are employed in computing systems that use a point-to-point messaging mechanism, and in cluster membership services that use a type of ring topology where messages are sent from one machine to the next in a chain including a list of current members. However, in all of these prior implementations, each machine sends a packet to every other machine, thereby requiring an N² algorithm to distribute state information. To reduce the number of messages from order N² to order N, the present invention uses a reliable multicast protocol to distribute state information.
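The difference in message count can be illustrated with a short calculation. The following is a minimal sketch, assuming that one round of state distribution means every machine announces its state once; the function names are hypothetical and are not part of the disclosed system.

```python
def point_to_point_messages(n: int) -> int:
    """Each of n machines sends one packet to each of the other n - 1 machines."""
    return n * (n - 1)


def multicast_messages(n: int) -> int:
    """Each of n machines sends a single multicast packet that reaches all others."""
    return n


if __name__ == "__main__":
    # Compare the two schemes for a few illustrative system sizes.
    for n in (4, 16, 64):
        print(n, point_to_point_messages(n), multicast_messages(n))
```

Under these assumptions, the point-to-point count grows quadratically with the number of machines, while the multicast count grows linearly.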
According to the disclosed embodiments, a method and system are provided for determining whether a given component in a distributed computing system is operating correctly, and for reliably distributing the state of the components among all the elements of the system.
One non-limiting advantage of the present invention is that it provides an update service that allows local processes to record, retrieve and distribute state information via table entries in a relational table.
Another non-limiting advantage of the present invention is that it provides an update service that allows processes on a given machine to communicate with a local agent of the update service using a reliable protocol.
Another non-limiting advantage of the present invention is that it provides an update service including a Life Support Service (LSS) process that stores information in separate relational tables for the various types of processes within a distributed computing system.
Another non-limiting advantage of the present invention is that it provides an update service that gives the LSS process read-write access to the relational tables while giving the local processes read-only access, which they may use to perform lookups or rescans of the local relational tables.
Another non-limiting advantage of the present invention is that it provides an update service that allows multiple processes on a given machine to perform lookups into the same or different relational tables in parallel without contention and without communication with a server by using a non-blocking coherency algorithm.
Another non-limiting advantage of the present invention is that it provides an update service that allows a specific local process to perform a rescan using a batch processing mechanism when notified of a large number of updates.
Another non-limiting advantage of the present invention is that it provides an update service that allows local updates to be propagated to all other LSS processes in the system.
Another non-limiting advantage of the present invention is that it provides a “heartbeat” service that promptly delivers failure notifications.
Another non-limiting advantage of the present invention is that it provides update and heartbeat services that are “lightweight” and greatly simplified as a result of using a reliable protocol.
According to one aspect of the present invention, a system is provided for monitoring state information in a distributed computing system, including a plurality of nodes which are coupled together by at least one switching fabric. The system includes an update service including a plurality of local applications, each of the local applications respectively residing on a unique one of the plurality of nodes and being adapted to record and update state information from local clients in a local relational table, and a system-wide application which is adapted to propagate the updated state information across the distributed computing system to a plurality of the local relational tables. The system may also include a heartbeat service which is adapted to selectively generate and receive messages throughout the system to indicate whether the components of the system are operating normally.
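One way a heartbeat service might decide whether a component is operating normally is to track when each component was last heard from. The following is a minimal sketch, assuming a simple timeout rule; the timeout rule, the class name `HeartbeatMonitor`, and its methods are illustrative assumptions and are not taken from the disclosure.

```python
class HeartbeatMonitor:
    """Hypothetical monitor: a component is treated as failed if no
    heartbeat has been recorded for it within `timeout` seconds."""

    def __init__(self, timeout: float) -> None:
        self.timeout = timeout
        self.last_seen: dict[str, float] = {}  # component name -> last heartbeat time

    def record(self, component: str, now: float) -> None:
        """Note that a heartbeat message from `component` arrived at time `now`."""
        self.last_seen[component] = now

    def failed(self, now: float) -> list[str]:
        """Return the components whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


if __name__ == "__main__":
    monitor = HeartbeatMonitor(timeout=3.0)
    monitor.record("node-a", now=0.0)
    monitor.record("node-b", now=2.0)
    print(monitor.failed(now=4.0))
```

Timestamps are passed in explicitly so that the sketch stays deterministic; a real service would presumably read a clock and receive heartbeats from the network.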
According to a second aspect of the invention, a distributed file system is provided. The distributed file system includes at least one switching fabric; a plurality of nodes which provide at least one file system service process, and which are communicatively coupled together by the at least one switching fabric; a plurality of local update service applications that respectively reside upon the plurality of nodes and which update state information from local clients on the plurality of nodes in a plurality of local relational tables; and a system-wide update service application which communicates updated state information across the distributed file system to a plurality of local relational tables.
According to a third aspect of the invention, a method of monitoring the state of components in a distributed computing system is provided. The distributed computing system includes a plurality of interconnected service nodes, each including at least one local client. The method includes the steps of: monitoring the state of the local clients on each service node; updating information relating to the state of the local clients in a plurality of local relational tables respectively residing on the plurality of service nodes; and communicating the updated state information to the local relational tables on the service nodes over a multicast channel.
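The three steps of the method can be sketched with an in-memory simulation. This is an illustrative sketch only: the multicast channel is modeled as a plain Python object rather than a network protocol, each local relational table is modeled as a dictionary, and the names `MulticastChannel`, `ServiceNode`, `report`, and `apply` are hypothetical.

```python
class MulticastChannel:
    """Simulated multicast channel: one send reaches every other member."""

    def __init__(self) -> None:
        self.members = []
        self.sent = 0  # number of multicast packets sent

    def join(self, node) -> None:
        self.members.append(node)

    def send(self, sender, update) -> None:
        self.sent += 1
        for node in self.members:
            if node is not sender:
                node.apply(update)


class ServiceNode:
    """Service node holding a local table keyed by (node name, client name)."""

    def __init__(self, name: str, channel: MulticastChannel) -> None:
        self.name = name
        self.channel = channel
        self.table = {}
        channel.join(self)

    def report(self, client: str, state: str) -> None:
        """Record a monitored client's state locally, then multicast the update."""
        update = (self.name, client, state)
        self.apply(update)
        self.channel.send(self, update)

    def apply(self, update) -> None:
        """Write an update into this node's local table."""
        node, client, state = update
        self.table[(node, client)] = state


if __name__ == "__main__":
    channel = MulticastChannel()
    nodes = [ServiceNode(n, channel) for n in ("a", "b", "c")]
    nodes[0].report("fs-1", "up")
    for node in nodes:
        print(node.name, node.table)
```

In this sketch a single multicast send brings every node's local table up to date, which mirrors the order-N message count described in the background.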