Grid computing typically networks server computers together into a working group to accomplish a common computer processing task. The processing task is divided into smaller subtasks, and each subtask is assigned to an individual server computer in the group. Each server generally performs its subtask simultaneously with the subtasks being performed by the other servers in the group. Because the servers work on the processing task in parallel, the task is completed in less time than a single server would need.
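The divide-and-assign pattern described above can be sketched as follows. This is a minimal illustration under stated assumptions: the common processing task is a hypothetical one (summing a range of integers), and local worker threads stand in for the server computers of the working group. The names `split_task`, `run_subtask`, and `run_on_grid` are illustrative only. A real grid would dispatch each subtask over the network to a separate machine. Python threads show the structure of the split, not an actual speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def split_task(start, stop, num_servers):
    """Divide the range [start, stop) into num_servers contiguous subtasks."""
    size = stop - start
    bounds = [start + size * k // num_servers for k in range(num_servers + 1)]
    return [(bounds[k], bounds[k + 1]) for k in range(num_servers)]


def run_subtask(bounds):
    """Work performed by one hypothetical 'server': sum its assigned slice."""
    lo, hi = bounds
    return sum(range(lo, hi))


def run_on_grid(start, stop, num_servers):
    """Assign one subtask per worker, run them concurrently, combine results."""
    subtasks = split_task(start, stop, num_servers)
    # Each worker thread stands in for one server computer in the working group.
    with ThreadPoolExecutor(max_workers=num_servers) as pool:
        partial_results = list(pool.map(run_subtask, subtasks))
    # Combine the subtask results into the result of the common task.
    return sum(partial_results)


if __name__ == "__main__":
    print(run_on_grid(0, 1_000_000, 10))
```

Contiguous slices keep the split simple. A grid scheduler could instead size subtasks according to each server's capacity.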
Most grid computing architectures include a centralized server computer, or host facility, that transmits an information request, e.g., a ping, to each server computer in the working group. Each server computer also sends ping messages to its neighbors in the grid. These ping messages assess the health and availability of the individual server computers in the working group, as well as the health of the working group system as a whole. Health generally refers to the proper functioning of a server computer and its ability to perform its expected role. Availability is related to health and refers to the server computer's responsiveness and current state. For instance, availability may reflect whether the server computer's resources are currently dedicated to another processing task, which would inhibit its ability to assume a new one.
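The distinction between health and availability can be illustrated with a small in-process model of the host facility polling its working group. This sketch rests on assumptions that are not drawn from the text. The `Server` class, its `healthy` and `busy` flags, and `poll_working_group` are hypothetical. An unhealthy server is modeled as one that simply does not reply, and no real network ping is sent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Server:
    name: str
    healthy: bool = True   # functioning properly and able to perform its role
    busy: bool = False     # resources currently dedicated to another task

    def respond_to_ping(self) -> Optional[bool]:
        """Reply with availability, or None if the server is unhealthy (no reply)."""
        if not self.healthy:
            return None
        return not self.busy


def poll_working_group(servers):
    """Hypothetical host facility: ping each server once and classify its state."""
    status = {}
    for server in servers:
        reply = server.respond_to_ping()
        if reply is None:
            status[server.name] = "unresponsive"   # health problem
        elif reply:
            status[server.name] = "available"      # healthy and free
        else:
            status[server.name] = "busy"           # healthy but not available
    return status


if __name__ == "__main__":
    group = [Server("s1"), Server("s2", busy=True), Server("s3", healthy=False)]
    print(poll_working_group(group))
    # {'s1': 'available', 's2': 'busy', 's3': 'unresponsive'}
```

A busy server is healthy but unavailable. An unresponsive one cannot be distinguished from a failed one by a single ping, which is why repeated polling is common.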
In the typical grid, or distributed network (hereinafter referred to simply as the grid), each server in the working group communicates with its neighbor. That neighbor forwards the health and availability information on through successive server computers, each of which also forwards its own health and availability information. In a large working group, this cascade leads to thousands of pings being sent between the server computers.
A ping message, or request, is sent to each server in the working group for each other server in the working group. Thus, for a working group of 10 servers implementing grid computing, 9 ping messages must be sent to each server computer (a total of 90 for the entire system) to set up, monitor, and maintain the grid. If a working group contains 1,000 servers, a total of 999,000 (1,000 × (1,000 − 1)) ping messages will be sent to set up, monitor, and maintain the grid. In general, a working group of n servers requires n(n − 1) ping messages, so the count grows quadratically with the size of the group. The generation, sending, and processing of these ping messages represent significant overhead that consumes a substantial share of the processing capacity otherwise available for the common processing task.
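The ping-count arithmetic above can be checked with a short function. This is a minimal sketch that assumes exactly one ping per ordered pair of distinct servers, matching the counting in the text. The function name `ping_messages` is illustrative.

```python
def ping_messages(num_servers: int) -> int:
    """Total pings when each server is pinged once for every other server."""
    if num_servers < 1:
        raise ValueError("a working group needs at least one server")
    # Each of the n servers receives n - 1 pings, one per other server.
    return num_servers * (num_servers - 1)


if __name__ == "__main__":
    for n in (10, 100, 1_000):
        print(f"{n:>5} servers: {ping_messages(n):>9,} pings")
```

Because the count is n(n − 1), doubling the size of the working group roughly quadruples the monitoring traffic.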
A hop is each instance in which the health or availability information for a given server computer must pass through, or be passed along by, another server computer before reaching the centralized management server, or facility. The more hops in a given system, i.e., the more server computers a ping message must pass through, the more overhead is consumed by monitoring and management. This overhead reduces efficiency, because the system's finite resources are diverted from the processing task to monitoring and management. Put simply, the more hops a ping message passes through, the longer failure detection takes and the less responsive the server will be.
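The relationship between hops and detection delay can be sketched under a simplified topology. The following assumptions are not taken from the text. The servers form a single chain, position 0 is adjacent to the centralized management server, and each server relays status only toward the host. Every hop also adds a constant, hypothetical delay `per_hop_delay_ms`. Under these assumptions, the server at position p incurs p hops, and one reporting round totals n(n − 1)/2 hops.

```python
def forwarding_path(position: int) -> list:
    """Servers that relay the status of the server at `position` toward the host.

    Assumes a linear chain in which position 0 talks directly to the host.
    """
    return list(range(position - 1, -1, -1))


def hops_to_host(position: int) -> int:
    """One hop per intermediate server the status must pass through."""
    return len(forwarding_path(position))


def total_hops(num_servers: int) -> int:
    """Sum of hops for every server's status in one reporting round."""
    return sum(hops_to_host(p) for p in range(num_servers))


def worst_case_detection_ms(num_servers: int, per_hop_delay_ms: float) -> float:
    """Hypothetical delay before the host learns of a failure at the chain's far end."""
    return hops_to_host(num_servers - 1) * per_hop_delay_ms


if __name__ == "__main__":
    print(forwarding_path(3))
    print(total_hops(1_000))
    print(worst_case_detection_ms(1_000, 2.0))
```

Real grids rarely form a single chain, but the sketch shows how detection time scales with the number of relaying servers.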
Consequently, there exists a need for an improved grid architecture and associated methodology for more efficiently assessing resource health and availability.