1. Field of the Invention
The present invention provides a distributed method for managing the assignment of tasks to servers in order to efficiently control server loads and provide for failure recovery in the case of server failure.
2. The Background Art
In electronic systems employing servers to perform mission critical tasks, it is often very important that those tasks are performed in the least amount of time. As the number of those tasks increases, it often becomes necessary to provide multiple servers to handle those tasks. In addition, as the number of requests for the various tasks increases, it often becomes necessary to provide additional servers to handle the higher volume of tasks. It is also necessary that the performance of critical tasks be completely assured, even in the case of server failure.
Since a given task may be required to be performed more often than one or more other tasks, the number of servers that are capable of performing that first task may be greater than the number of servers capable of performing those other tasks.
Each server is typically capable of handling many different types of tasks. In order to ensure that the maximum efficiency of a server system is attained, a task manager or gateway is typically provided between a client computer and the one or more servers performing tasks for the client computer.
FIGS. 1A and 1B show an example of operating a typical prior art server system.
Referring to FIG. 1A, prior art system 10 includes client computer 12, gateway computer 14, and servers 16, 18, 20 and 22. In this prior art example, servers 16, 18, 20, and 22 are each configured to perform tasks within overlapping groups. For example, server 16 may be capable of performing tasks A and B. Server 18, in addition to being configured to perform tasks A and B, is configured to perform tasks C, D, and E. Server 20 is configured to perform tasks C, D, and E, in addition to tasks F, G, and H. Finally, server 22 is configured to perform tasks A, B, G, and H.
In this example, tasks A and B can be performed by three of the four servers (servers 16, 18, and 22), indicating that those tasks are either performed with high regularity, or are mission critical and thus must be able to be performed by many different servers in case one or more of those servers fails.
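By way of illustration only, the task assignments described for FIG. 1A may be written out as a simple lookup table. The following is a minimal sketch; the names `CAPABILITIES` and `servers_for` are hypothetical and are used here only to make the overlap between the servers easy to inspect.

```python
# Task sets per server, as described for FIG. 1A.
# Keys are the reference numerals of the servers; the table name is hypothetical.
CAPABILITIES = {
    16: {"A", "B"},
    18: {"A", "B", "C", "D", "E"},
    20: {"C", "D", "E", "F", "G", "H"},
    22: {"A", "B", "G", "H"},
}


def servers_for(task, capabilities=CAPABILITIES):
    """Return the sorted list of servers configured to perform the given task."""
    return sorted(s for s, tasks in capabilities.items() if task in tasks)


if __name__ == "__main__":
    # Print, for each task, the servers configured to perform it.
    for task in "ABCDEFGH":
        print(task, servers_for(task))
```

Under this table, tasks A and B each map to three servers, while a task such as F maps to a single server and therefore has no back-up.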
When a task is required to be performed in a prior art server system, a client computer such as client computer 12 issues a service request to a gateway such as gateway 14. Gateway 14 then chooses a server from a list of available servers contained therein, and assigns the requested service to that server, such as server 20 as seen in FIG. 1B, and server 20 then performs the desired service. Once the service is performed, any data resulting from the performance of that service is passed back to the requesting client computer.
It is important that gateway 14 maintain an accurate list of active servers, so that a task is not assigned to a failed server. In order for gateway 14 to have an accurate list at any given time of servers which are active and thus failure free, gateway 14 performs a verification through a simple communications means such as a ping. As those of ordinary skill in the art are readily aware, a ping is a simple data packet transmitted from a first network object to a second network object, which stimulates a simple response from the second network object telling the first network object that the second network object is active. So long as gateway 14 continues to receive ping responses from a given one of servers 16, 18, 20, and 22, gateway 14 will keep that server on its list of active servers. Failure to receive a predetermined number of consecutive ping responses from a given server, server 18 for example, will result in gateway 14 removing that server from the list of active servers. Future service requests, such as for the performance of task B, that would otherwise have been directed to the failed server 18 would then instead be directed to a back-up server capable of performing that task, such as server 22.
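The ping-based verification described above may be sketched as follows. This is an illustrative sketch only, not the implementation of any particular prior art gateway: the class name `ActiveServerList`, the method names, and the default threshold of three missed responses are all assumptions, and the sketch does not model how a removed server might later be restored to the list.

```python
class ActiveServerList:
    """Hypothetical sketch of a gateway's active-server list driven by ping results."""

    def __init__(self, servers, max_missed=3):
        # max_missed stands in for the "predetermined number" (assumed value).
        self.max_missed = max_missed
        # Consecutive missed ping responses for each server still on the list.
        self.missed = {server: 0 for server in servers}

    def record_ping(self, server, responded):
        """Record one ping attempt; return True if the server remains active."""
        if server not in self.missed:
            return False  # server was already removed from the list
        if responded:
            self.missed[server] = 0  # any response resets the consecutive count
            return True
        self.missed[server] += 1
        if self.missed[server] >= self.max_missed:
            del self.missed[server]  # treat the server as failed
            return False
        return True

    def active(self):
        """Return the servers currently considered active."""
        return sorted(self.missed)
```

In this sketch a single response resets the count, so only an unbroken run of missed responses causes a server to be removed.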
In single processor environments, gateway 14 is configured to include information about actual physical server connections, a mechanism to survey the connections and status of the servers, and an assignment mechanism to assign service requests to particular server connections with a tightly coupled control.
In parallel computing environments, the active server list kept by gateway 14 typically includes detailed information about the location of servers which can perform the various tasks. That detailed information often includes the processor number, slot number, machine number, physical port number, etc. The communications methods employed in these multiprocessor situations are often integrated into the operating system kernel in order to achieve maximum processing efficiency.
In addition to keeping a list of active servers, gateway 14 is also responsible for load balancing. Load balancing is used to spread out data traffic or a computing load across various capable machines. Thus, in the example above with respect to tasks A and B, servers 16, 18, and 22 are all capable of performing those tasks. If a request for task A arrives at gateway 14 from a client computer such as client computer 12, gateway 14 may choose from among servers 16, 18, and 22 for the performance of task A. Thus, if server 16 is busy and perhaps has several other service requests pending which have not yet been performed, but server 18 is either idle or has fewer requests pending, gateway 14 would assign this task A to be performed by server 18 instead of by server 16, in order to balance the computing load. Other lower-level load balancing techniques are known to those of ordinary skill in the art.
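One simple way to express the selection just described is to choose, from among the active servers configured for the requested task, the server having the fewest pending requests. The following is a minimal sketch under that assumption; the function name `choose_server`, its parameters, and the pending-request counts are hypothetical, and the task sets are those given for FIG. 1A.

```python
def choose_server(task, capabilities, active, pending):
    """Pick the active, capable server with the fewest pending requests.

    Ties are broken by the lower server number (an arbitrary, assumed rule).
    """
    candidates = [s for s in active if task in capabilities.get(s, set())]
    if not candidates:
        raise LookupError(f"no active server can perform task {task!r}")
    return min(candidates, key=lambda s: (pending.get(s, 0), s))


if __name__ == "__main__":
    # Task sets as described for FIG. 1A; pending counts are made up for illustration.
    capabilities = {
        16: {"A", "B"},
        18: {"A", "B", "C", "D", "E"},
        20: {"C", "D", "E", "F", "G", "H"},
        22: {"A", "B", "G", "H"},
    }
    pending = {16: 4, 18: 0, 20: 1, 22: 2}
    print(choose_server("A", capabilities, {16, 18, 20, 22}, pending))
    # With server 18 removed from the active list, the choice falls to another server.
    print(choose_server("A", capabilities, {16, 20, 22}, pending))
```

Because the candidates are drawn only from the active list, the same routine also covers fail-over: a server removed from the list is simply never chosen.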
While the prior art systems are useful for their intended purposes, in order for those systems to work, gateway 14 must be tightly coupled to each server, and must know the status of each server on a moment-by-moment basis. It would be beneficial to provide a system for performing task assignment, fail-over, and load balancing which can be more loosely coupled while still operating very efficiently. The present invention provides such a system.