Client-server applications are increasingly being deployed across multiple servers. These servers may or may not reside at different geographical locations and, together, provide the back-end processing power for the specific applications. For example, the servers could support a content delivery network, such as geographically distributed Web cache proxies that cache web pages and respond to requests from client Web browsers. The servers could also be general-purpose computing machines (PCs, workstations, etc.) of a GRID facility deployed on the Internet where each server receives and processes tasks submitted by the GRID client-users. The servers could also be database servers, such as shared-disk or shared-memory parallel database servers or replication database servers. Similarly, peer-to-peer applications are also being deployed across multiple computing machines, with any given peer from among a group of peers processing a request from another peer/servent. (Note that client-server terminology and examples will be used to describe our invention for ease of description. However, it should be understood that our invention is also applicable to other architectures/applications, including peer-to-peer architectures.)
For description purposes, assume there are m servers (numbered 0, 1, . . . , m−1) directed at processing requests/tasks for a particular application and any arbitrary number of clients that may send requests/tasks to these servers. Traditionally, in these multi-server environments, the server among the m servers that initially receives a given client request services that request and sends the result back to the client. However, these multiple server environments are increasingly using request routing in order to service client requests. Under request routing, the server that actually receives a client request will use some scheme to determine another of the m servers and will then forward the request to this determined server for processing. For example, FIG. 1 shows an example network 100 of servers 102-105 and clients 110-115 that send requests/tasks to these servers for processing. Assume client 112 sends a request 120 to server 103 (the choice of server 103 can either be a random or a predetermined choice). Upon receiving this request, server 103 runs a request routing scheme to determine another server to actually service the request, such as server 105. Server 103 then forwards the request 120 from client 112 to this server 105. Server 105 then processes this request and returns the result 121 to server 103, which then forwards the result to client 112.
Administrators use request routing schemes in these multi-server environments for different purposes, such as for routing a request to the server that is more likely to have the content information the client is seeking, routing the request to a server/network based on proximity to the client, routing a request based on bandwidth availability, and routing a request in order to balance load among the servers. The latter use, load balancing, is of particular concern here. More specifically, given a multi-server environment of m servers supporting any number of clients, an increasing use of request routing is to distribute client requests among the servers in order to achieve good performance scalability. Load balancing request routing schemes ensure the load at each of the m servers grows and shrinks uniformly as the client request arrival rates increase and decrease, thereby ensuring overall shorter response times (e.g., web page download time, task completion time), higher throughput, and higher availability to client requests. Nonetheless, load balancing request routing schemes that are scalable as the client request rate and/or number of servers increases and that achieve balanced load among distributed servers are difficult to obtain because of the dynamic nature of the overall system and because of the unpredictable task arrival patterns and task sizes.
Several request routing schemes/methods have been used in the past for load balancing, including: (1) “lowest load”, (2) “two random choices”, (3) “random”, and (4) “round-robin”. The “lowest load” request routing method depends on a server knowing the loads of all servers when a client request is received. Specifically, this method is typically implemented in either a decentralized or centralized fashion. Under the decentralized implementation, any of the m servers can initially receive a client request. When a server receives a request, it determines the server among the group of m servers that currently has the lowest load and then routes/forwards the request to that server for processing. Under the centralized implementation a dispatcher is used. This dispatcher initially receives any given client request and then forwards the request to the server with the currently lowest load.
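As a minimal sketch of the selection step in the “lowest load” method, the following Python fragment picks the server with the smallest known load. The function name, the dictionary of loads, the load values, and the tie-breaking rule are illustrative assumptions and are not taken from any particular implementation; the server numbers reuse those of FIG. 1.

```python
from typing import Dict


def pick_lowest_load(known_loads: Dict[int, int]) -> int:
    """Return the id of the server with the smallest known load.

    Ties are broken by the lowest server id (an assumption made here
    only so that the result is deterministic).
    """
    if not known_loads:
        raise ValueError("no servers to choose from")
    return min(known_loads, key=lambda sid: (known_loads[sid], sid))


# Hypothetical load table held by a dispatcher or by the receiving server.
loads = {102: 7, 103: 4, 104: 9, 105: 2}
target = pick_lowest_load(loads)  # server chosen to process the request
loads[target] += 1                # local bookkeeping after forwarding
```

The same function serves both implementations: under the centralized implementation only the dispatcher holds the load table, while under the decentralized implementation every server holds its own copy.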
Regardless of the implementation, the “lowest load” method optimally distributes load among the servers when the dispatcher/initial server knows the loads of all other servers at the instant it receives and forwards a request. Under these conditions, the lowest load method is able to balance the load among the servers and is scalable, with the overall response time to client requests slowly increasing as the client request rate increases. However, if these ideal conditions are not met and the current load information at the servers is not accurate (i.e., becomes stale), the load balancing becomes less accurate, causing the average response times to client requests to drastically increase.
One method by which load information is disseminated among servers under the “lowest load” method is through a polling method. Here, the dispatcher or each server periodically polls other servers for their current load. Ideally, the polling rate is set very high such that the dispatcher/servers stay current as to the current loads among the other servers. However, polling requires message overhead on the order of O(m) per dispatcher/server. Consequently, as a given network grows and the number of servers m increases, the polling burden at the dispatcher/servers also increases. Hence, there is a tradeoff between a high/adequate polling rate, which increases overhead but keeps the load information current, versus a low polling rate, which reduces overhead but produces stale load information.
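The polling tradeoff can be made concrete with a rough message count. The sketch below is only an illustration: the function name and parameters are hypothetical, and it assumes that every poll costs exactly two messages (one query and one reply), which the text above does not specify.

```python
def polling_messages_per_second(m: int, polls_per_second: float,
                                decentralized: bool = True) -> float:
    """Estimate total polling traffic for m servers.

    Assumption: each poll is one query plus one reply (2 messages).
    Decentralized: each of the m servers polls the other m - 1 servers.
    Centralized: a single dispatcher polls all m servers.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    pollers = m if decentralized else 1
    targets_per_poller = (m - 1) if decentralized else m
    return 2 * pollers * targets_per_poller * polls_per_second


# Per-poller cost grows as O(m); in the decentralized case the
# system-wide total therefore grows roughly as m squared.
small = polling_messages_per_second(4, 1.0)
large = polling_messages_per_second(100, 10.0)
```

Under these assumptions, raising either the number of servers or the polling rate multiplies the traffic, which is the overhead side of the tradeoff described above.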
The piggyback method is an alternative to the polling method and its correspondingly high messaging overhead. Typically, when a server forwards a request to another server for processing, the processing server will return a response to the forwarding server. Under the piggyback method, the processing server also sends its current load to the forwarding server when returning this response. The forwarding server uses this load when processing subsequent client requests. As a result, this method does not suffer from the overhead issues of the polling method. Nonetheless, like above, if load information is not current at each server, the server may forward client requests to another server that is not the least loaded, causing the average response time to client requests to increase.
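The piggyback exchange can be sketched as follows. This is a simplified illustration rather than a definitive implementation: the `Server` class, its fields, and the use of a running count of accepted requests as the "load" are assumptions made for the example, and requests never complete here, so the load only grows.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass
class Server:
    server_id: int
    load: int = 0  # assumed load metric: number of requests accepted
    known_loads: Dict[int, int] = field(default_factory=dict)

    def process(self, request: str) -> Tuple[str, int]:
        """Service a request; return the result with the current load piggybacked."""
        self.load += 1
        return f"result of {request}", self.load

    def forward(self, request: str, target: "Server") -> str:
        """Forward a request and record the load reported in the response."""
        result, reported_load = target.process(request)
        self.known_loads[target.server_id] = reported_load
        return result


# Server 103 forwards a request to server 105 and learns its load for free.
s103, s105 = Server(103), Server(105)
result = s103.forward("request 120", s105)
```

Note that server 103 learns about server 105 only when it forwards to server 105; its view of every other server is unchanged, which is the source of the staleness discussed next.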
More specifically, dissemination of load information under the piggyback method is directly tied to the request rate. An increase in the request rate means that each server receives initial client requests more frequently, which means each server forwards requests more frequently and in turn receives load information more frequently. Hence, if the request rate is too low, load information is not kept current. Somewhat related to this problem, as a given network grows and the number of servers m increases, it becomes more difficult for each server to remain current on all other servers because the requests are more broadly distributed/dispersed. Notably, the dispatcher method may overcome some of these issues, but the dispatcher then becomes a bottleneck and a single point of failure to the system.
The “lowest load” method also suffers from the “flashing crowd” problem, which is directly related to the staleness of load information. In general, assume a given server has a relatively lower load than the other servers. If load information on this server is not being disseminated frequently enough to all other servers, the other servers will consistently determine this server is under-loaded and will all re-direct their requests to this server, causing this server to suddenly become overloaded. The problem then cascades. The remaining servers now sense the next lowest loaded server and again re-direct their requests to it, causing this server to become overloaded. This scenario continues in turn on each of the servers, ultimately defeating the original intent of balancing the load.
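The effect of stale load information can be reproduced with a small deterministic simulation. The sketch below is an illustrative model only: the function name, the snapshot refresh rule, and the assumption that requests never complete are all simplifications introduced here, not details from the description above.

```python
from typing import List


def route_with_snapshot(initial_loads: List[int], num_requests: int,
                        refresh_every: int) -> List[int]:
    """Route requests to the lowest-load server according to a snapshot
    of the loads that is refreshed only every `refresh_every` requests.

    Simplifying assumption: requests never complete, so loads only grow.
    """
    loads = list(initial_loads)
    snapshot = list(loads)
    for i in range(num_requests):
        if i % refresh_every == 0:
            snapshot = list(loads)  # load information becomes current again
        # Lowest load according to the (possibly stale) snapshot;
        # ties are broken by the lowest server index.
        target = min(range(len(loads)), key=lambda s: (snapshot[s], s))
        loads[target] += 1
    return loads


start = [5, 5, 5, 0]                             # the last server looks idle
fresh = route_with_snapshot(start, 12, 1)        # → [7, 7, 7, 6]
stale = route_with_snapshot(start, 12, 12)       # → [5, 5, 5, 12]
cascading = route_with_snapshot(start, 12, 6)    # → [11, 5, 5, 6]
```

With fresh information the twelve requests are spread evenly. With a snapshot that is never refreshed, all twelve land on the server that looked idle, and with a refresh every six requests the crowd moves on to the next apparently idle server, mirroring the cascade described above.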
Turning to the “two random choices” method, here each time a server initially receives a request from a client it selects two other servers at random uniformly among all servers. The initial server then compares the loads of the two randomly selected servers and forwards the request for processing to the server with the lesser load. For example, in FIG. 1 assume client 110 sends a request to server 103. Upon receiving this request, server 103 randomly determines two other servers, such as server 102 and server 104. Server 103 then compares the current load of server 102 to the current load of server 104 and forwards the request to the server with the lesser load. This server then returns the result to server 103, which forwards the result to the client 110.
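The selection step of the “two random choices” method might be sketched as below. The function name, the load table, and its values are hypothetical; the sketch also assumes the receiving server excludes itself from the candidates, as in the FIG. 1 example, and that its view of the loads is accurate.

```python
import random
from typing import Dict, Optional


def pick_two_random_choices(receiver: int, known_loads: Dict[int, int],
                            rng: Optional[random.Random] = None) -> int:
    """Pick two distinct servers other than `receiver` uniformly at random
    and return the one with the lesser known load."""
    chooser = rng if rng is not None else random.Random()
    candidates = [sid for sid in known_loads if sid != receiver]
    if len(candidates) < 2:
        raise ValueError("need at least two other servers")
    first, second = chooser.sample(candidates, 2)
    return first if known_loads[first] <= known_loads[second] else second


# Server 103 receives a request and consults its (assumed) load table.
loads = {102: 7, 103: 4, 104: 9, 105: 2}
target = pick_two_random_choices(103, loads, random.Random(0))
```

With these example loads the most heavily loaded server, 104, is never chosen, because whichever server it is paired with has a lesser load.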
Similar to the “lowest load” method, the “two random choices” method ideally requires each server to know the loads of all other servers (as the two randomly selected servers can be any of the servers) at the instant a request is being forwarded. Assuming these ideal conditions are met, the “two random choices” method performs and scales almost as well as the “lowest load” method, with the overall response time to client requests increasing slowly as the client request rate increases. However, like above, the “two random choices” method in practice uses the piggyback method or the polling method, which requires a message overhead on the order of O(m) per server. As such, the “two random choices” method has the same issues as the “lowest load” method as described above; if the load information is not disseminated often enough among the servers, the information at each server becomes stale and, as a result, the average response time to client requests drastically increases. Accordingly, this method can also suffer from the flashing crowd problem.
Turning to the “random” request routing method, here each time a server initially receives a request from a client it forwards the request to another server chosen at random uniformly among all servers. Because load information is never used, this method avoids all the shortcomings encountered under the “lowest load” and “two random choices” methods in passing load information around. There is no messaging overhead and, as such, no staleness issue. Accordingly, this method does not suffer from the “flashing crowd” problem and is not adversely affected as the number of servers m increases, with the response time to client requests remaining constant.
However, it has been proven as well as experimentally shown that the random request method does not scale well and does not equally spread the load among the m servers. More specifically, as the client request rate increases, some servers become more heavily loaded than others and reach their maximum load capacity earlier than others. As a result, the overall response time to client requests among the m servers increases as the overloaded servers become unavailable or experience delay in processing the requests. As such, assuming the load information under the “lowest load” and “two random choices” methods remains accurate, these two methods perform substantially better than the “random” method.
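The gap between the “random” and “two random choices” methods can be illustrated with a small simulation. This is a simplified model chosen for the example, not the experiments referred to above: requests never complete, load information is perfectly accurate, the receiving server is not excluded from the candidates, and the function names and parameters are hypothetical.

```python
import random


def max_load_random(m: int, n: int, rng: random.Random) -> int:
    """Assign n requests to m servers uniformly at random; return the maximum load."""
    loads = [0] * m
    for _ in range(n):
        loads[rng.randrange(m)] += 1
    return max(loads)


def max_load_two_choices(m: int, n: int, rng: random.Random) -> int:
    """Assign each request to the less loaded of two distinct random servers."""
    loads = [0] * m
    for _ in range(n):
        a, b = rng.sample(range(m), 2)
        target = a if loads[a] <= loads[b] else b
        loads[target] += 1
    return max(loads)


rng = random.Random(0)  # fixed seed so that a run is repeatable
worst_random = max_load_random(10_000, 10_000, rng)
worst_two_choices = max_load_two_choices(10_000, 10_000, rng)
```

In this model the most heavily loaded server under purely random assignment typically carries noticeably more requests than under two random choices, even though the average load per server is identical in both cases.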
Turning to the “round-robin” request routing method, for each request a server initially receives from a client, the server successively forwards the requests in a round-robin fashion to other servers for processing (i.e., the initial server forwards request a to server i, forwards request b to server i+1, forwards request c to server i+2, etc.). This mechanism avoids the use of random number generators to choose a server and again, avoids the downside of having to pass load information among the servers. In general, however, it is commonly known that this method has the same issues as the “random” method with respect to scalability. As the client request rate increases, some servers become more heavily loaded than others causing response times to client requests to rapidly increase. In addition, it is possible for the servers to become synchronized under this method and for each to forward its requests to the same servers in a progressive fashion, thereby causing the “flashing crowd” scenario.
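A minimal sketch of the round-robin forwarding order follows. The function name is hypothetical, and the sketch assumes each server cycles through the other servers in ascending identifier order starting from the lowest, which the description above does not mandate.

```python
from itertools import cycle
from typing import Iterable, Iterator


def round_robin_targets(receiver: int, server_ids: Iterable[int]) -> Iterator[int]:
    """Yield forwarding targets for `receiver` in a fixed cyclic order,
    skipping the receiver itself."""
    others = sorted(sid for sid in server_ids if sid != receiver)
    if not others:
        raise ValueError("no other servers to forward to")
    return cycle(others)


servers = [102, 103, 104, 105]
targets = round_robin_targets(103, servers)
first_five = [next(targets) for _ in range(5)]  # → [102, 104, 105, 102, 104]
```

Because every server in this sketch starts its cycle at the same point, servers that receive requests at the same pace will forward to the same targets in lockstep, which is one way the synchronization described above can arise.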
As indicated, the “lowest load” and “two random choices” methods perform substantially better than the “random” and “round-robin” methods, assuming the load information does not become too stale. There are still other request routing methods that rely on the passing of load information and that have been shown to balance loads well, even when the load information is stale. However, like the “lowest load” and “two random choices” methods, these other methods require substantial load messaging overhead when polling is used. More importantly, these other methods assume that all servers previously know the overall client request arrival rate, which is typically not realistic.
Overall, the prior methods for request routing load balancing have several drawbacks. The “lowest load” and “two random choices” methods perform well and scale as the request rate increases. However, these methods rely on knowing the load of all other servers and that this information remains accurate. The polling method can provide this accuracy, but at the expense of high messaging overhead. The piggyback method overcomes the messaging overhead problem, but does not keep all servers accurate unless the request rate is high. These methods also suffer from the flashing crowd problem. Other methods are less affected by staleness of load information and perform as well as these two methods; however, these other methods rely on all servers knowing the request arrival rate, which is not practical. The “random” and “round-robin” methods do not require the passing of load information and thereby avoid the associated problems, but these methods do not scale well, with performance quickly degrading as the request arrival rate increases.