Clusters of workstations are becoming popular alternatives to large-scale computing systems. For example, a network of workstations can act as a server farm for serving World Wide Web access, with the processing load distributed among the available workstations. A network of shared workstations is an attractive alternative to large-scale computer systems because the processing of jobs can be shared across multiple workstations.
On a network of shared workstations, load balancing is the idea of migrating processing jobs across the network from host workstations with high loads to host workstations with lower loads. The motivation for load balancing is to reduce the average completion time of a processing job and to improve the utilization of the workstations.
Two broad classes of load balancing mechanisms exist: 1) preemptive load balancing and 2) non-preemptive load balancing. Preemptive load balancing mechanisms suspend the processing of a job, move the suspended job to a remote host, and then restart the processing of that job. These mechanisms are very complex because they involve checkpointing (saving the state of a processing job), moving state information, and moving a network transport level connection to another workstation.
Non-preemptive load balancing mechanisms control the distribution of processing jobs for remote execution based on a priori knowledge of the workstations' behavior. In a typical known system using non-preemptive load balancing, workstations on the network are polled to determine their availability to receive processing jobs for remote execution.
In one known system, for example, a central coordinator connected to the network polls the other workstations on the network to determine their availability. Background jobs are off-loaded to available remote workstations until local activity at those remote workstations is detected; upon detecting local activity, background jobs are preempted so that the user can use the workstation. See "Condor - A Hunter of Idle Workstations" by Michael Litzkow, et al., Proceedings of the 8th International Conference on Distributed Computing Systems, June 1988.
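The coordinator-based scheme described above can be sketched roughly as follows. This is an illustrative model only, not the cited system's implementation: the names `Station`, `Coordinator`, `submit`, and `poll` are hypothetical, and returning preempted jobs to the coordinator's pending list is an assumption, since the description above states only that the background jobs are preempted.

```python
from dataclasses import dataclass, field


@dataclass
class Station:
    name: str
    owner_active: bool = False          # True when local activity is detected
    background_jobs: list = field(default_factory=list)


class Coordinator:
    """Hypothetical central coordinator that polls every station."""

    def __init__(self, stations):
        self.stations = stations
        self.pending = []               # background jobs awaiting an idle station

    def submit(self, job):
        self.pending.append(job)

    def poll(self):
        # Pass 1: preempt background jobs wherever the owner is active.
        # Assumption: preempted jobs are requeued at the coordinator.
        for station in self.stations:
            if station.owner_active and station.background_jobs:
                self.pending.extend(station.background_jobs)
                station.background_jobs.clear()
        # Pass 2: off-load one pending job to each idle, unoccupied station.
        for station in self.stations:
            idle = not station.owner_active and not station.background_jobs
            if idle and self.pending:
                station.background_jobs.append(self.pending.pop(0))
```

Every polling round touches every station, so the coordinator's work per round grows with the size of the network.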
In another known system, each workstation on a network determines the availability of other workstations. First, each workstation on the network locally determines if the number of jobs waiting for execution exceeds a predetermined threshold. If the number of waiting jobs is below the threshold, then the jobs are processed locally. If the number of waiting jobs exceeds the threshold, then the local workstation randomly probes a remote workstation in the network to determine whether the number of processing jobs already in service or waiting for service at that remote workstation is less than some threshold value. Probing the remote workstation is performed at the time the processing job is ready for processing; consequently, the processing job waits for processing until this probing for an available remote workstation is completed. If the probed workstation is available, then the processing job is transferred to that workstation regardless of the state of that workstation when the job arrives. If the probed workstation is not available, then additional workstations are probed until an available workstation is found or until a timeout occurs. In the latter case, the local workstation must process the job. See "Adaptive Load Sharing in Homogeneous Distributed Systems" by Derek Eager, et al., IEEE Transactions on Software Engineering, Vol. 12, No. 5, May 1986.
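The threshold-and-probe policy described above can be sketched as follows. This is a minimal model under stated assumptions, not the cited paper's code: `Workstation`, `place_job`, and `probe_limit` are hypothetical names, the timeout is approximated by a cap on the number of probes, and a queue length exactly equal to the threshold is treated as "process locally" because the description leaves that case open.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Workstation:
    name: str
    threshold: int                      # hypothetical queue-length threshold
    queue: list = field(default_factory=list)

    def is_available(self):
        # Available when jobs in service or waiting number fewer than the threshold.
        return len(self.queue) < self.threshold


def place_job(job, local, remotes, probe_limit=3, rng=random):
    """Place one job and return the workstation that received it."""
    # Assumption: a queue at or below the threshold keeps the job local.
    if len(local.queue) <= local.threshold:
        local.queue.append(job)
        return local

    # Probe remote workstations in random order; the job waits meanwhile.
    candidates = list(remotes)
    rng.shuffle(candidates)
    for target in candidates[:probe_limit]:   # probe cap stands in for the timeout
        if target.is_available():
            # Transfer on the probe result alone; the target's state may
            # have changed by the time the job actually arrives.
            target.queue.append(job)
            return target

    # No available workstation found before the cap: process locally.
    local.queue.append(job)
    return local
```

Note that the probe loop runs while the job is ready to execute, so every unsuccessful probe adds directly to the job's delay.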
These known systems, however, suffer from several shortcomings. First, these systems do not scale well as the number of workstations on the network increases. The amount of information exchanged among the workstations grows with the number of workstations available for load balancing. As a result, the overall performance of the network does not improve as the number of workstations available on the network increases.
Second, these known systems suffer from latency problems associated with seeking the state information prior to moving a processing job and from the inefficient use of the communication channel connecting the processors. In other words, while the state information of a remote processor station is being sought, the processing job is waiting to be processed. This waiting causes increased delay before the processing job can be processed.
Additionally, because the sender or receiver in these systems initiates the transfer of a processing job for remote execution, a substantial period of time can elapse between when the local workstation receives state information about the target workstation and when the target workstation actually receives the processing job for remote execution. During this elapsed period of time, a previously available target workstation can suddenly become unavailable due to a large number of queued processing jobs recently received for remote execution. Any late-arriving processing jobs received at a suddenly busy target workstation must be redistributed to other workstations that are more available. This process can result in multiple transfers of processing jobs to previously available target workstations that, in fact, are not available for remotely executing processing jobs.
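The stale-state problem described above can be illustrated with a small deterministic model. This is a sketch only: the function `count_misdirected` and its parameters are hypothetical, and it assumes that all senders probe the same target before any of their jobs arrives.

```python
def count_misdirected(num_senders, target_queue_len, threshold):
    """Count jobs that reach a target which is no longer available."""
    # Phase 1: every sender probes the target before any job arrives,
    # so all senders observe the same (soon to be stale) queue length.
    saw_available = [target_queue_len < threshold for _ in range(num_senders)]

    # Phase 2: the jobs arrive one after another.
    misdirected = 0
    for available in saw_available:
        if not available:
            continue                    # this sender never transferred its job
        if target_queue_len >= threshold:
            misdirected += 1            # target filled up; job must be moved again
        else:
            target_queue_len += 1       # target accepts the job
    return misdirected
```

For example, with four senders, an empty target, and a threshold of two, the first two jobs are accepted and the remaining two arrive at a target that is already busy, so `count_misdirected(4, 0, 2)` returns 2.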