Many web sites and other applications employ a set of servers to serve web pages and files and to run other applications. Some of the servers may duplicate the information or functions of other servers, allowing more users to access the server farm simultaneously than a single server could support. A load balancer may be coupled between the set of servers and the clients that request service from them, so that requests for service made to a single address are spread across the multiple servers in the server farm.
As the files and applications on the servers become out of date, they may be updated. Automated update facilities may be used to update the set of servers. One way to update the servers is to first instruct the load balancer to stop accepting new connections and, after a period of time, to instruct it to stop serving existing connections as well, which brings the servers “off-line”. The servers may then be updated using an automated facility and, once updated, brought back on-line by instructing the load balancer to resume directing connections to them.
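The off-line update sequence described above can be sketched as follows. This is a minimal illustration only: the `LoadBalancer` class is a hypothetical in-memory stand-in for a load balancer's control interface, and `update_offline`, `apply_update`, and `drain_seconds` are names assumed for the example rather than parts of any particular product.

```python
import time
from dataclasses import dataclass, field


@dataclass
class LoadBalancer:
    """Hypothetical in-memory stand-in for a load balancer's control interface."""
    accepting: set = field(default_factory=set)  # servers receiving new connections
    serving: set = field(default_factory=set)    # servers serving existing connections

    def add(self, server):
        self.accepting.add(server)
        self.serving.add(server)

    def stop_new_connections(self, server):
        self.accepting.discard(server)

    def stop_existing_connections(self, server):
        self.serving.discard(server)

    def resume(self, server):
        self.add(server)


def update_offline(lb, servers, apply_update, drain_seconds=0.0):
    """Take the servers off-line in two steps, update them, then restore them."""
    for server in servers:
        lb.stop_new_connections(server)       # step 1: no new connections
    time.sleep(drain_seconds)                 # let existing connections finish
    for server in servers:
        lb.stop_existing_connections(server)  # step 2: servers are now off-line
    for server in servers:
        apply_update(server)                  # assumed automated update facility
    for server in servers:
        lb.resume(server)                     # bring the servers back on-line
```

Note that if `servers` is the whole set, no server is available between the second step and the final loop, which is the outage discussed next.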
Automated updating facilities may be used if all the servers are updated at the same time. However, servers that are all updated together are unavailable to perform any work during the update process, and such updates can be the cause of regularly scheduled “outages” of web sites, inconveniencing users. Although it is theoretically possible to update the servers manually, a portion at a time, doing so would require a human operator to step the automated updating facilities through the servers one at a time, which, for a large set of servers, could require hours of costly operator time and be prone to error. Other arrangements could be used to speed the process, such as manually bringing the servers off-line in groups, updating each group and then bringing it back on-line. Such arrangements, however, are more complex and prone to error, and they may still bring down the entire site if the servers are assigned to specialized functions and the group the operator brings off-line for updating contains all the servers that perform a certain function.
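The hazard of a group that contains every server performing a certain function can be checked before the group is taken off-line. The sketch below is illustrative only; the function name and the `assignments` mapping from server name to a set of function names are assumptions made for the example.

```python
def functions_left_unserved(assignments, online, group):
    """Return the functions that would have no on-line server if `group` went off-line.

    assignments: dict mapping server name -> set of function names (assumed layout)
    online:      set of servers currently on-line
    group:       servers proposed for the next update group
    """
    group = set(group)
    # Functions still covered by servers that stay on-line.
    still_served = set()
    for server in online - group:
        still_served |= assignments.get(server, set())
    # Functions currently provided by the servers about to go off-line.
    affected = set()
    for server in group & online:
        affected |= assignments.get(server, set())
    return affected - still_served
```

An empty result means every function performed by the group is also performed by at least one server that remains on-line.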
There is another problem with any of these approaches. It is desirable to avoid bringing on-line, after an update, servers that will not operate properly. For example, servers that were off-line before the update due to malfunctions, or servers on which the update could not be installed properly, should not be brought on-line after the update. If such servers are left off-line, other servers that perform their functions will be used instead.
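The two conditions named above, being off-line before the update and failing to install the update, can be expressed as a simple filter. This is a sketch under stated assumptions: the function name and the `offline_before` and `update_succeeded` inputs are hypothetical and would be supplied by whatever facility tracks server state and update results.

```python
def servers_to_restore(group, offline_before, update_succeeded):
    """Return the servers in `group` that may be brought back on-line.

    offline_before:   set of servers that were already off-line before the update
    update_succeeded: dict mapping server name -> True if the update installed properly
    """
    return [
        server
        for server in group
        if server not in offline_before          # was not already malfunctioning
        and update_succeeded.get(server, False)  # unknown result is treated as failure
    ]
```

Treating a missing update result as a failure is a conservative choice for the example; a server is left off-line unless there is positive evidence that it was updated properly.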
What is needed is a system and method that can update a set of servers in an automated fashion, without bringing down the entire set of servers and without returning to service any servers that may not operate properly.