As computer systems become increasingly distributed, the need for coordination increases. Distributed systems come in many forms. In some, a group of structurally identical or at least similar physical and/or virtualized processing systems perform essentially independent tasks, but may all benefit from coordination of, for example, software updates. In others, the various processing systems perform independent parts of a single task, and in still others, such as distributed storage systems, the different systems appear at a functional level as a single entity. In some of these systems coordination is necessary, and in most of them it is at least advantageous.
One obvious way to coordinate, for example, software installations or updates among different systems is simply to stop their processing, individually or as a group, perform the installation/update, and then restart their processing. This procedure often disrupts availability, however, for longer than users prefer or can tolerate. As just one example, virtualized systems running VMware virtual machines (VMs) do not perform collective upgrades; rather, the recommended procedure is to put a host into a “maintenance mode” before upgrading it, which means migrating the load off of the affected host, upgrading that host, migrating the VMs from the next host onto the upgraded host, upgrading the next host, and so forth. The vCenter management software must therefore be able to handle hosts that are running different software versions.
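The host-by-host procedure just described can be illustrated with a minimal sketch. It is an illustration under stated assumptions only, not the VMware implementation or any vendor's API: the `Host` class, the `rolling_upgrade` function, and the version strings are hypothetical, and "migration" is modeled simply as moving names between lists. The sketch records which software versions coexist after each step, which shows why the management software must tolerate mixed versions until the last host is upgraded.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Host:
    """Hypothetical model of a host: a name, a software version, and its VMs."""
    name: str
    version: str
    vms: List[str] = field(default_factory=list)


def rolling_upgrade(hosts: List[Host], new_version: str) -> List[List[str]]:
    """Upgrade hosts one at a time, emptying each host before upgrading it.

    Returns, for each step, the sorted list of distinct versions that are
    running somewhere in the group once that step has completed.
    """
    if len(hosts) < 2:
        raise ValueError("a rolling upgrade needs at least two hosts")

    version_mixes: List[List[str]] = []
    spare: Optional[Host] = None  # most recently upgraded (and emptied) host

    for host in hosts:
        if spare is None:
            # First host: spread its load over the remaining hosts.
            others = [h for h in hosts if h is not host]
            for n, vm in enumerate(host.vms):
                others[n % len(others)].vms.append(vm)
        else:
            # Later hosts: move the load onto the previously upgraded host.
            spare.vms.extend(host.vms)
        host.vms = []                # host is now empty ("maintenance mode")
        host.version = new_version   # assumed stand-in for the actual upgrade
        spare = host
        version_mixes.append(sorted({h.version for h in hosts}))

    return version_mixes
```

In this sketch, every step except the last leaves two versions running side by side, and the final host ends up empty, so a further rebalancing step (not shown) would be needed to restore the original load distribution.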
The need for efficient coordination is particularly acute in distributed storage systems, since each time a member device is taken offline for a software change, the system as a whole may become unusable. In such distributed storage systems, not only data sets (defined in the broadest sense as any related collection of digital information, including both executable and non-executable data) as a whole but even different portions of single data sets may be stored on different devices, for example, as RAID stripes. Indeed, even unsophisticated users nowadays interact with storage systems in the “cloud”, such that they may have no idea on which continent(s), much less on which server(s) or disk(s), their data resides. In such systems, there is typically some form of main, or “host”, server, which is responsible for coordinating the read/write tasks directed to controllers in the various storage devices/systems. Efficient coordination of software changes on different member devices in such a distributed storage system presents various challenges:
1) Existing tools for managing host software and configuration (for example, Puppet) do not also manage storage appliance software; different tools are therefore often required for the host and controller sides, which complicates software version management.
2) The diversity of tools for hosts and controllers means that upgrades are not easily coordinated across all the hosts and controllers. The nodes in the system may therefore not, in general, be running the same version of software at the same time. This means in turn that the system builder is faced with two choices: either ensure that the different versions of the software interoperate, which adds significant complexity and software development expense, or shut the system down until all nodes are upgraded to the same release and then restart the system. As mentioned above, however, such an upgrade can take an unacceptably, or at best undesirably, long time and cause a significant outage.
3) Upgrading host-side storage software may require the host itself to be rebooted, which in and of itself causes an outage. Some virtualized server environments (such as VMware) address the problem by sequentially putting hosts into a “maintenance mode”. This is disruptive and slow, however, and will generally make upgrading storage software in this sort of distributed system more complex and onerous than upgrading only an independent storage controller that does not rely on software running on the hosts.
What is needed is therefore a mechanism and method that allow software upgrades (defined as including installations, updates and other changes to or replacements of existing installations, etc.) to be carried out more efficiently on the different members of a distributed system, of which storage systems are but one example.