Distributing data storage over different storage servers in a computer system or computer network has many advantages. Increased capacity is a clear one: because the available storage need not fit into a single device, the storage space can be much larger. Another advantage is flexibility. System designers can place storage devices substantially anywhere, in any configuration, with any combination of technologies, as long as some form of network is available to connect the storage servers with the entities that need to access them, either directly or through some form of proxies.
The advantages of distributed storage systems, however, are accompanied by challenges, two of which are coordination and latency. Assume, for example, that two different entities need to write different data to the same portion of the available address space at almost the same time. It then becomes problematic to determine which set of write data should be present in storage after both write operations have completed, especially where differences in network latency cause each entity's write request to take a different amount of time to arrive.
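The race described above can be illustrated with a minimal sketch. It assumes a storage server that simply applies writes in the order they arrive; the `WriteRequest` type, the timings, and the payloads are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class WriteRequest:
    writer: str
    sent_at_ms: float   # when the writer issued the request
    latency_ms: float   # hypothetical one-way network delay
    data: bytes

    @property
    def arrives_at_ms(self) -> float:
        return self.sent_at_ms + self.latency_ms


def apply_in_arrival_order(requests):
    """Apply writes to a single address in arrival order; the last arrival wins."""
    stored = None
    for req in sorted(requests, key=lambda r: r.arrives_at_ms):
        stored = req.data
    return stored


# Writer A issues its request first but sits behind a slower link.
a = WriteRequest("A", sent_at_ms=0.0, latency_ms=5.0, data=b"from-A")
b = WriteRequest("B", sent_at_ms=1.0, latency_ms=1.0, data=b"from-B")

final_value = apply_in_arrival_order([a, b])
print(final_value)
```

Here B's later write arrives first and is then overwritten by A's earlier write, so the stored result reflects network delay rather than the order in which the writes were issued.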
One known answer to this race condition involves locks. In lock-based systems, whichever entity obtains and holds a lock is allowed to complete its write operation to a given resource, while other entities must wait. Although this resolves write conflicts, it does so at the cost of efficiency: contested locks add latency and, furthermore, require a lock server and a lock manager. Moreover, lock managers themselves scale poorly, inasmuch as the prevalence of false conflicts must be balanced against the granularity of the lock. Coarse locks make unrelated writes contend with each other, whereas fine-grained locks increase the number of locks that must be managed. Lock management becomes even more complex in distributed systems.
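The trade-off between lock granularity and false conflicts can be sketched as follows. This is an illustrative model only: it assumes one lock per fixed-size region of the address space, and the helper names, offsets, and sizes are hypothetical.

```python
def locks_needed(start, length, granularity):
    """Return the set of region locks covering the bytes [start, start + length)."""
    first = start // granularity
    last = (start + length - 1) // granularity
    return set(range(first, last + 1))


def contend(write_a, write_b, granularity):
    """True if the two writes need at least one lock in common."""
    return bool(locks_needed(*write_a, granularity) & locks_needed(*write_b, granularity))


def bytes_overlap(write_a, write_b):
    """True if the two writes actually touch a common byte."""
    (start_a, len_a), (start_b, len_b) = write_a, write_b
    return start_a < start_b + len_b and start_b < start_a + len_a


MIB = 1 << 20
VOLUME = 1 << 30              # hypothetical 1 GiB volume

write_a = (0, 512)            # (offset, length) in bytes
write_b = (4096, 512)

for granularity in (MIB, 4096):
    false_conflict = contend(write_a, write_b, granularity) and not bytes_overlap(write_a, write_b)
    print(granularity, VOLUME // granularity, false_conflict)
```

With 1 MiB regions the two disjoint writes contend for the same lock (a false conflict), but only 1024 locks cover the volume. With 4 KiB regions the false conflict disappears, at the cost of 262144 locks to manage.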
In a similar vein, some systems require writing entities to obtain tokens in order to write, which then requires relatively complicated mechanisms for distributing the tokens, for arbitrating which entity gets which token, and for determining the tokens' order of priority. In some other systems, a supervisory system issues a transaction ID for each I/O operation, but this arrangement often adds significant delay.
Still other known solutions synchronize writes relative to logical clocks, such as a vector clock or a Lamport timestamp, so as to be able to detect causality violations. The main disadvantages of such systems are that they not only require communication across storage clusters, but also usually entail coordinating signals over several round trips. Such virtual clocks have high overhead when a total order is desired, as they require many more messages. Furthermore, even total-order versions of virtual clocks can incur causality violations when writing entities communicate directly with each other, without passing these communications, in particular I/O requests, through the higher-level storage system controllers.
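For reference, the standard Lamport timestamp rule can be sketched in a few lines. This is a generic textbook illustration, not the mechanism of any particular storage system; the class and node names are hypothetical.

```python
class LamportClock:
    """Minimal Lamport logical clock for one node."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.time = 0

    def tick(self):
        """Advance the clock for a local event."""
        self.time += 1
        return self.time

    def send(self):
        """Advance the clock and return the timestamp to attach to a message."""
        return self.tick()

    def receive(self, remote_time):
        """Merge a received timestamp: jump past whichever clock is ahead."""
        self.time = max(self.time, remote_time) + 1
        return self.time

    def stamp(self):
        """(time, node_id) pair; the node id breaks ties to give a total order."""
        return (self.time, self.node_id)


a = LamportClock("a")
b = LamportClock("b")

a.tick()                 # local event on a: a.time == 1
msg_time = a.send()      # a.time == 2, carried in the message
b.receive(msg_time)      # b.time == max(0, 2) + 1 == 3

print(a.stamp(), b.stamp())
```

Note that the clocks advance consistently only because the message carries `msg_time` from one node to the other; every ordering relationship between nodes therefore costs at least one such message.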
Partial-order versions of arrangements based on virtual clocks are less onerous in terms of overhead but are more likely to expose causality violations, because virtual clocks preserve causality only internal to the storage system. If the applications using the storage system can also communicate “out of band”, for example, using sockets, time, or another independent storage cluster, the virtual clocks will be unaware of some of the causality relationships, and the storage system can appear to violate causality.
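The out-of-band problem can be made concrete with a small sketch. It assumes, purely for illustration, that each storage node stamps writes from its own logical clock and that the system resolves conflicts by keeping the write with the highest stamp; the `StorageNode` class, the starting clock values, and the payloads are hypothetical.

```python
class StorageNode:
    """Storage node that stamps incoming writes from its own logical clock."""

    def __init__(self, name, start_time):
        self.name = name
        self.time = start_time

    def stamp_write(self, data):
        self.time += 1
        return (self.time, self.name, data)


# The two nodes have not exchanged messages recently, so their clocks differ.
n1 = StorageNode("n1", start_time=10)
n2 = StorageNode("n2", start_time=2)

# Application A writes through n1.
write_a = n1.stamp_write(b"A: step 1")

# A then notifies application B directly (for example, over a socket).
# No storage node sees this message, so no clock is updated.

# B, having heard from A, writes to the same address through n2.
write_b = n2.stamp_write(b"B: step 2")

# Assumed conflict rule: the write with the highest (time, node) stamp wins.
winner = max([write_a, write_b])
print(write_a[:2], write_b[:2], winner[2])
```

B's write causally follows A's, yet it receives the lower stamp (3 versus 11), so A's data survives. From the applications' point of view, the storage system appears to have reordered causally related writes.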