Distributed shared memory (DSM) provides an abstraction that allows users to view the physically distributed memory of a distributed system as a single virtual shared address space. DSM is convenient for programmers of distributed applications: it reduces or eliminates the need to be aware of the distributed architecture of the system and the need to communicate through message passing, which is a less intuitive form of communication. DSM also provides a means to port software written for non-distributed systems directly to distributed systems.
There are many DSM algorithms and technologies, all of which share a fundamental architecture: distributed agents, deployed on a plurality of clustered nodes, maintain local data structures and memory segments and use a communication protocol over a message passing layer to coordinate operations. Message traffic should be minimized for a given workload, and memory coherency should be maintained.
Users of a file system may need a transactional interface and method of operation for operating on files. Fundamentally, users may require that multiple updates applied to multiple segments within multiple files be associated with a single transaction, such that either all the updates within the transaction are applied to the files or none of them are. Further requirements may include the following. It should be possible to roll back an ongoing transaction by restoring the files on which the transaction operated to the state preceding the beginning of the transaction. Once the file system confirms that a transaction has been committed, the operations of the transaction are guaranteed to be durable and to apply to the relevant files regardless of any fault that may occur after that confirmation. If a fault occurs before a transaction is confirmed by the file system, it is guaranteed that no operations related to this transaction are applied to the relevant files, and the state is restored to the point after the last confirmed transaction. Furthermore, transactions are initiated concurrently by multiple users and should be processed by the file system as concurrently as possible. Specifically, transactions that update disjoint portions of the file system should be processed concurrently, while transactions that update shared portions should be serialized. Moreover, users performing read-only operations should be allowed to access the file system concurrently, while users performing transactions should be mutually exclusive and serialized with all other users that access the file system portions affected by these transactions. In short, all transactions should be isolated, in the sense that no operation external to a transaction can view the data in an intermediate state.
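The all-or-nothing, roll-back and disjointness requirements above can be illustrated with a minimal in-memory sketch. Everything in it is hypothetical and introduced only for illustration (the names `TransactionalStore`, `Transaction` and `conflicts`, and the choice to buffer updates until commit); it does not model durability or fault recovery and is not the method of any particular file system.

```python
class TransactionalStore:
    """Toy in-memory stand-in for a file system: file name -> bytearray."""

    def __init__(self, files):
        self.files = {name: bytearray(data) for name, data in files.items()}

    def begin(self):
        return Transaction(self)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.updates = []  # buffered (file name, offset, data) segment updates

    def write(self, name, offset, data):
        # Buffer only: nothing is visible outside the transaction until commit.
        self.updates.append((name, offset, bytes(data)))

    def rollback(self):
        # Dropping the buffer leaves every file in its pre-transaction state.
        self.updates.clear()

    def touched(self):
        # Byte ranges [start, end) this transaction intends to update.
        return [(n, off, off + len(d)) for n, off, d in self.updates]

    def commit(self):
        # Validate every update first, so one bad update rejects the whole set.
        for name, offset, data in self.updates:
            if name not in self.store.files:
                raise KeyError(name)
            if offset < 0 or offset + len(data) > len(self.store.files[name]):
                raise ValueError("update outside file bounds")
        # Only after all checks pass are the updates applied.
        for name, offset, data in self.updates:
            self.store.files[name][offset:offset + len(data)] = data
        self.updates.clear()


def conflicts(tx_a, tx_b):
    """True if the two transactions update overlapping ranges of one file."""
    for name_a, start_a, end_a in tx_a.touched():
        for name_b, start_b, end_b in tx_b.touched():
            if name_a == name_b and start_a < end_b and start_b < end_a:
                return True
    return False
```

In this sketch, a scheduler could run two transactions concurrently when `conflicts` returns `False` and serialize them otherwise; a real system would additionally need locking and a persistent log to meet the durability requirement.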
Existing file systems generally do not support these requirements. Known systems include journaling file systems, in which journal-based transaction processing is applied to file system operations. Such file systems maintain a journal of the updates they intend to apply to their disk structures and periodically apply these updates, via a checkpoint process, to the actual disk structures. After a system fault, recovery involves scanning the journal and selectively replaying updates until the file system is consistent. However, in journaling file systems, the operations to which transactional consistency is applied are file system operations defined according to the file system logic, rather than user-oriented operations applied to the file system. In other words, transactional processing in such file systems protects the atomicity, consistency, isolation and durability of file system operations, rather than of user operations, which are more complex.
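The journal, checkpoint and replay cycle described above can be sketched as follows. This is a simplified in-memory model under stated assumptions: the names `JournaledDisk`, `log_operation`, `checkpoint`, `recover` and `replay` are hypothetical, and the use of an explicit commit record to decide which updates are replayed is one common approach assumed here, not something the text specifies.

```python
def replay(journal, disk):
    """Apply journaled updates that belong to operations with a commit record."""
    committed = {record[1] for record in journal if record[0] == "commit"}
    for record in journal:
        if record[0] == "update" and record[1] in committed:
            _, _, block, value = record
            disk[block] = value


class JournaledDisk:
    def __init__(self, blocks):
        self.disk = dict(blocks)  # stands in for the on-disk structures
        self.journal = []         # stands in for the persistent journal

    def log_operation(self, op_id, block_updates):
        # Record the intended updates first, then mark the operation committed.
        for block, value in block_updates.items():
            self.journal.append(("update", op_id, block, value))
        self.journal.append(("commit", op_id))

    def checkpoint(self):
        # Periodically apply journaled updates to the disk structures.
        replay(self.journal, self.disk)
        self.journal.clear()

    def recover(self):
        # After a fault: scan the journal and replay selectively, skipping
        # updates of operations that never reached their commit record.
        replay(self.journal, self.disk)
        self.journal.clear()
```

Note that each `op_id` here corresponds to one file system operation (for example, the metadata and data block updates behind a single user write), which is exactly the granularity the passage contrasts with user-defined transactions.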
Journaling file systems typically define a single write or update operation issued by a user as a transaction. Such an operation generally involves several internal update operations on file system metadata structures and user data structures. A fault (such as a power failure or an unrecoverable system fault) occurring while these internal operations are processed can leave the file system in an invalid intermediate state. Grouping these internal operations into a transaction enables the file system to maintain its consistency, despite possible failures during processing, relative to individual user operations on the file system. However, the requirement to treat several user operations, defined and grouped by the user logic, as a single atomic transaction, and the further requirements that facilitate transaction processing of user-oriented operations, remain unanswered in existing file systems. Some journaling file systems group several operations within a transaction, but this is done according to the file system logic and mechanisms, without consideration of user logic. Journaling file systems also differ in the type of information written to the journal, which may be blocks of metadata and user data as they appear after the updates, or alternatively some other compact description of the updates.
Note that in non-journaled file systems, detecting and recovering from inconsistencies caused by faults during processing requires a complete scan of the file system data structures, which may take a long time. In both journaled and non-journaled file systems, users are blocked until the recovery process completes.
In clustered (a.k.a. shared disk) file systems, which provide multiple clustered computers with concurrent read and write access to files stored on shared external storage devices, transaction processing and consistency must be implemented across the cluster, which is more challenging. For example, a clustered file system should typically support an on-line recovery process, in which an operational computer in the cluster restores the consistency of the file system, during normal work in the cluster, after other computers in the cluster have failed.