Parallel storage systems provide a high degree of concurrency, allowing many distributed processes within a distributed application to access a shared file namespace simultaneously. Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations. Due to their tightly coupled nature, many of these distributed applications perform bulk synchronous input/output (IO) operations in which they alternate between compute phases and state capture phases. Typically, the state capture phase comprises bulk synchronous state storage in which all processes call a barrier operation (i.e., a fence) and perform their state storage synchronously. In this manner, no outstanding messages are being processed during the state capture phase that might cause inconsistencies in the distributed state capture.
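By way of illustration only, the alternation between compute phases and barrier-fenced state capture phases can be sketched as follows. This is a minimal sketch that assumes threads stand in for distributed processes and an in-memory dictionary stands in for the shared storage system; the names `run_bulk_synchronous`, `worker`, and `checkpoints` are hypothetical and do not come from any particular application.

```python
import threading


def run_bulk_synchronous(num_procs=4, num_steps=3):
    """Toy model: each worker alternates a compute phase with a fenced state capture."""
    barrier = threading.Barrier(num_procs)
    # Assumption: one in-memory snapshot per step stands in for the shared storage.
    checkpoints = [dict() for _ in range(num_steps)]
    lock = threading.Lock()

    def worker(rank):
        state = 0
        for step in range(num_steps):
            state += rank + 1            # compute phase (placeholder for real work)
            barrier.wait()               # fence: all workers have finished computing
            with lock:
                checkpoints[step][rank] = state   # state capture phase
            barrier.wait()               # no worker resumes until all state is stored

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_procs)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return checkpoints
```

Because every worker waits at the first barrier before storing its state, each snapshot in `checkpoints` reflects the same step for all workers, which is the consistency property the bulk synchronous model relies on.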
Unfortunately, the synchronous nature of the distributed state capture creates several problems. For example, the storage system must support the full bandwidth of all of the distributed processes for short bursts of time and is otherwise idle. In addition, the computational resources on which the distributed processes execute will be unnecessarily idle when fast processes wait at the barrier for slower processes. Thus, the bulk synchronous IO model for distributed state capture causes inefficient use of both compute servers and storage servers.
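The two inefficiencies described above can be made concrete with simple arithmetic. The following is a sketch under stated assumptions: the function names and the example figures are hypothetical, and the "average" figure is merely the bandwidth that would suffice if the same data were written evenly over a whole compute-and-capture cycle.

```python
def barrier_idle_time(compute_times):
    """Seconds each process spends waiting at the barrier for the slowest process."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


def bandwidth_requirements(bytes_per_proc, num_procs, capture_seconds, cycle_seconds):
    """Return (peak, average) bandwidth in bytes per second.

    peak:    all processes write within the short capture window
    average: the same bytes spread evenly over the full cycle (an assumption)
    """
    total_bytes = bytes_per_proc * num_procs
    peak = total_bytes / capture_seconds
    average = total_bytes / cycle_seconds
    return peak, average
```

For example, with hypothetical compute times of 10, 12, and 15 seconds, `barrier_idle_time([10, 12, 15])` yields `[5, 3, 0]`: the two faster processes sit idle for 5 and 3 seconds, respectively. Likewise, `bandwidth_requirements(1_000_000, 4, 2, 20)` shows a peak requirement ten times the average, since the capture window is one tenth of the cycle.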
A number of asynchronous checkpoint techniques have been proposed to reduce the overall application runtime and lessen the peak bandwidth requirement of the storage system. Message-logging-based asynchronous checkpoint techniques require the logging of all messages, because the checkpoints do not correspond to a synchronous moment in the state of the distributed data structure. The complete state is then reconstructed from the asynchronous checkpoints and the logged messages. Transaction-based asynchronous checkpoint systems, by contrast, employ coordination within the distributed storage system to ensure a consistent checkpoint data set.
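The reconstruction of state from an asynchronous checkpoint and logged messages can be illustrated with a toy model. This is a sketch only and does not represent any specific message logging protocol: it assumes that each message is a hypothetical `(key, delta)` update to a dictionary, and that the checkpoint records how many logged messages it already reflects.

```python
def apply_message(state, message):
    """Apply one logged update message, given as (key, delta), to a copy of the state."""
    key, delta = message
    new_state = dict(state)
    new_state[key] = new_state.get(key, 0) + delta
    return new_state


def recover(checkpoint_state, checkpoint_index, message_log):
    """Rebuild the complete state from a checkpoint plus the messages logged after it.

    checkpoint_index is the number of logged messages already reflected in the
    checkpoint (an assumption of this toy model).
    """
    state = dict(checkpoint_state)
    for message in message_log[checkpoint_index:]:
        state = apply_message(state, message)
    return state


# Hypothetical log: the checkpoint was taken after the first message only.
message_log = [("x", 1), ("y", 2), ("x", 3)]
recovered = recover({"x": 1}, 1, message_log)
```

Here the checkpoint alone is incomplete, and replaying the two later messages from the log yields the full state `{"x": 4, "y": 2}`.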
Due to the high costs of conversion, however, customers are reluctant to transform existing distributed applications to employ asynchronous modifications of shared data objects by the various processes within the distributed application. A need therefore exists for methods and apparatus for simulating asynchronous modifications of shared data objects by a number of distributed processes within a distributed application, in order to evaluate the benefits of such a conversion. A further need exists for techniques for identifying and quantifying a degree to which various asynchronous program characteristics improve overall performance of the distributed application or reduce the required capabilities of the storage system.