Parallel storage systems are widely used in many computing environments. They provide a high degree of concurrency, allowing many distributed processes within a parallel application to access a shared file namespace simultaneously.
Parallel computing techniques are used in many industries and applications for implementing computationally intensive models or simulations. For example, the Department of Energy uses a large number of distributed compute nodes tightly coupled into a supercomputer to model physics experiments. In the oil and gas industry, parallel computing techniques are often used for computing geological models that help predict the location of natural resources. Generally, each parallel process generates a portion, referred to as a data chunk, of a shared data object.
De-duplication is a common technique for reducing redundant data by eliminating duplicate copies of repeating data. It is used to improve storage utilization and to reduce the number of bytes that must be sent in network data transfers. Typically, unique chunks of data are identified by “fingerprints” and stored during an analysis process. As the analysis progresses, other chunks are compared to the stored copies and, when a match is detected, the redundant chunk is replaced with a reference that points to the stored chunk.
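The general process described above can be illustrated with a minimal sketch. It is not drawn from any particular storage system: the `ChunkStore` class and the `deduplicate` and `reconstruct` helpers are hypothetical names, and the use of SHA-256 as the fingerprint is an assumption made for illustration.

```python
import hashlib


class ChunkStore:
    """Toy in-memory store that keeps one copy of each unique chunk (hypothetical)."""

    def __init__(self):
        self.chunks = {}  # fingerprint -> the single stored copy of that chunk

    def put(self, chunk: bytes) -> str:
        """Store the chunk only if it is new; return its fingerprint as the reference."""
        # Assumption: a SHA-256 digest serves as the fingerprint, and chunks with
        # equal digests are treated as identical (collisions are ignored here).
        fingerprint = hashlib.sha256(chunk).hexdigest()
        if fingerprint not in self.chunks:
            self.chunks[fingerprint] = chunk
        return fingerprint

    def get(self, fingerprint: str) -> bytes:
        """Return the stored chunk that a reference points to."""
        return self.chunks[fingerprint]


def deduplicate(chunks, store):
    """Replace every chunk with a reference to its stored copy."""
    return [store.put(chunk) for chunk in chunks]


def reconstruct(references, store):
    """Rebuild the original data by following each reference."""
    return b"".join(store.get(ref) for ref in references)
```

In this sketch, writing the chunks `b"aaaa"`, `b"bbbb"`, `b"aaaa"` leaves two stored chunks and three references, the first and third of which are identical.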
Existing approaches de-duplicate the shared data object only after it has been sent to the storage system. The de-duplication is applied to offset ranges of the shared data object, in sizes that are pre-defined by the file system.
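As a rough illustration of this storage-side approach, the following sketch walks an already assembled object in fixed-size offset ranges and records, for each range, the offset of the first range with identical content. The function name `dedup_offset_ranges`, the `BLOCK_SIZE` value, and the use of SHA-256 are assumptions chosen for illustration and do not describe any specific file system.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical range size, kept tiny so the example is easy to follow


def dedup_offset_ranges(obj: bytes, block_size: int = BLOCK_SIZE):
    """Return (offset, length, stored_offset) for each fixed-size range of obj.

    stored_offset is the offset of the first range with the same content, so a
    range whose stored_offset differs from its own offset is a duplicate.
    """
    first_seen = {}  # fingerprint -> offset of the first range with that content
    mapping = []
    for offset in range(0, len(obj), block_size):
        block = obj[offset:offset + block_size]
        fingerprint = hashlib.sha256(block).digest()
        # setdefault records the offset only the first time a fingerprint appears
        stored_offset = first_seen.setdefault(fingerprint, offset)
        mapping.append((offset, len(block), stored_offset))
    return mapping
```

Note that the range boundaries here are set by the fixed block size alone, independent of how the writing processes divided the object into data chunks.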
In parallel computing systems, such as those running High Performance Computing (HPC) applications, the inherently large and complex datasets increase the resources required for data storage and transmission. A need therefore exists for parallel techniques for de-duplicating data chunks being written to a shared object.