The present invention relates in general to data processing systems, and in particular, to a method and a system for storing data files in a file system.
Some types of applications characteristically store large numbers of highly redundant (similar) unstructured data objects (files) in a file system. One example is an application processing and storing genomic sequence data of a large number of individuals of the same species. Such applications are increasingly used in the life science industry, generating significant data volumes and storing them as a plurality of files in file systems. In the case of applications for genomic sequence data, the scanning speed of genetic sequencers increases exponentially with each new generation, producing even more data that can hardly be stored on storage devices at reasonable cost. Genetic sequencers use the application programming interface (API) of a file system. For network attached storage (NAS), the data are sent via a network protocol such as the Network File System (NFS) protocol, the Server Message Block (SMB) protocol, or alternative protocols, and are stored in the NAS device, which uses a file system internally. There are other application areas that also generate very similar content to be stored in multiple files, for example applications recording, processing and storing seismic exploration data.
Some storage systems optimize storage capacity by eliminating identical copies of stored data. In some cases, stored data is divided into segments. A new segment that is desired to be stored is first compared against those segments already stored. If an identical segment is already stored on the system, a reference to that segment is stored instead of storing the new segment. This is referred to as identity compression.
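The identity compression described above can be illustrated by a minimal sketch. It assumes fixed-size segments and uses a SHA-256 digest as a stand-in for comparing a new segment against the segments already stored; the class name `DedupStore`, the method names, and the segment size are hypothetical and are not taken from any particular storage system.

```python
import hashlib


class DedupStore:
    """Toy segment store that keeps one copy of each distinct segment."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size  # assumed fixed segment length
        self.segments = {}  # digest -> segment bytes (stored once)
        self.files = {}     # file name -> ordered list of digests

    def put(self, name, data):
        """Divide data into segments and store references to them."""
        refs = []
        for i in range(0, len(data), self.segment_size):
            seg = data[i:i + self.segment_size]
            digest = hashlib.sha256(seg).hexdigest()
            # Store the segment only if no identical segment exists yet.
            self.segments.setdefault(digest, seg)
            refs.append(digest)
        self.files[name] = refs

    def get(self, name):
        """Rebuild a file by following its segment references."""
        return b"".join(self.segments[d] for d in self.files[name])
```

In this sketch, two files that share segments at the same segment boundaries occupy the space of their distinct segments only; a segment that differs by a single byte is stored in full again, which is the limitation that delta compression addresses.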
Despite increasing capacities of storage systems and network links, there are often benefits to reducing the size of file objects that are stored and/or transmitted. Examples of environments that would benefit include mobile devices with limited storage, communication over telephone links, or storage of reference data, which is data that is written, saved permanently, and often never again accessed. Other examples include wide-area transfers of large objects, such as scientific data sets, or transfers over saturated links. In self-contained storage systems, for example, in which all data is stored in a single location, the data can take the form of files in a file system, objects in a database, or data held on another storage device.
Numerous techniques for reducing large object sizes exist including data compression, duplicate suppression, and delta encoding. Data compression is the elimination of redundancy internally within an object. Duplicate suppression is the process of eliminating redundancy caused by identical objects. Delta encoding or compression eliminates redundancy of an object relative to another object, which may be an earlier version of the object having the same name. A delta compression method, for example, optimizes storage capacity by comparing a new segment that is desired to be stored against those segments already stored and looking for a similar though not necessarily identical segment. If a similar segment is already stored on the system, a delta between the old and new segment is computed and a reference to the old segment and the delta is stored in place of the entire new segment.
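The delta step described above can be illustrated as follows. This is a sketch only: it uses `difflib.SequenceMatcher` from the Python standard library to find the bytes shared by an old and a new segment, and the function names `make_delta` and `apply_delta` as well as the operation format are hypothetical. A production delta encoder would typically use a purpose-built algorithm.

```python
from difflib import SequenceMatcher


def make_delta(base: bytes, new: bytes):
    """Return a list of operations that rebuilds `new` from `base`."""
    ops = []
    matcher = SequenceMatcher(None, base, new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Bytes shared with the base are stored as a reference only.
            ops.append(("copy", i1, i2))
        elif tag in ("replace", "insert"):
            # Bytes that occur only in the new segment are stored literally.
            ops.append(("data", new[j1:j2]))
        # "delete": bytes of the base that are absent from `new` are skipped.
    return ops


def apply_delta(base: bytes, ops) -> bytes:
    """Reconstruct the new segment from the base and its delta."""
    out = bytearray()
    for op in ops:
        if op[0] == "copy":
            out += base[op[1]:op[2]]
        else:
            out += op[1]
    return bytes(out)
```

Storing the reference to the old segment together with the result of `make_delta` is sufficient to recover the new segment through `apply_delta`; the saving depends on how many bytes the two segments have in common.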
US 2011/0196869 A1 discloses a method for cluster storage. A storage system uses a cluster of nodes to store incoming data. Incoming data is segmented. Each segment is characterized for assignment for storage on a given node. On the given node of the cluster, segments are stored in a manner that deduplicates segment storage.
Segments are deduplicated on each node of the cluster using delta compression. Delta compression allows the use of large segments for distributing efficiently to nodes so that sequential bytes are stored close to each other on disk. Delta compression efficiently stores segments that are similar to each other by storing one base and, for other similar segments, storing only a delta from the base along with a reference to the base. If a segment is not similar to a previously stored base, the new segment is stored as a new base and possibly a delta from that base.
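One conceivable way to characterize a segment for assignment to a node is sketched below. It is an assumption made for illustration and is not the method of the cited publication: the segment is reduced to the smallest hash over all of its fixed-length byte windows, so that segments sharing most of their content are likely, though not guaranteed, to be assigned to the same node, where a similar base may already be stored. The function name `route_segment` and the window length are hypothetical.

```python
import hashlib


def route_segment(segment: bytes, num_nodes: int, shingle: int = 8) -> int:
    """Pick a node index from the smallest hash over all byte windows."""
    if len(segment) < shingle:
        # Segments shorter than one window are hashed as a whole.
        windows = [segment]
    else:
        windows = [segment[i:i + shingle]
                   for i in range(len(segment) - shingle + 1)]
    # The minimum window hash serves as a content-derived feature.
    feature = min(hashlib.sha256(w).digest() for w in windows)
    return int.from_bytes(feature[:8], "big") % num_nodes
```

Because the feature is derived from content rather than from position or file name, identical segments always reach the same node, and a small local change alters the result only if it touches the window that produced the minimum hash.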