The present invention relates to data storage, and more specifically, this invention relates to data deduplication in a primary storage environment.
Storage systems which store large amounts of data sparsely written within a virtual namespace can partition the namespace into regions, each region being managed as a non-overlapping portion of the namespace. As an example, a block storage system may provision many volumes, each volume having an address space of many gigabytes (GBs). Each volume may in turn include a plurality of regions, and a region may span 1-100 megabytes (MBs) within the volume. Thus, each volume is partitioned into multiple regions, each region managing the data stored in its own portion of the namespace.
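By way of illustration only, the partitioning described above may be sketched as a mapping from a volume offset to a region index. The 64 MB region size and the names `REGION_SIZE` and `locate` are hypothetical choices made for this sketch and are not taken from any particular storage system.

```python
from typing import Tuple

# Hypothetical fixed region size, chosen from within the 1-100 MB range.
REGION_SIZE = 64 * 1024 * 1024


def locate(volume_id: int, offset: int,
           region_size: int = REGION_SIZE) -> Tuple[int, int, int]:
    """Map a byte offset within a volume to the region that manages it.

    Returns (volume_id, region_index, offset_within_region). Regions are
    non-overlapping, so every offset belongs to exactly one region.
    """
    if offset < 0:
        raise ValueError("offset must be non-negative")
    region_index, region_offset = divmod(offset, region_size)
    return volume_id, region_index, region_offset
```

Under these assumptions, an access at offset 200 MB within a volume falls in the fourth region (index 3) at an offset of 8 MB within that region.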
Furthermore, in a primary storage system which is dominated by complex read and write data accesses of relatively small size (e.g. 4 KB or 64 KB), performance is often a key requirement and therefore persistent metadata utilized to service data requests must be primarily referenced while in fast-access memory. In conventional storage systems, it is not always possible to keep all metadata needed to efficiently manage the entire namespace in fast-access memory, as the amount of metadata necessary for such management may exceed the available memory.
The amount of metadata necessary for efficient management of a namespace may also increase in systems employing data deduplication to maximize the amount of available storage in the system. Data deduplication generally involves the identification of duplicate (triplicate, etc.) data portions, e.g. on different volumes or regions within the namespace, and reduction of the amount of storage consumed by freeing the storage space associated with all but one (or a relatively small number in cases where redundancy is desirable) copy of the data. To maintain consistency and provide access to the data, references such as pointers, etc. may be implemented to direct access requests to the single retained copy.
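A minimal sketch of the general deduplication concept described above follows. It assumes, purely for illustration, that duplicates are identified by a SHA-256 fingerprint of each data portion and that exactly one copy is retained; the class `DedupStore` and its fields are hypothetical and do not represent any specific conventional implementation.

```python
import hashlib
from typing import Dict, Tuple


class DedupStore:
    """Toy in-memory store keeping one copy of each distinct data portion."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}           # fingerprint -> retained copy
        self.refcount: Dict[str, int] = {}           # fingerprint -> number of references
        self.refs: Dict[Tuple[int, int], str] = {}   # (volume, offset) -> fingerprint

    def write(self, volume: int, offset: int, data: bytes) -> None:
        fp = hashlib.sha256(data).hexdigest()
        old = self.refs.get((volume, offset))
        if old == fp:
            return  # same data rewritten in place; nothing changes
        if old is not None:
            self._release(old)  # location no longer points at its old data
        if fp not in self.chunks:
            self.chunks[fp] = data  # first copy: actually store the data
            self.refcount[fp] = 0
        self.refcount[fp] += 1
        self.refs[(volume, offset)] = fp  # duplicates store only a reference

    def read(self, volume: int, offset: int) -> bytes:
        # Follow the reference to the single retained copy.
        return self.chunks[self.refs[(volume, offset)]]

    def _release(self, fp: str) -> None:
        self.refcount[fp] -= 1
        if self.refcount[fp] == 0:
            # Last reference removed: free the storage for this data.
            del self.refcount[fp]
            del self.chunks[fp]
```

In this sketch, writing identical 4 KB portions to two different volumes consumes storage for one copy, while the `refs` map gains one entry per written location. The `refs` and `refcount` maps are the kind of reference metadata that must be maintained alongside the data itself.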
While deduplication effectively increases available storage compared to retaining a plurality of redundant duplicates, the technique requires additional metadata to manage the references pointing from the duplicated location to the retained data location.
In addition, primary storage systems are distinct from backup storage systems in which conventional deduplication techniques are employed, in that the size of the data portions used for detecting the presence of duplicates is much smaller than that used for deduplication in backup storage systems. This further increases the amount of metadata necessary to manage the storage system, exacerbating the impact on overall system performance.
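The effect of data portion size on metadata volume may be illustrated with a rough calculation. The figures used here (32 bytes of metadata per data portion, 1 TiB of managed capacity, and a 1 MiB portion size for the backup case) are assumptions made solely for this sketch, and `metadata_bytes` is a hypothetical helper.

```python
def metadata_bytes(capacity_bytes: int, portion_bytes: int,
                   entry_bytes: int = 32) -> int:
    """Estimate metadata size, assuming one fixed-size entry per data portion."""
    portions = -(-capacity_bytes // portion_bytes)  # ceiling division
    return portions * entry_bytes


TIB = 2 ** 40
small = metadata_bytes(TIB, 4 * 1024)      # 4 KiB portions (primary-style)
large = metadata_bytes(TIB, 1024 * 1024)   # 1 MiB portions (assumed backup-style)
print(small // 2 ** 30, "GiB vs", large // 2 ** 20, "MiB")
```

Under these assumptions, 4 KiB portions require about 8 GiB of metadata per TiB of capacity, compared to 32 MiB for 1 MiB portions, a 256-fold difference that scales directly with the ratio of the portion sizes.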
This is especially the case for primary storage systems which, unlike backup storage systems, must perform deduplication as data arrives rather than periodically according to a deduplication schedule. In addition, the performance of primary storage systems is largely measured according to input/output throughput, and when coupled with the relatively small data portion size used to detect duplicates, the need to identify duplicates at time of arrival (e.g. upon receipt of a write request) has a significant and detrimental impact on system performance.
Accordingly, efficiently managing the metadata in fast-access memory is of great significance, particularly for primary storage systems for which conventional deduplication techniques are not suitable. It would therefore be beneficial to provide techniques, systems, and corresponding computer program products for efficiently managing deduplication metadata in the context of primary storage systems.