As the demand for data storage continues to increase, larger and more sophisticated storage systems are being designed and deployed. Many large-scale data storage systems utilize storage appliances that include arrays of storage media. Multiple storage appliances may be networked together to form a cluster, which allows for an increase in the volume of stored data. The increase in the number of components, the number of users, and the volume of data often results in disparate users creating separate but identical copies of data, leading to rapid growth in the physical storage capacity required. For example, multiple members of a business may use the same operating system or store the same document. In such cases, data deduplication technologies can significantly increase data storage efficiency and reduce cost. Data deduplication technologies remove redundancy from stored data by storing unique data a single time and storing subsequent, redundant copies of that data as indices in a deduplication table pointing to the unique data. As a result, data can be stored in a fraction of the physical space that would otherwise be required. For example, 100 copies of a 10 gigabyte (GB) operating system can be stored with 10 GB of physical capacity, and 1000 copies of the same 1 megabyte (MB) file can be stored with 1 MB of physical capacity.
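The space savings described above can be illustrated with a minimal sketch. It is not the disclosed implementation: the class name `DedupStore`, its methods, the use of SHA-256 as the fingerprint, and the 1 KiB payload (scaled down from the 1 MB example) are all assumptions made for illustration.

```python
import hashlib


class DedupStore:
    """Hypothetical store that keeps each unique payload once."""

    def __init__(self):
        self.table = {}  # deduplication table: fingerprint -> unique data
        self.refs = []   # one fingerprint per logical write

    def write(self, data: bytes) -> int:
        # Look up the fingerprint before writing; store the data only if new.
        fingerprint = hashlib.sha256(data).hexdigest()
        if fingerprint not in self.table:
            self.table[fingerprint] = data
        self.refs.append(fingerprint)
        return len(self.refs) - 1  # index of this logical copy

    def read(self, index: int) -> bytes:
        # Follow the stored reference back to the single physical copy.
        return self.table[self.refs[index]]

    def physical_bytes(self) -> int:
        return sum(len(data) for data in self.table.values())

    def logical_bytes(self) -> int:
        return sum(len(self.table[fp]) for fp in self.refs)


# Write 1000 identical copies of a 1 KiB payload (assumed size for the demo).
store = DedupStore()
payload = b"x" * 1024
for _ in range(1000):
    store.write(payload)
```

In this sketch the 1000 logical copies occupy the physical space of a single copy, and every write first performs a lookup into the table, which is why the location of that table affects write latency.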
Memory caching is widely used in data storage systems. Reading from and writing to cache memory is significantly faster than accessing other storage media, such as spinning media. Data deduplication involves performing a lookup into the deduplication table prior to writing data to determine if the data is a duplicate of existing data. As such, to perform deduplication efficiently and not impact system response time, many data storage systems store the deduplication table in cache memory, such as a dynamic random access memory (DRAM) based cache. However, cache memory remains significantly more expensive than other storage media. Consequently, cache memory is usually only a fraction of the size of other storage media in a data storage system.
In some cases, a deduplication table can grow without bound, exceeding the size of the available memory cache. While this allows for the deduplication of arbitrary amounts of stored data, portions of the deduplication table may be evicted from the memory cache as the size of the deduplication table grows. Specifically, if cache memory is full, existing data must be evicted from the cache memory before new data may be stored. Many caching systems and methods evict data based on algorithms that track recency (evicting data that has been least recently used), frequency (evicting data that has been least frequently used), or some combination of recency and frequency. However, such algorithms do not account for the importance of data, resulting in important data that is not recently or frequently used, such as all or portions of the deduplication table, being evicted from the memory cache into other storage media, such as flash or spinning disks. When all or a portion of the deduplication table is stored in flash or disks, read and write request overhead is substantially increased, resulting in significantly reduced system performance.
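The eviction behavior described above can be sketched with a simple least-recently-used cache. This is an illustrative assumption rather than any particular system's cache: the class `LRUCache`, the `evicted` list, and the key names such as `"ddt:segment0"` are hypothetical.

```python
from collections import OrderedDict


class LRUCache:
    """Hypothetical recency-only cache; it has no notion of importance."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # oldest entry first
        self.evicted = []             # keys pushed out to slower media

    def get(self, key):
        if key not in self.entries:
            return None
        self.entries.move_to_end(key)  # mark as most recently used
        return self.entries[key]

    def put(self, key, value):
        if key in self.entries:
            self.entries.move_to_end(key)
        elif len(self.entries) >= self.capacity:
            # Cache is full: evict the least recently used entry,
            # regardless of what kind of data it holds.
            old_key, _ = self.entries.popitem(last=False)
            self.evicted.append(old_key)
        self.entries[key] = value


cache = LRUCache(capacity=3)
cache.put("ddt:segment0", "deduplication table segment")
cache.put("file:a", "user data A")
cache.put("file:b", "user data B")
cache.get("file:a")                 # recent access keeps file:a in cache
cache.put("file:c", "user data C")  # full cache forces an eviction
```

Here the deduplication table segment is the entry evicted, simply because it was touched least recently. A later write that needs that segment would have to fetch it from slower media before the duplicate check can complete.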
It is with these observations in mind, among others, that various aspects of the present disclosure were conceived and developed.