Technical Field
The present disclosure relates to storage systems and, more specifically, to global extent-based de-duplication for one or more storage systems of a cluster.
Background Information
A storage system typically includes one or more storage devices, such as solid state drives (SSDs) embodied as flash storage devices, into which information may be entered, and from which the information may be obtained, as desired. The storage system may implement a high-level module, such as a file system, to logically organize the information stored on the devices as storage containers, such as files or logical units (LUNs). Each storage container may be implemented as a set of data structures, such as data blocks that store data for the storage containers and metadata blocks that describe the data of the storage containers. For example, the metadata may describe, e.g., identify, storage locations on the devices for the data. In addition, the metadata may contain copies of a reference to a storage location for the data (i.e., many-to-one), thereby requiring updates to each copy of the reference when the location of the data changes, e.g., during a “cleaning” process. This contributes significantly to write amplification as well as to system complexity (i.e., tracking the references to be updated).
Some types of SSDs, especially those with NAND flash components, may or may not include an internal controller (i.e., inaccessible to a user of the SSD) that moves valid data from old locations to new locations among those components at the granularity of a page (e.g., 8 KB) and then only to previously-erased pages. Thereafter, the old locations where the pages were stored are freed, i.e., the pages are marked for deletion (or as invalid). Typically, the pages are erased exclusively in blocks of 32 or more pages (i.e., 256 KB or more). This process is generally referred to as garbage collection and results in substantial write amplification in the system.
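By way of illustration only, the page and erase-block arithmetic above may be sketched as follows. The 8 KB page size and the 32-page erase block are the example values given above; the write-amplification formula is a simplified model assumed here for illustration (it is not taken from the disclosure), and the function names are hypothetical.

```python
PAGE_SIZE_KB = 8       # example page granularity from the text
PAGES_PER_BLOCK = 32   # example minimum erase-block size from the text


def erase_block_kb(page_kb=PAGE_SIZE_KB, pages=PAGES_PER_BLOCK):
    """Size of one erase block in KB."""
    return page_kb * pages


def gc_write_amplification(valid_pages, pages_per_block=PAGES_PER_BLOCK):
    """Write amplification for reclaiming one erase block (assumed model).

    Reclaiming a block that still holds `valid_pages` valid pages requires
    relocating those pages, and frees (pages_per_block - valid_pages) pages
    for new host writes. Amplification is taken as
    (host writes + relocated writes) / host writes.
    """
    if not 0 <= valid_pages < pages_per_block:
        raise ValueError("valid_pages must be in [0, pages_per_block)")
    freed = pages_per_block - valid_pages
    return (freed + valid_pages) / freed


if __name__ == "__main__":
    print(erase_block_kb())             # 256
    print(gc_write_amplification(0))    # 1.0
    print(gc_write_amplification(24))   # 4.0
```

Under this assumed model, a block that is reclaimed while three quarters of its pages are still valid costs four page writes for every page of new data, whereas a fully invalid block costs only one.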
Another source of write amplification occurs in storage systems that use de-duplication to reduce an amount of storage capacity consumed by previously stored data. Such systems may have substantial write amplification because data is typically de-duplicated after it is written to SSD, e.g., by a scrubbing process, and not prior to storage on SSD. The storage of the duplicate data unavoidably contributes to write amplification. That is, duplicate data is not prevented from being written to SSD in the first place, but is only erased afterwards. Further, the step of erasing the duplicate data from the SSD itself contributes to write amplification.
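The difference between de-duplicating after storage and de-duplicating prior to storage may be illustrated with the following minimal sketch. It only counts writes and later erasures for a list of in-memory extents; the use of SHA-256 as a fingerprint, the function names, and the returned fields are all assumptions made for illustration and are not described above.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content fingerprint (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def writes_post_process(extents):
    """Every extent is written first; a later scrub removes duplicates."""
    written = len(extents)
    unique = len({fingerprint(e) for e in extents})
    return {"written": written, "erased_later": written - unique, "stored": unique}


def writes_inline(extents):
    """Duplicates are detected before storage, so they are never written."""
    seen = set()
    written = 0
    for extent in extents:
        fp = fingerprint(extent)
        if fp not in seen:
            seen.add(fp)
            written += 1
    return {"written": written, "erased_later": 0, "stored": written}


if __name__ == "__main__":
    sample = [b"a", b"b", b"a", b"a"]
    print(writes_post_process(sample))
    print(writes_inline(sample))
```

For the four sample extents, of which two are unique, the post-storage path writes all four and later erases two, while the prior-to-storage path writes only two and erases none; both end with the same two extents stored.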
Therefore, it is desirable to provide a file system that reduces sources of write amplification from a storage system, wherein the sources of write amplification include, inter alia, 1) storage location reference updates; 2) internal SSD garbage collection; and 3) de-duplication of data after storage.