Technical Field
The present disclosure relates to storage systems and, more specifically, to a flash optimized, log-structured layer of a file system of one or more storage systems of a cluster.
Background Information
A storage system typically includes one or more storage devices, such as solid state drives (SSDs) embodied as flash storage devices, into which information may be entered, and from which the information may be obtained, as desired. The storage system may implement a high-level module, such as a file system, to logically organize the information stored on the devices as storage containers, such as files or logical units (LUNs). Each storage container may be implemented as a set of data structures, such as data blocks that store data for the storage containers and metadata blocks that describe the data of the storage containers. For example, the metadata may describe (e.g., identify) storage locations on the devices for the data. In addition, the metadata may contain multiple copies of a reference to a storage location for the data (i.e., many-to-one), thereby requiring updates to each copy of the reference when the location of the data changes, e.g., during a “cleaning” process. Updating every copy contributes significantly to write amplification as well as to system complexity (i.e., tracking the references to be updated).
Some types of SSDs, especially those with NAND flash components, may include an internal controller (i.e., inaccessible to a user of the SSD) that moves valid data from old locations to new locations among those components at the granularity of a page (e.g., 8 KB) and then only to previously erased pages. Thereafter, the old locations where the pages were stored are freed, i.e., the pages are marked for deletion (or as invalid). Typically, the pages are erased exclusively in blocks of 32 or more pages (i.e., 256 KB or more). This process is generally referred to as garbage collection and results in substantial write amplification in the system.
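The effect of garbage collection on write amplification can be illustrated with a minimal sketch. It assumes the example figures above (8 KB pages, 32 pages per erase block) and a simplified model in which the controller must relocate every still-valid page of a block before erasing it, and the reclaimed pages are then filled with new writes. The function name and the model are illustrative assumptions, not a description of any particular SSD controller.

```python
PAGE_KB = 8            # example page size from the text
PAGES_PER_BLOCK = 32   # example erase-block size from the text (256 KB)


def gc_write_amplification(valid_pages: int,
                           pages_per_block: int = PAGES_PER_BLOCK) -> float:
    """Flash pages written per page of new data after reclaiming one block.

    Hypothetical model: the valid pages are copied elsewhere, the block is
    erased, and only the reclaimed pages can accept new data.
    """
    if not 0 <= valid_pages < pages_per_block:
        raise ValueError("valid_pages must be in [0, pages_per_block)")
    reclaimed = pages_per_block - valid_pages   # pages freed for new data
    relocated = valid_pages                     # extra writes caused by GC
    return (reclaimed + relocated) / reclaimed


if __name__ == "__main__":
    for valid in (0, 16, 24):
        print(valid, gc_write_amplification(valid))
```

Under this model, a block that is still three-quarters valid (24 of 32 pages) costs four page writes for every page of new data, whereas a block with no valid pages costs only one.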
In addition, the “on-disk” layout of the data structures in the storage containers (i.e., on the SSDs) may create a plurality of odd-shaped random “hole” (i.e., deleted data) fragments adjacent to data. This fragmented data (i.e., data with interposed holes) may not align naturally with the stripe boundaries of Redundant Array of Independent Disks (RAID) configurations, which complicates the RAID implementation. For example, if an attempt is made to write data into the odd-shaped fragments, it may be difficult to achieve good RAID stripe efficiency because partial stripes may be written, causing increased write amplification due to increased parity overhead.
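The parity overhead of partial stripe writes can be sketched as follows. The example assumes a hypothetical stripe of 8 data devices and 2 parity devices, assumes that every parity device is written once whenever any part of the stripe is written, and ignores any reads needed to compute the parity. The function name and stripe geometry are illustrative assumptions only.

```python
def device_writes_per_data_chunk(data_chunks_written: int,
                                 data_devices: int = 8,
                                 parity_devices: int = 2) -> float:
    """Device writes issued per chunk of data stored in one stripe.

    Hypothetical model: each write to a stripe, full or partial, also
    rewrites every parity chunk of that stripe.
    """
    if not 1 <= data_chunks_written <= data_devices:
        raise ValueError("data_chunks_written must be in [1, data_devices]")
    return (data_chunks_written + parity_devices) / data_chunks_written


if __name__ == "__main__":
    for chunks in (8, 2, 1):
        print(chunks, device_writes_per_data_chunk(chunks))
```

In this model a full stripe costs 1.25 device writes per data chunk, while filling a one-chunk hole costs 3, which is why writing into small fragments is expensive.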
Yet another source of write amplification in the system may involve RAID-related operations. Assume a dual parity RAID implementation that may include a plurality of data SSDs and two parity SSDs. A random write operation that stores write data on a data SSD of a RAID stripe may result in a plurality of read-modify-write (RMW) operations that, e.g., update the data SSD with the write data and update the two parity SSDs with parity information after reading a portion of the write data and/or parity information. Such RAID-related operations result in a substantial amount of write amplification in the system.
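A minimal sketch of such an RMW operation is given below. It assumes the conventional small-write approach in which the old data and old parity are read, the parity is updated from the difference, and the new data and parity are written back. Only a simple XOR row parity is computed; the second parity of a dual parity scheme typically uses a different code and is only counted here, not computed. All function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def full_parity(chunks: list[bytes]) -> bytes:
    """XOR row parity computed over every data chunk of a stripe."""
    return reduce(xor_bytes, chunks)


def rmw_parity(old_data: bytes, old_parity: bytes, new_data: bytes) -> bytes:
    """New row parity derived only from the chunk being overwritten."""
    return xor_bytes(xor_bytes(old_parity, old_data), new_data)


def rmw_io_counts(parity_devices: int = 2) -> tuple[int, int]:
    """(reads, writes) for one small write: the data chunk plus each parity."""
    return 1 + parity_devices, 1 + parity_devices


if __name__ == "__main__":
    stripe = [b"\x01\x02", b"\x10\x20", b"\xaa\xbb"]
    parity = full_parity(stripe)
    new_chunk = b"\xff\x00"
    new_parity = rmw_parity(stripe[1], parity, new_chunk)
    stripe[1] = new_chunk
    # The incremental update matches a full recomputation of the parity.
    assert new_parity == full_parity(stripe)
    print(rmw_io_counts(2))
```

With two parity devices, one chunk of write data therefore leads to three device reads and three device writes under these assumptions.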
Therefore, it is desirable to provide a file system that reduces various sources of write amplification in a storage system, wherein the sources of write amplification include, inter alia, 1) storage location reference updates; 2) internal SSD garbage collection; 3) partial RAID stripe operations from fragmented data; and 4) RMW operations from RAID organizations of data and parity.