The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for a two-level log structured array (LSA) architecture using coordinated garbage collection for flash arrays.
Performance characteristics of NAND flash-based solid-state disks (SSDs) are fundamentally different from traditional hard disk drives (HDDs). Typically, data are organized in pages of 4, 8, or 16 KiB sizes. Page read operations are typically one order of magnitude faster than write operations, and unlike HDDs, latency depends on neither current nor previous location of operations. However, memory locations must be erased prior to writing to them. The size of an erase block unit is typically 256 pages. The erase operations take approximately one order of magnitude more time than a page write operation. Due to these inherent properties of the NAND flash technology, SSDs write data out-of-place and maintain a mapping table that maps logical addresses to physical addresses, i.e., the logical-to-physical translation (LPT) table.
As flash chips/blocks/pages/cells might expose errors or completely fail due to limited endurance or other reasons, additional redundancy must be used within flash pages (e.g., error correction code (ECC) such as BCH) as well as across flash chips (e.g., RAID-5 or RAID-6 like schemes). While the addition of ECC in pages is straightforward, the organization of flash blocks into RAID-like stripes is more complex because individual blocks have to be retired over time requiring either reorganizing the stripes or shrinking the capacity of the affected stripe. This organization of stripes together with the LPT defines the placement of data. SSDs today utilize a so-called log structured array (LSA) architecture, which combines these two methods.
In write-out-of-place, a write operation will write new data to a new location in flash memory, thereby updating the mapping information and implicitly invalidating data at the old location. The invalidated data location cannot be reused until the entire block is garbage collected, which means any still valid data in the block must be relocated to a new location before the block can be erased. Garbage collection (GC) of a block is typically deferred as long as possible to reduce the number of valid pages that must be relocated. Upon garbage collection, pages that have to be relocated cause additional write operations; this is often denoted as write amplification.
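The write-out-of-place and garbage collection behavior described above can be illustrated with a toy model. This is a minimal sketch, not an actual flash controller: the class `ToyFlash`, its fields, and the four-page block size are hypothetical names and values chosen for illustration, and the model assumes a free block is always available when the open block fills up.

```python
class ToyFlash:
    """Toy out-of-place flash model (hypothetical, for illustration only)."""

    def __init__(self, num_blocks, pages_per_block):
        self.ppb = pages_per_block
        # Each block is a list of slots; a slot holds a logical address,
        # or None once the data in that slot has been invalidated.
        self.blocks = [[] for _ in range(num_blocks)]
        self.free = list(range(1, num_blocks))  # erased blocks
        self.open = 0                           # block currently being filled
        self.lpt = {}                           # logical -> (block, page)
        self.user_writes = 0
        self.flash_writes = 0

    def _program(self, lba):
        # Assumption: a free block exists whenever the open block is full.
        if len(self.blocks[self.open]) == self.ppb:
            self.open = self.free.pop(0)
        old = self.lpt.get(lba)
        if old is not None:
            block, page = old
            self.blocks[block][page] = None     # implicit invalidation
        self.blocks[self.open].append(lba)
        self.lpt[lba] = (self.open, len(self.blocks[self.open]) - 1)
        self.flash_writes += 1

    def write(self, lba):
        """A user write: always lands in a new physical location."""
        self.user_writes += 1
        self._program(lba)

    def garbage_collect(self, block):
        """Relocate still-valid pages, then erase the block."""
        assert block != self.open
        for lba in list(self.blocks[block]):
            if lba is not None:
                self._program(lba)              # relocation = extra write
        self.blocks[block] = []                 # erase
        self.free.append(block)

    def write_amplification(self):
        return self.flash_writes / self.user_writes
```

For example, writing logical pages 0 to 3, overwriting pages 0 and 1, and then garbage collecting the first block relocates the two still-valid pages, so six user writes turn into eight flash writes in this model.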
Due to limited endurance of NAND flash devices, the reduction of write amplification is very important. In fact, with shrinking technology nodes in NAND flash, endurance is dropping, hence making any sort of write reduction or write elimination even more important. Note that the garbage collection unit of operation depends on the implementation details of the flash management logic, ranging from a flash block in a simple flash controller to a RAID stripe of flash blocks, referred to as a “block stripe,” in case the flash controller implements RAID functionality at the flash channel level, or any other organization of flash blocks (e.g., Reed-Solomon codes) that the flash controller implements.
Existing flash arrays on the market include a set of independent flash nodes, flash cards, or SSDs connected to a RAID controller. The flash nodes operate independently of each other and manage the flash memory space in an LSA fashion. The RAID controller therefore does not see physical block addresses (PBAs) of the flash directly, but logical addresses referred to herein as node logical block addresses (nodeLBAs). Hosts access the flash array through a Peripheral Component Interconnect Express (PCIe), Fibre Channel, or similar interface that connects to the RAID controller. The RAID controller maps the host logical block address (hostLBA) space seen by the hosts to a nodeLBA address space in an implicit way that does not require maintaining a mapping table. This requires no additional metadata or control structures. A logical block address, such as a hostLBA or nodeLBA, typically addresses a data storage unit of 4 KiB or 512 bytes, and hence is not related to the flash block size. Also, the RAID controller does write-in-place updates, as the LSA in each node below performs flash management functions transparently. However, in the case of small random writes, partial stripe writes cause two write operations for each user write operation: one for the data and another for the updated parity. As a result, small random writes add a factor of close to two to the system write amplification.
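An implicit, table-free mapping of this kind can be sketched as pure arithmetic. The following is only one possible layout, not the mapping of any particular product: it assumes a RAID-5 arrangement with one LBA per node per stripe and a parity position that rotates by one node per stripe, and the function names `map_host_lba` and `node_writes_for_update` are hypothetical.

```python
def map_host_lba(host_lba, num_nodes):
    """Return (data_node, parity_node, node_lba) for a hostLBA.

    Assumed layout: RAID-5, one LBA per node per stripe, parity
    rotating by one node per stripe. No mapping table is consulted.
    """
    data_per_stripe = num_nodes - 1
    stripe = host_lba // data_per_stripe
    offset = host_lba % data_per_stripe
    parity_node = (num_nodes - 1 - stripe) % num_nodes
    # Data positions skip over the node that holds parity in this stripe.
    data_node = offset if offset < parity_node else offset + 1
    return data_node, parity_node, stripe


def node_writes_for_update(host_lba, num_nodes):
    """Node writes caused by a single small in-place update."""
    data_node, parity_node, node_lba = map_host_lba(host_lba, num_nodes)
    # One write for the new data and one for the recomputed parity.
    return [(data_node, node_lba), (parity_node, node_lba)]
```

Under these assumptions, every small update touches exactly two nodes, which is where the factor of close to two in write amplification comes from.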
When data are written in a full stripe, only one single additional write operation is generated for N host writes and system write amplification is reduced to (N+P)/N, where N corresponds to the number of data strips and P to the number of parity strips in a RAID stripe. With an array of seven data nodes plus one parity node, N=7 and P=1, resulting in a significantly lower write amplification of approximately 1.14. Therefore, to reduce write amplification, it is beneficial to write entire stripes. If the user writes are written to the nodes in an LSA fashion inside the RAID controller, data to be written can be grouped into containers so that they are written as full stripe writes, which minimizes write amplification. A container would typically hold a single or multiple RAID stripes and all containers would be of equal size. As those updated pages are written as full stripe writes by the RAID controller, the above mentioned write amplification from RAID-5 is significantly reduced compared to the implicit static hostLBA to nodeLBA address mapping.
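The (N+P)/N figure can be reproduced by simply counting writes. The sketch below is illustrative only: the class name `ContainerWriter` is hypothetical, a container is assumed to hold exactly one RAID stripe, and writes still waiting in a partially filled container are not counted.

```python
class ContainerWriter:
    """Counts node writes when host writes are batched into full stripes."""

    def __init__(self, n_data, n_parity=1):
        self.n_data = n_data
        self.n_parity = n_parity
        self.pending = []        # host writes buffered in the open container
        self.host_writes = 0
        self.node_writes = 0

    def write(self, host_lba):
        self.host_writes += 1
        self.pending.append(host_lba)
        if len(self.pending) == self.n_data:
            # Full stripe: N data writes plus P parity writes.
            self.node_writes += self.n_data + self.n_parity
            self.pending.clear()

    def write_amplification(self):
        return self.node_writes / self.host_writes


w = ContainerWriter(n_data=7, n_parity=1)
for lba in range(70):            # a multiple of N, so nothing stays pending
    w.write(lba)
print(round(w.write_amplification(), 2))   # 80 / 70, prints 1.14
```

The same count for in-place small updates would be two node writes per host write with P=1, which is the case the container grouping is meant to avoid.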
Stacking two LSA architectures, one on the array level and the other on the flash nodes, is the straightforward approach to alleviate the write amplification due to the read-modify-write of the parity for partial stripe writes. However, the following issues must be addressed: (1) in a naïve approach, the total overprovisioning would be roughly doubled because each LSA level typically requires its own overprovisioning; and (2) as the garbage collectors on each level operate independently of each other, data are relocated on each level, resulting in additional writes and, hence, higher write amplification. In order to address these issues, the array-level container size should match, and be aligned with, the size of the underlying nodes' garbage collection unit (i.e., a stripe, assuming a RAID scheme is implemented at the node level as well, or a flash block otherwise), such that array-level container writes always result in fully invalidated blocks at the node level. The higher-level GC then does all relocation work, while only entirely invalid blocks are garbage collected at the lower level. Unfortunately, even if the underlying geometry is known, the size of the nodes' garbage collection unit might be of variable length due to flash blocks being retired over time or failed planes (i.e., variable stripe RAID). For off-the-shelf SSDs, the geometry is usually unknown.
Because it is not always possible to align the container size to the underlying node geometry, a two-level LSA scheme performs garbage collection at both levels: on the RAID controller and inside each node. As those garbage collectors run independently of each other, additional write amplification is potentially incurred. Worse, significant overprovisioning is required at both levels, which wastes flash space or further increases write amplification.