The present application relates generally to an improved data processing apparatus and method and more specifically to mechanisms for a two-level log structured array (LSA) architecture with minimized write amplification.
Performance characteristics of NAND flash-based solid-state disks (SSDs) are fundamentally different from those of traditional hard disk drives (HDDs). Typically, data are organized in pages of 4, 8, or 16 KiB. Page read operations are typically one order of magnitude faster than write operations, and unlike HDDs, latency depends on neither the current nor the previous location of operations. However, memory locations must be erased prior to writing to them. The size of an erase block unit is typically 256 pages. Erase operations take approximately one order of magnitude more time than a page write operation. Due to these inherent properties of NAND flash technology, SSDs write data out-of-place and maintain a mapping table that maps logical addresses to physical addresses, i.e., the logical-to-physical table (LPT).
As flash chips/blocks/pages/cells might expose errors or completely fail due to limited endurance or other reasons, additional redundancy must be used within flash pages (e.g., error correction code (ECC) such as BCH) as well as across flash chips (e.g., RAID-5 or RAID-6 like schemes). While the addition of ECC in pages is straightforward, the organization of flash blocks into RAID-like stripes, referred to as “block stripes,” is more complex because individual blocks have to be retired over time requiring either reorganizing the stripes or shrinking the capacity of the affected stripe.
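The cross-chip redundancy described above can be illustrated with a minimal sketch of RAID-5 style XOR parity over the strips of one stripe. The function names `xor_parity` and `rebuild_missing` and the sample byte strings are hypothetical and chosen only for illustration; an actual flash controller would operate on whole flash pages or blocks and would layer ECC underneath.

```python
from functools import reduce


def xor_parity(strips):
    """Return the byte-wise XOR of equally sized strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))


def rebuild_missing(surviving_strips, parity):
    """Recover a single lost strip from the surviving strips plus the parity strip."""
    return xor_parity(list(surviving_strips) + [parity])


# Three hypothetical data strips forming one stripe.
data = [b"\x01\x02\x03\x04", b"\x10\x20\x30\x40", b"\xaa\xbb\xcc\xdd"]
parity = xor_parity(data)

# Pretend the strip at index 1 was lost and rebuild it from the rest.
recovered = rebuild_missing([data[0], data[2]], parity)
```

Because XOR is its own inverse, the same routine serves for computing parity and for reconstruction, which is why a single lost block per stripe can be tolerated in such a scheme.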
Garbage collection must be performed on block stripes, rather than blocks, in order to reconstruct the data in case data in one block are lost. This organization of stripes, together with the LPT, defines the placement of data. SSDs today use a log structured array (LSA) architecture, which combines these two techniques.
In write-out-of-place, a write operation will write new data to a new location in flash memory, thereby updating the mapping information and implicitly invalidating data at the old location. The invalidated data location cannot be reused until the entire block is garbage collected, which means any still valid data in the block must be relocated to a new location before the block can be erased. Garbage collection (GC) of a block is typically deferred as long as possible to reduce the number of valid pages that must be relocated. Upon garbage collection, pages that have to be relocated cause additional write operations; this is often denoted as write amplification.
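The interplay of out-of-place writes, implicit invalidation through the mapping table, and relocation during garbage collection can be sketched with a toy model. Everything in this sketch is an assumption made for illustration: the class name `TinyFlash`, the geometry of four blocks of four pages, the greedy victim selection (fewest valid pages), and the choice to hold back two blocks' worth of space so that a victim with at least one invalid page always exists. It is not a description of any particular controller.

```python
import random


class TinyFlash:
    """Toy out-of-place store with an LPT and greedy garbage collection."""

    def __init__(self, num_blocks=4, pages_per_block=4):
        self.pages_per_block = pages_per_block
        # Each block records which logical address was programmed into each page.
        self.blocks = [[] for _ in range(num_blocks)]
        self.lpt = {}  # logical address -> (block, page)
        self.open_block = 0
        self.free = list(range(1, num_blocks))
        # Assumption: two blocks' worth of space is held back (overprovisioning),
        # so GC can always find a victim with at least one invalid page.
        self.logical_pages = (num_blocks - 2) * pages_per_block
        self.user_writes = 0
        self.flash_writes = 0

    def _full(self, blk):
        return len(self.blocks[blk]) == self.pages_per_block

    def _valid(self, blk):
        # A page is valid only if the LPT still points at it.
        return [lba for page, lba in enumerate(self.blocks[blk])
                if self.lpt.get(lba) == (blk, page)]

    def _program(self, lba):
        if self._full(self.open_block):
            self.open_block = self.free.pop(0)
        blk = self.open_block
        self.blocks[blk].append(lba)
        # Updating the LPT implicitly invalidates the old location.
        self.lpt[lba] = (blk, len(self.blocks[blk]) - 1)
        self.flash_writes += 1

    def _collect(self):
        candidates = [b for b in range(len(self.blocks))
                      if b != self.open_block and b not in self.free]
        victim = min(candidates, key=lambda b: len(self._valid(b)))
        for lba in self._valid(victim):  # each relocation is an extra flash write
            self._program(lba)
        self.blocks[victim] = []  # erase the victim block
        self.free.append(victim)

    def write(self, lba):
        if not 0 <= lba < self.logical_pages:
            raise ValueError("logical address out of range")
        # Defer GC until the open block is full and only one free block remains.
        if self._full(self.open_block) and len(self.free) <= 1:
            self._collect()
        self.user_writes += 1
        self._program(lba)

    def write_amplification(self):
        return self.flash_writes / self.user_writes


rng = random.Random(0)  # fixed seed so the run is repeatable
flash = TinyFlash()
for _ in range(1000):
    flash.write(rng.randrange(flash.logical_pages))
wa = flash.write_amplification()  # flash writes per user write, never below 1.0
```

In this model, a workload that overwrites the logical space strictly in order leaves every victim block fully invalid, so no relocation occurs, whereas random overwrites leave valid pages scattered across blocks and relocation raises the ratio above one.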
Due to limited endurance of NAND flash devices, the reduction of write amplification is very important. In fact, with shrinking technology nodes in NAND flash, endurance is dropping, hence making any sort of write reduction or write elimination even more important. Note that the garbage collection unit of operation depends on the implementation details of the flash management logic, ranging from a flash block in a simple flash controller to a RAID stripe of flash blocks, referred to as a “block stripe,” in case the flash controller implements RAID functionality at the flash channel level, or any other organization of flash blocks (e.g., Reed-Solomon codes) that the flash controller implements.
Existing flash arrays on the market include a set of independent flash nodes or SSDs connected to a RAID controller. The flash nodes operate independently of each other and manage the flash memory space in an LSA fashion. The RAID controller therefore does not see physical block addresses (PBAs) of the flash directly, but logical addresses referred to herein as node logical block addresses (nodeLBAs). Hosts access the flash array through a Peripheral Component Interconnect Express (PCIe), Fibre Channel, or similar interface that connects to the RAID controller. The RAID controller maps the host logical block address (hostLBA) space seen by the hosts to a nodeLBA space in an implicit way that does not require maintaining a mapping table. This requires no additional metadata or control structures. A logical block address such as a hostLBA or nodeLBA typically addresses a data storage unit of 4 KiB, 8 KiB, or 512 bytes, and hence is not related to the flash block size. These data storage units are also known as logical pages, where one or more entire or partial logical pages fit into a physical flash page. Also, the RAID controller performs write-in-place updates because the LSA in each node below performs flash management functions transparently. However, in the case of small random writes, partial stripe writes cause two write operations for each user write operation: one for the data and another for the updated parity. As a result, for RAID-5 like schemes, small random writes add a factor of close to two to the system write amplification.
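An implicit, table-free mapping of hostLBAs to nodeLBAs can be sketched as pure arithmetic. The layout below is one hypothetical possibility, not the layout of any specific array: it assumes one logical page per strip and a parity strip that rotates across the nodes from stripe to stripe. The function names `map_host_lba` and `small_write_targets` are likewise illustrative.

```python
def map_host_lba(host_lba, num_nodes):
    """Return (data_node, node_lba, parity_node), computed without a mapping table."""
    data_strips = num_nodes - 1            # one strip per stripe holds parity
    stripe, offset = divmod(host_lba, data_strips)
    parity_node = stripe % num_nodes       # assumed rotation of parity across nodes
    # Data strips occupy the remaining nodes, skipping the parity node.
    data_node = offset if offset < parity_node else offset + 1
    return data_node, stripe, parity_node


def small_write_targets(host_lba, num_nodes):
    """Node writes caused by updating a single logical page in place."""
    data_node, node_lba, parity_node = map_host_lba(host_lba, num_nodes)
    return [(data_node, node_lba), (parity_node, node_lba)]


targets = small_write_targets(4, num_nodes=4)  # [(2, 1), (1, 1)]
```

Each single-page update yields two node writes, one for the data strip and one for the parity strip of the same stripe, which is the source of the factor of close to two noted above.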
If user writes are written to the nodes in an LSA fashion inside the RAID controller, data to be written can be grouped into containers to reduce write amplification. A container may build a single RAID stripe. As updated logical pages are written as full stripe writes by the RAID controller, the above-mentioned write amplification from RAID-5 is significantly reduced. For instance, if sixteen flash nodes are used, a RAID-5 scheme with fifteen data strips and one parity strip can be used, reducing the parity-related write overhead to 1/15 of the user data written, i.e., a write amplification factor of 16/15.
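The arithmetic behind the 1/15 figure can be checked with a short calculation. The sketch counts only write operations per stripe update (the data strips written plus one parity strip) and ignores any reads needed to recompute parity; the function name `raid5_write_factor` is hypothetical.

```python
from fractions import Fraction


def raid5_write_factor(data_strips, strips_updated):
    """Node writes per user write when strips_updated data strips of one stripe
    are written together and the parity strip is written once."""
    if not 1 <= strips_updated <= data_strips:
        raise ValueError("strips_updated must be between 1 and data_strips")
    return Fraction(strips_updated + 1, strips_updated)


small_write = raid5_write_factor(15, 1)    # Fraction(2, 1): data plus parity
full_stripe = raid5_write_factor(15, 15)   # Fraction(16, 15)
parity_overhead = full_stripe - 1          # Fraction(1, 15)
```

With fifteen data strips per stripe, a full stripe write spreads the single parity write over fifteen user writes, so the extra write traffic drops from one additional write per user write to one fifteenth of a write.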
Besides the ability to significantly reduce write amplification, stacking two levels of LSAs is also beneficial for data deduplication. An implicit mapping of hostLBAs to nodeLBAs would typically only allow less efficient node-level deduplication because deduplication is performed in each node independently. With an array-level LSA, deduplication can be performed at the array level. The higher-level LSA knows on which flash node the data of a particular deduplication object or logical page is stored.
Although simple stacking of two LSAs, one on the array level and the other in the flash nodes, seems to be promising, there are significant problems related to such an approach. One approach may be to organize the container size to match the geometry of the underlying node stripe size such that when the higher-level GC garbage collects a container, the corresponding underlying flash blocks are fully invalidated. The higher-level GC would do all relocation work while the lower-level GC would always see entirely invalid blocks being garbage collected. Unfortunately, even if the underlying geometry is known, those stripe sizes may be of variable length due to flash blocks being retired over time or failed planes (i.e., variable stripe RAID). For off-the-shelf SSDs, the geometry is usually unknown.
As it is very difficult to align the container size to the underlying node geometry, one can also simply perform garbage collection on the RAID controller and inside each node independently of each other. As those garbage collectors are running independently, additional write amplification is created. Worse, overprovisioning may be required on both levels, which wastes flash space or increases write amplification further.