A storage server is a special purpose computer system used to store and retrieve data on behalf of one or more clients on a network. A storage server operates on behalf of one or more clients to store and manage data in a set of mass storage devices, such as magnetic or optical storage-based disks or tapes. In conventional network storage systems, the mass storage devices may be organized into one or more groups of drives (e.g., redundant array of inexpensive disks (RAID)). A storage server also typically includes internal memory that is used as a buffer cache, to speed up the reading and writing of data from and to (respectively) the main mass storage system. In conventional storage servers, this buffer cache typically is implemented in the form of dynamic random access memory (DRAM).
A storage server may be configured to service file-level requests from clients, as in the case of file servers used in a network attached storage (NAS) environment. Alternatively, a storage server may be configured to service block-level requests from clients, as done by storage servers used in a storage area network (SAN) environment. Further, some storage servers are capable of servicing both file-level and block-level requests, as is the case with certain storage servers made by Network Appliance, Inc. of Sunnyvale, Calif.
It is desirable to improve the performance of storage servers, and one way to do so is by reducing the latency and increasing the random access throughput associated with accessing a storage server's main mass storage subsystem. In this regard, flash memory, particularly NAND flash memory, has certain very desirable properties. Flash memory generally has a very fast read access speed compared to that of conventional disk drives. Also, flash memory is substantially cheaper than conventional DRAM and is not volatile like DRAM.
However, flash memory also has certain characteristics that make it unfeasible simply to replace the DRAM or disk drives in a storage server with flash memory. In particular, conventional flash memories, such as flash solid-state drives (SSDs), include an on-board memory controller which implements a data layout engine. The data layout engine typically implements a log based system to decide where data should be written in flash and to identify locations in flash where desired data is stored. This internal data layout engine adds a non-trivial amount of overhead to the processes of reading and writing data, which tends to offset the performance gains that could otherwise be achieved by using flash.
In addition, while flash memory generally has superior read performance compared to conventional disk drives, its write performance is generally not as good. One reason is that each time a unit of flash memory is written, it must first be erased, which adds latency to write operations.
Furthermore, the smallest individually erasable unit in a flash memory, which is called a “block”, is generally much larger than the smallest individually writable unit, which is called a “page”; for example, a typical page (minimum writable unit) may be 2 kB while a corresponding block (minimum erasable unit) is 64 pages (e.g., 128 kB). Consequently, if a single 2 kB page were to be modified in flash, that would involve first reading back the entire 128 kB block that includes the page, erasing the entire 128 kB block, and then writing the entire 128 kB block back, including the modified version of the 2 kB page. This process is extremely inefficient in terms of latency. Further, this process causes wear on the flash memory cells, which typically have finite lifespans in terms of the number of erases that can be performed on them before failure.
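The cost of this read-erase-rewrite cycle can be made concrete with a small calculation. The following is a minimal sketch that uses only the illustrative sizes given above (2 kB pages, 64 pages per block); the function name `read_modify_write_cost` and its return fields are hypothetical and chosen for illustration only.

```python
# Illustrative sizes taken from the text; real devices vary.
PAGE_SIZE_KB = 2        # minimum writable unit (page)
PAGES_PER_BLOCK = 64    # pages per minimum erasable unit (block)


def read_modify_write_cost(pages_modified: int = 1) -> dict:
    """Estimate the in-place update cost for pages within ONE block.

    Hypothetical helper: assumes the whole block is read back, erased,
    and rewritten, as described in the paragraph above.
    """
    if not 1 <= pages_modified <= PAGES_PER_BLOCK:
        raise ValueError("pages_modified must fit within a single block")
    block_size_kb = PAGE_SIZE_KB * PAGES_PER_BLOCK
    logical_kb = pages_modified * PAGE_SIZE_KB
    return {
        "block_size_kb": block_size_kb,   # data read back before the erase
        "kb_rewritten": block_size_kb,    # data physically written afterwards
        "erases": 1,                      # one erase cycle of wear per update
        "write_amplification": block_size_kb / logical_kb,
    }


cost = read_modify_write_cost(1)
print(cost)
```

Under these assumptions, modifying a single 2 kB page causes 128 kB to be rewritten, a write amplification of 64, plus one erase cycle of wear on the block.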
In addition, conventional flash memory used in SSDs requires that writes be done in sequential page order within a block (whereas reads can be random). The SSD internally translates random writes that it receives into sequential writes, which can dramatically lower the performance of the SSDs. Even if sequential writes are performed to an SSD, this translation layer is used, which can increase overhead per unit of performance.
Furthermore, while flash memory generally has very good read performance compared to conventional disks, the latency associated with reads is often highly variable within any given system, even for a given flash chip. When one of today's SSDs is accessed with a mix of random read and write operations, this behavior can be observed in that approximately 5% of all reads return in 2-4 msec, whereas the other 95% return in an average of 850 μsec or less. It is believed that this variability is caused by random accesses to a flash device which is in the process of erasing, causing a delay in access to the data. In the case of SSDs, the initial access is much longer; hence, the delay caused by the erase is not amplified as much as in raw flash, but it still exists.
This variability does not lend itself well to predictable system behavior. To understand the cause of this variability, consider how conventional flash memory is normally implemented. NAND-based flash memory shall be discussed here for purposes of illustration.
In NAND-based flash devices, data is read and written in units of pages but erased in units of blocks. The page size varies between devices, but currently the page size is 2 kB and is expected to grow to 8 kB over the next few years. Block size is expected to grow similarly to maintain the ratio of 64 pages per block. Access to a flash memory occurs in two phases, which are referred to here as the operation and the data transfer. The data transfer is where data is transferred between an internal buffer in the flash chip and the system, over a bus interface on the flash chip. The operation can be defined as the transfer of data between the internal buffer and the NAND flash array, or any of various other operations, such as erasing a block.
Most flash devices provide for some minimum level of concurrency between data transfer and operations, by providing two or more memory planes. This configuration requires that overlapped operations be targeted at different memory planes. Operations targeted to the same plane must be processed sequentially.
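The effect of this per-plane serialization can be sketched with a toy scheduling model. This is a simplified illustration, not a description of any particular device: it assumes all operations are queued at time zero, models only the array operations (not the shared bus transfer), and uses illustrative durations in microseconds. The function name `completion_times` is hypothetical.

```python
def completion_times(ops, num_planes=2):
    """Return the finish time (in usec) of each queued operation.

    ops: list of (plane_index, duration_us) tuples in arrival order.
    Operations on the same plane run back to back; operations on
    different planes are assumed to overlap fully.
    """
    plane_free_at = [0] * num_planes      # time at which each plane goes idle
    finished = []
    for plane, duration_us in ops:
        start = plane_free_at[plane]      # wait for earlier work on this plane
        end = start + duration_us
        plane_free_at[plane] = end
        finished.append(end)
    return finished


# An erase on plane 0, then a page read on plane 0, then a page read on plane 1.
queue = [(0, 2000), (0, 20), (1, 20)]
print(completion_times(queue))
```

In this model the read that targets the same plane as the erase completes only after the erase has finished, while the read that targets the other plane is not delayed at all.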
Consider now the following illustrative access latencies associated with conventional flash memory. A 2 kB data transfer may take approximately 50 μsec for either a read or write to load the internal data buffer on the flash chip, while a read page operation may take approximately 20 μsec for that same 2 kB of data, and a write page operation may take approximately 200 μsec for that same 2 kB of data. The erase operation, as mentioned above, may erase 64 pages, or 128 kB, in about 2,000 μsec. A complete system read would take approximately 70 μsec to fully return the data. If another read were pending at that time, the total time would extend to 140 μsec. If a write or erase were in progress ahead of the read, the time could extend to 270 μsec in the case of a write or 2,070 μsec in the case of an erase. Having a 30-fold variability in the access time does not lend itself well to predictable behavior.
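The figures in the preceding paragraph can be reproduced with a short calculation. The sketch below uses only the illustrative latencies stated there; the function name `read_latency_us` is hypothetical. It assumes that a write ahead of the read contributes only its 200 μsec page-write operation, which is the assumption under which the 270 μsec figure works out.

```python
from typing import Optional

# Illustrative latencies from the text, in microseconds.
TRANSFER_US = 50        # 2 kB transfer between the chip buffer and the system
READ_PAGE_US = 20       # array-to-buffer read of one 2 kB page
WRITE_PAGE_US = 200     # buffer-to-array write of one 2 kB page
ERASE_BLOCK_US = 2000   # erase of one 64-page (128 kB) block


def read_latency_us(ahead: Optional[str] = None) -> int:
    """Total time for a read, given what is in progress ahead of it.

    ahead: None, "read", "write", or "erase" (hypothetical labels).
    """
    own = READ_PAGE_US + TRANSFER_US      # 70 usec for an unobstructed read
    wait = {
        None: 0,
        "read": own,                      # a full read completes first
        "write": WRITE_PAGE_US,           # assumes only the write operation remains
        "erase": ERASE_BLOCK_US,
    }[ahead]
    return own + wait


for case in (None, "read", "write", "erase"):
    print(case, read_latency_us(case))

ratio = read_latency_us("erase") / read_latency_us(None)
print(round(ratio, 1))
```

The ratio between the erase-blocked read (2,070 μsec) and the unobstructed read (70 μsec) is about 29.6, which is the source of the roughly 30-fold variability noted above.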
The above-mentioned performance times are based on floating gate NAND flash technology. Newer generation NAND flash devices are expected to be based on a charge trap design, which will allow smaller memory cells but at the cost of increased erase and write times. The increase in erase time may be many times that of current NAND flash devices. Such an increase will further exacerbate the read access time variability.