Flash memory is one type of non-volatile, rewritable memory commonly used in many types of electronic devices, such as USB drives, digital cameras, mobile phones, and memory cards. Flash memory typically stores information in an array of memory cells made from floating-gate transistors. In traditional single-level cell (SLC) devices, each cell stores only one bit of information. Some newer flash memory, known as multi-level cell (MLC) devices, can store more than one bit per cell by applying multiple levels of electrical charge to the floating gates of memory cells.
A NAND flash memory (referred to herein as “NAND memory”) is accessed by a host system much like a block device such as a hard disk or a memory card. Typically, the host system performs reads and writes to logical block addresses. A NAND memory is typically divided into blocks, and each block is generally organized into pages or sectors of cells. Blocks may typically be 16 KB in size, while pages may typically be 512, 2,048, or 4,096 bytes in size. Multi-level NAND cells make management of NAND memories more difficult, particularly in multithreaded real-time run-time environments.
In response, manufacturers have encapsulated NAND memories as memory devices in which a controller is placed in front of a raw NAND memory. The purpose of the controller is to manage the underlying physical characteristics of the raw NAND memory and to abstract the interface as a logical block device. This allows the NAND memory to provide a logical to physical translation map between logical block addresses (which are being accessed by a host system) and physical locations in the NAND memory, and to manage rules governing the logical to physical translation mapping internally via firmware in a NAND controller.
Reading and writing are asymmetric behaviors in NAND memories. To read a particular physical block, the address is programmed and the read operation is started. After an access time, the data is available. This process of reading blocks can be repeated ad infinitum (ignoring certain NAND disturb phenomena). Writing is different: a given block can essentially be written with data only once, so a write is not repeatable in the way a read is.
The initial condition of a NAND cell is to store a logical ‘1’. To write a data value, the cells that are to hold a ‘0’ are programmed and the cells that are to hold a ‘1’ are left alone. While it may be possible to continue to overwrite ‘1’ states with ‘0’ states, this is not generally useful. To fully enable the overwriting of a block, the initial condition must be established again. This operation is referred to as an erase cycle.
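The one-way nature of programming described above can be illustrated with a minimal sketch. It assumes a single byte stands in for a group of cells and models programming as a bitwise AND; the names `ERASED`, `program`, and `erase` are hypothetical and do not correspond to any real device interface.

```python
ERASED = 0xFF  # assumption: one byte models eight cells, all reading '1' after erase


def program(cell: int, data: int) -> int:
    """Programming can only clear bits (1 -> 0); it never sets a bit back to 1."""
    return cell & data


def erase() -> int:
    """An erase cycle restores the all-ones initial condition."""
    return ERASED


cell = erase()
first = program(cell, 0b10100101)    # first write to an erased cell lands intact
second = program(first, 0b11110000)  # a second write can only clear more bits

print(f"{first:08b}")   # 10100101
print(f"{second:08b}")  # 10100000, not the requested 11110000
```

The second write shows why overwriting in place is not generally useful: the stored value is the AND of everything written since the last erase.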
Using currently available NAND memories as an example, typical read access times are in the range of 25-50 microseconds, write cycle times are in the range of 200-700 microseconds, and erase cycle times are in the range of 2,000-3,000 microseconds. Clearly there is a tremendous variance in performance, depending on the exact circumstances.
In order to mitigate the vast difference between erase and read cycle times, write blocks are typically grouped together into erase blocks so that the time to erase is amortized over many write blocks, effectively reducing the erase time on a per-page basis. In addition, generally more read operations can be performed on a block than erase/write cycle pairs. While there are technological subtleties, reads are generally non-destructive. Erase/write cycle pairs, by contrast, tend to damage the storage cells because charge becomes trapped in the oxide of the floating-gate transistors. For this reason, erase/write cycle pairs should be algorithmically avoided or, when inevitable, balanced across all blocks. This latter mechanism is referred to as “wear leveling”.
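As a rough illustration of the two ideas in this paragraph, the sketch below computes an amortized erase cost per page and picks the least-worn block for the next erase. The figure of 64 pages per erase block and the lowest-erase-count policy are assumptions made for the example, not values or methods taken from any particular device.

```python
from typing import Dict, List


def amortized_erase_us(erase_us: float, pages_per_erase_block: int) -> float:
    """Erase time spread evenly over every page that one erase frees."""
    return erase_us / pages_per_erase_block


def pick_block_to_erase(erase_counts: Dict[int, int], candidates: List[int]) -> int:
    """Simplistic wear leveling: choose the candidate with the fewest erase
    cycles so far, breaking ties by the lowest block number."""
    return min(candidates, key=lambda block: (erase_counts.get(block, 0), block))


# A 2,000-microsecond erase spread over an assumed 64 pages per erase block.
print(amortized_erase_us(2000, 64))  # 31.25

# Blocks 1 and 2 are equally worn; block 1 wins the tie.
print(pick_block_to_erase({0: 5, 1: 2, 2: 2}, [0, 1, 2]))  # 1
```

Real wear-leveling schemes weigh other factors as well, such as how much live data a block still holds; this sketch only shows the balancing principle.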
Because of the impracticality of overwriting data (both because of the wear mechanism and erase block grouping), various techniques are used to virtualize the location of any given logical block. The current state of the art includes what is called a flash translation layer (FTL). This is a driver-level software layer that maintains temporary and permanent tables of the mapping between a given logical block number and the physical location of the block in the media. Because the FTL presents a logical block device to upper layers of software, any number of file systems may be implemented on top of it. Alternatively, a journaling file system may be implemented using the linear array of blocks. Here, the blocks are allocated in order of need and the device block allocation is managed as (essentially) a large circular buffer.
As alluded to above, data on NAND memories can be written in units of one page, but an erase is performed in units of one block. A page can be written only if the page is erased, and a block erase clears the data on all pages associated with the block. Because a NAND memory is write-once, pages are allocated in a block until all the pages in the block are used. Regardless of the specific implementation, obsolete or “overwritten” data in the NAND array is not truly overwritten but is simply marked, by any of a number of mechanisms, as obsolete or stale. Logically, a block that contains live data is called a valid block, and an “obsolete” block is one that contains obsolete or stale data. If a file is rewritten many times, for example, the result may be many obsolete blocks in the NAND array.
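The write-once allocation and stale marking described above can be sketched as follows. This is a minimal model under stated assumptions: the class `TinyFTL`, its page-granular map, and the three state labels are hypothetical, and pages are simply allocated in ascending order.

```python
class TinyFTL:
    """Toy logical-to-physical map: every write goes to a fresh erased page."""

    def __init__(self, num_blocks: int, pages_per_block: int):
        self.pages_per_block = pages_per_block
        self.total_pages = num_blocks * pages_per_block
        self.next_free = 0                           # next never-written page
        self.l2p = {}                                # logical page -> physical page
        self.state = ["erased"] * self.total_pages   # erased / live / stale

    def write(self, logical: int) -> int:
        if self.next_free >= self.total_pages:
            raise RuntimeError("no erased pages left; garbage collection needed")
        old = self.l2p.get(logical)
        if old is not None:
            self.state[old] = "stale"   # old copy is marked, not overwritten
        phys = self.next_free
        self.state[phys] = "live"
        self.l2p[logical] = phys
        self.next_free += 1
        return phys

    def stale_pages_in_block(self, block: int) -> int:
        start = block * self.pages_per_block
        return self.state[start:start + self.pages_per_block].count("stale")


ftl = TinyFTL(num_blocks=2, pages_per_block=4)
for _ in range(3):
    ftl.write(7)                        # rewrite the same logical page three times

print(ftl.l2p[7])                       # 2
print(ftl.stale_pages_in_block(0))      # 2
```

Rewriting one logical page three times consumes three physical pages and leaves two of them stale, which is how repeated file updates fill the array with obsolete data.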
When all (or nearly all) blocks contain data, blocks that were written earlier may contain stale, and therefore invalid, data. When the NAND memory is full or almost full, it becomes necessary to remove the stale data and efficiently pack the remaining valid data to make room in the NAND memory. This process is referred to as “garbage collection”.
FIG. 1 is a block diagram illustrating a conventional garbage collection on a NAND memory 10. The garbage collection process on the NAND memory 10 includes a pre-collection phase 12 and a post-collection phase 14. During the pre-collection phase 12, all the blocks to be erased, called erase blocks, are examined. Blocks that are stale can be erased as they are. Blocks that are not stale must be made stale by moving the data in the blocks, i.e., rewriting the data into a new area. Erase blocks to be erased as a group comprise an erase cluster 16. In this example, the erase cluster 16 includes three valid blocks and one obsolete block 18. The valid blocks are moved to respective blocks in free cluster 20. Because this movement requires free blocks to receive the data, garbage collection is not done when the NAND memory 10 is truly full, but is instead done when the block allocation crosses some threshold determined by translation-layer management requirements. After all blocks in the erase cluster 16 are made stale, the blocks are erased and made available during the post-collection phase 14, resulting in free cluster 22. The new beginning of the log 24 is the end of the free cluster 22, and the new end of the log 26 is the last block that was moved.
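The collection of an erase cluster with three valid blocks and one obsolete block can be sketched as below. The list-based model, the names `collect` and `needs_collection`, and the 75% threshold are assumptions chosen for illustration; the text does not specify a threshold value.

```python
ERASED = None        # an erased block, ready to be written
STALE = "<stale>"    # a block whose data has been superseded


def needs_collection(used_blocks: int, total_blocks: int, threshold: float = 0.75) -> bool:
    """Trigger collection before the memory is truly full (threshold is assumed)."""
    return used_blocks >= threshold * total_blocks


def collect(erase_cluster: list, free_cluster: list) -> int:
    """Move live blocks into free blocks, then erase the whole cluster.

    Returns the number of blocks that had to be moved."""
    live = [b for b in erase_cluster if b != STALE and b is not ERASED]
    free_slots = [i for i, b in enumerate(free_cluster) if b is ERASED]
    if len(live) > len(free_slots):
        raise RuntimeError("not enough free blocks to relocate live data")
    for slot, data in zip(free_slots, live):
        free_cluster[slot] = data          # pre-collection: rewrite valid data
    for i in range(len(erase_cluster)):
        erase_cluster[i] = ERASED          # post-collection: cluster is free again
    return len(live)


cluster = ["A", "B", STALE, "C"]           # three valid blocks, one obsolete
free = [ERASED] * 4
moved = collect(cluster, free)

print(moved)    # 3
print(free)     # ['A', 'B', 'C', None]
print(cluster)  # [None, None, None, None]
```

The `RuntimeError` path shows why collection cannot wait until the memory is completely full: with no erased blocks left, the live data has nowhere to go.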
Garbage collecting an erase block involves read-then-write operations: the block must first be read to determine its current state, and data may then have to be moved (i.e., good data is written elsewhere to make the current block stale). Garbage collection can therefore be quite time consuming. The garbage collection time is the sum of the erase time, the time to rewrite each block that must be moved, and the time for the other reads necessary to determine the block state. If erase blocks are garbage collected in groups/clusters as shown in FIG. 1, the erase time increases again in proportion to the number of blocks being garbage collected.
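A back-of-the-envelope version of this sum is sketched below. It assumes each read, rewrite, and erase costs a fixed time taken from the upper end of the ranges quoted earlier (50, 700, and 3,000 microseconds), and that operations run strictly one after another; the function name `gc_time_us` is hypothetical.

```python
def gc_time_us(erase_blocks: int, reads: int, rewrites: int,
               erase_us: float = 3000.0, read_us: float = 50.0,
               write_us: float = 700.0) -> float:
    """Garbage collection time as erase time plus rewrite time plus read time.

    Assumes fixed per-operation costs and no overlap between operations."""
    return erase_blocks * erase_us + rewrites * write_us + reads * read_us


# Cluster of four erase blocks: all four are read, three hold live data to move.
print(gc_time_us(erase_blocks=4, reads=4, rewrites=3))  # 14300.0
```

Under these assumptions the cluster takes 14,300 microseconds to collect, roughly twenty times a single worst-case write.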
Because it is not necessarily predictable to an application, operating system (OS), or a file system when a block driver needs to perform garbage collection, any throughput analysis must be able to tolerate a reasonably large asynchronous interruption in performance for the above described garbage collection. This is particularly true because in conventional systems, garbage collection is likely to be delayed until necessary.
In addition, a NAND memory (e.g., a NAND memory comprising a controller) typically maintains a translation cache to improve the speed of the logical to physical translations. The translation cache has a fixed number of entries that map the logical block addresses onto the physical addresses (e.g., NAND pages). The translation cache can be searched by a requested logical block address and the result is the physical address. If the requested address is present in the translation cache, the search yields a match very quickly, after which the physical address can be used to access the NAND memory. If the requested address is not in the translation cache, the translation proceeds by reading translation tables, which contain a larger set of translation entries, and are slower to access. These additional reads can delay performance of the NAND memory.
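The hit and miss paths described above can be sketched with a small fixed-size cache in front of a slower table. The least-recently-used eviction policy and the names `TranslationCache` and `lookup` are assumptions; the text only states that the cache has a fixed number of entries.

```python
from collections import OrderedDict


class TranslationCache:
    """Fixed-size logical-to-physical cache backed by a slower full table."""

    def __init__(self, capacity: int, table: dict):
        self.capacity = capacity
        self.table = table              # stands in for translation tables on NAND
        self.entries = OrderedDict()    # oldest entry first (assumed LRU policy)
        self.hits = 0
        self.misses = 0

    def lookup(self, lba: int) -> int:
        if lba in self.entries:
            self.hits += 1
            self.entries.move_to_end(lba)        # mark as most recently used
            return self.entries[lba]
        self.misses += 1
        phys = self.table[lba]                   # slow path: extra table reads
        self.entries[lba] = phys
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)     # evict least recently used
        return phys


cache = TranslationCache(capacity=2, table={0: 100, 1: 101, 2: 102})
results = [cache.lookup(lba) for lba in (0, 0, 1, 2, 0)]

print(results)                    # [100, 100, 101, 102, 100]
print(cache.hits, cache.misses)   # 1 4
```

The final lookup of address 0 misses because the entry was evicted to make room, which is the kind of extra table read that delays the memory.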
For a single-threaded application, such as in a digital still camera, NAND memory performance can be optimized according to the usage model, and in currently available products in the memory category (e.g., Compact Flash and SD Card) it often is. The camera usage model is to: 1) format a flash card; 2) take a picture, writing the data to the card as fast as possible (to minimize click-to-click time); 3) view random pictures to perform edits (e.g., deletion of unwanted pictures); and 4) mass transfer picture files to another host (such as a desktop or laptop computer). Only steps 2) and 4) have real-time performance requirements, and the usage of the storage is highly focused. When writing a new picture to the NAND memory, all the NAND memory has to do is sustain a sufficiently high write bandwidth. Conversely, when the NAND memory has to read picture files to transfer to a host, all the NAND memory is required to do is sustain a sufficiently high read bandwidth.
However, on more complex platforms there may be multiple streams being read from and written to the NAND memory, and each stream may have its own characteristics, including real-time requirements. Optimization is therefore not nearly so simple, because the requirements conflict.
Consider, as an example, a multithreaded environment in which two software applications are processing three file streams. A first application is recording a real-time media stream (either video or audio) onto the NAND memory, while the same application is also playing back either the same or a different media stream. (If the first application is playing back the same media stream, the first application plays back the media stream at an earlier time point in the stream.) Assume that the second application is an e-mail client that is receiving e-mail updates over an internet connection and synchronizing the in-box.
In this example, these two applications have different real-time requirements. The media streaming performed by the first application cannot be halted, whereas the e-mail synchronization performed by the second application has no a priori timing requirement. If the media stream write overflows, data will be lost. If the media stream read underflows, there will be gaps in the video or audio playback. If there are delays in the e-mail synchronization, however, the performance will be affected, but since this is demand driven, there is no loss of data.
Typically, media streams are taken from some kind of media source (e.g., over-the-air modem or stored media) at a constant packet rate. These packets may be stored in a ping-pong buffer to make the system resilient to variable latencies in some operations. Media stream data is written into the ping buffer until the ping buffer is full, and then the media stream data is written into the pong buffer. When the ping buffer is full, its data is read out and passed along to the next stage in the processing pipeline (e.g., the ping buffer is emptied by software that stores the data onto the NAND memory). If the pong buffer has not been emptied by the consumer by the time the producer has finished loading the ping buffer, there is an overflow situation. If the consumer needs the ping buffer before the ping buffer has been filled, there is an underflow situation.
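The overflow and underflow conditions can be made concrete with a small single-threaded model. This is a sketch under stated assumptions: the class `PingPong` and the exceptions `Overflow` and `Underflow` are hypothetical names, and a real system would run the producer and consumer concurrently with proper synchronization.

```python
class Overflow(Exception):
    """Producer filled a buffer while the other was still waiting to be drained."""


class Underflow(Exception):
    """Consumer asked for data before any buffer had been filled."""


class PingPong:
    def __init__(self, size: int):
        self.size = size
        self.buffers = [[], []]        # index 0 is "ping", index 1 is "pong"
        self.ready = [False, False]    # full and waiting for the consumer
        self.fill = 0                  # buffer the producer is loading

    def produce(self, packet) -> None:
        buf = self.buffers[self.fill]
        buf.append(packet)
        if len(buf) == self.size:
            other = 1 - self.fill
            if self.ready[other]:
                raise Overflow("other buffer has not been emptied")
            self.ready[self.fill] = True
            self.fill = other          # swap: start loading the other buffer

    def consume(self) -> list:
        for i in (0, 1):
            if self.ready[i]:
                data, self.buffers[i] = self.buffers[i], []
                self.ready[i] = False
                return data
        raise Underflow("no full buffer is available yet")


pp = PingPong(size=2)
pp.produce("p1")
pp.produce("p2")          # ping is full; producer switches to pong
print(pp.consume())       # ['p1', 'p2']
```

If the consumer stalls (for example, while the NAND memory is busy), the producer fills pong, wraps back to ping, and raises `Overflow` as soon as ping is full again.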
Large asynchronous garbage collection operations of NAND memories may complicate the real-time needs of real-time applications, such as in the media stream example. Garbage collection represents a worst-case deviation from the typical write access times of NAND memories, and this deviation can be extreme when compared to the typical result. Translation cache misses, while not as disruptive as garbage collection, also add to the performance uncertainty of NAND memories. The above scheme of using ping/pong buffers can accommodate large and variable latencies only if these latencies are bounded, and even then only at the expense of the buffers becoming very large. This places an additional burden on the platform, which now requires very large media buffers in order to accommodate an operating condition that rarely occurs.
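The cost of sizing buffers for the worst case can be estimated with simple arithmetic: each buffer must hold at least the data that arrives while the consumer is stalled. The sketch below assumes a hypothetical 2,000,000 byte-per-second stream, a 700-microsecond typical write, and a hypothetical 14,300-microsecond bound on a garbage collection stall; none of these figures describes a specific product.

```python
def buffer_bytes(rate_bytes_per_s: int, stall_us: int) -> int:
    """Smallest buffer (in bytes) that absorbs a stall of the given length,
    rounded up to a whole byte."""
    return (rate_bytes_per_s * stall_us + 999_999) // 1_000_000


typical = buffer_bytes(2_000_000, 700)       # sized for an ordinary write
worst = buffer_bytes(2_000_000, 14_300)      # sized for a bounded GC stall

print(typical)  # 1400
print(worst)    # 28600
```

With these assumed numbers, the buffer sized for the rare stall is about twenty times larger than one sized for a typical write, and if the stall has no known bound, no finite buffer size is sufficient.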
NAND memories lack an overall context to globally optimize garbage collection and translation cache pre-fetch processes because NAND memories do not have knowledge of the semantics of a given block operation.