The present invention generally relates to non-volatile data storage devices for use with computers and other processing apparatuses. More particularly, this invention relates to NAND flash-based solid state drives and performance optimizations thereof.
Mass storage devices such as advanced technology attachment (ATA) drives and small computer system interface (SCSI) drives are rapidly adopting non-volatile memory technology, such as flash memory or another emerging solid-state memory technology including phase change memory (PCM), resistive random access memory (RRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM) or organic memories. Currently, the most common solid-state technology uses NAND flash memory components as inexpensive storage memory, often in a form commonly referred to as a solid-state drive (SSD).
NAND flash memory comprises chains of floating gate transistors that store information by injecting electrons into the floating gate via Fowler-Nordheim tunneling. The charge on the floating gate then augments or counteracts the control voltage applied to the control gate. Consequently, the voltage level that must be applied to the control gate to cause the transistor to switch to a closed state corresponds to the bit value stored in the floating gate transistor, which constitutes one cell of NAND flash memory.
The earliest generations of NAND flash were able to store only a single bit in each cell; that is, because of relatively crude programming and sensing technology, only two levels of floating gate charge could be distinguished. This type of NAND flash memory is still used and is generally referred to as single level cell (SLC) flash memory. Although each cell stores only a single bit, the relatively low requirements on the accuracy of programming and sensing, in combination with advances in the control logic of the NAND flash memory device, allow the current generations of SLC NAND flash to have increased operating speeds, operate with extremely low bit error rates, and further exhibit improved write endurance characteristics.
The drawback of using SLC NAND flash memory is that the areal bit density is low. In contrast, multi-level cell (MLC) flash memory can store two bits in each cell by decoding four different switching voltage levels, and tri-level cells (TLC) can store three bits in each cell. The number of voltage levels that need to be distinguishable is 2^n, where n is the number of bits that can be stored in each cell. Accordingly, even though the name is an apparent misnomer, TLC NAND flash needs to have enough programming and sensing granularity to unambiguously identify eight different switching voltage levels.
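The relationship between bits per cell and the number of distinguishable levels (two raised to the number of bits) can be illustrated with a short Python sketch. The function name `voltage_levels` is hypothetical and chosen only for this illustration.

```python
def voltage_levels(bits_per_cell: int) -> int:
    """Number of switching voltage levels needed to store n bits in one cell (2**n)."""
    if bits_per_cell < 1:
        raise ValueError("bits_per_cell must be at least 1")
    return 2 ** bits_per_cell


# SLC, MLC and TLC as described in the text: 1, 2 and 3 bits per cell.
for name, bits in (("SLC", 1), ("MLC", 2), ("TLC", 3)):
    print(f"{name}: {bits} bit(s) per cell -> {voltage_levels(bits)} levels")
```

Each additional bit doubles the number of levels that programming and sensing must resolve, which is why TLC requires eight levels rather than three.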
Instead of programming each MLC or TLC memory cell in a single sweep to the desired charge of the floating gate, the lower and upper bits are programmed separately. Each programming sweep creates a page. Lower bits of one programming cycle form a lower page referred to as a least significant bit (LSB) page. Upper or most significant bits (MSB) form a logically separate page, i.e., the upper page, in a subsequent programming cycle. In the case of TLC, a third level of granularity is added, resulting in a third page.
As discussed above, each higher level of bit and page requires exponentially more levels of voltage to be unambiguously identifiable. Inherently, this means higher precision of both programming and sensing as well as better immunity against level shifting through near-field effects and/or drifting of the floating gate charges because of leakage currents. In combination, these factors create a scenario in which the higher precision and granularity come at the expense of exponentially longer programming intervals with each additional level. In practice, in an exemplary NAND flash MLC integrated circuit, programming the lower page to the desired bit value may require 500 μsec, whereas programming the upper pages requires 1,650-2,100 μsec for the lower and upper plane corresponding to even or odd page numbers, respectively. In TLC, this trend continues with the highest level programming times reaching up to 4,000 μsec.
Modern flash controllers such as those deployed in SSDs typically use multiple channels to interface with the NAND flash memory array at the storage back-end. The different channels can operate as individual units or in unison, but if several channels are working together as a group, their data transfers are synchronized. For example, if write or read operations are simultaneously executed over several channels, all channels will be part of the group and the controller will not issue a “Done” interrupt until all data are stable in the NAND flash. Inherently, this means that if data are written to a mixture of lower and upper pages, and if the lower-page writes are completed much faster than the upper-page writes, then the group characteristics will force the faster (lower-page) channels to wait until the slower (upper-page) channels have completed the write command.
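The effect of group synchronization can be sketched in Python as follows. This is a simplified model rather than a controller implementation: the function `group_write` is hypothetical, the 500 μsec lower-page figure is the exemplary value quoted earlier, and 2,100 μsec is assumed for the upper page as the high end of the quoted range.

```python
LOWER_PAGE_US = 500   # exemplary lower-page programming time from the text
UPPER_PAGE_US = 2100  # assumption: high end of the quoted 1,650-2,100 range


def group_write(channel_latencies_us):
    """Return (time until the Done interrupt, idle time per channel), in microseconds.

    The group completes only when its slowest channel completes.
    """
    if not channel_latencies_us:
        raise ValueError("a group needs at least one channel")
    done_at = max(channel_latencies_us)
    idle = [done_at - t for t in channel_latencies_us]
    return done_at, idle


# Eight channels: seven write lower pages, one writes an upper page.
mixed = [LOWER_PAGE_US] * 7 + [UPPER_PAGE_US]
done_at, idle = group_write(mixed)
print(done_at, sum(idle))
```

Under these assumed figures, a single upper-page write in the group holds the other seven channels idle for 1,600 μsec each.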
In most multi-channel flash controllers, the mixed group write discussed above, comprising write commands to both lower and upper pages, is not an exception but a typical situation, wherein a substantial amount of time is wasted in a “no channel left behind” implementation. Since all writes will be executed at the slowest channel's pace, write performance greatly suffers. With current MLC technology, this problem is only starting to emerge; however, with more widespread acceptance of TLC NAND flash at the latest, it will create a serious write performance bottleneck. Accordingly, it is of utmost importance to develop new strategies to avoid performance being dragged down by a speed mismatch among the pages to be written to.
First iterations of MLC flash simply alternated lower and upper pages; for example, all lower pages had even page numbers, whereas all upper pages had odd page numbers. This simple interleaving of upper and lower pages has been superseded by more sophisticated page pairing patterns wherein typically at least two upper and two lower pages are paired in logically consecutive page numbers. Of particular importance in this case is the introduction of dual plane NAND flash integrated circuits wherein two physical pages are accessed in parallel through each read or write command. Optionally, higher numbers of functionally equivalent pages can be paired or a higher number of any subset of pages can be used to create an offset at the low end of each block. However, the pairing pattern for all NAND flash integrated circuits is at the discretion of the specific integrated circuit vendor and may vary across different designs.
A common programming bit storing pattern and programming sequence is represented in FIG. 1. A fully erased sequence of MLC NAND flash memory cells in which two bits can be stored in each cell is organized into pairs of pages, wherein Page X(l) refers to the lower page and Page Y(u) refers to the upper page. The first programming command will write to the lower page only, as represented by the change in bit value from 1 to 0 in the lower programmed page. As a result, the switching voltage level L0 will be changed to L1 above the read threshold R1 in cells having lower bits programmed from 1 to 0. The second programming command will subsequently write the upper bits to the upper pages, wherein, if the bit value is changed from 1 to 0, the voltage level is changed either from L0 to L3 or from L1 to L2, depending on whether the bit value of the lower cell was 1 or 0. All switching levels are separated from each other by reference points (R1-R3). Programming the upper pages takes substantially longer than programming the lower pages with additional latencies encountered for the odd pages located in the upper plane (plane 1).
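The two-pass programming sequence described for FIG. 1 can be sketched in Python as a small state model, with levels L0 through L3 represented by the integers 0 through 3. The level transitions follow the description above; the function names and the `read_cell` decoding (which infers the bit pair from the resulting level) are assumptions made for illustration only.

```python
ERASED = 0  # L0: both lower and upper bit read as 1


def program_lower(level: int, lower_bit: int) -> int:
    """First pass: writing a 0 to the lower page moves the cell from L0 to L1."""
    if level != ERASED:
        raise ValueError("lower page must be programmed on an erased cell")
    return 1 if lower_bit == 0 else 0


def program_upper(level: int, upper_bit: int) -> int:
    """Second pass: writing a 0 to the upper page moves L0 to L3 or L1 to L2."""
    if level not in (0, 1):
        raise ValueError("upper page already programmed")
    if upper_bit == 1:
        return level  # bit stays 1, level unchanged
    return 3 if level == 0 else 2


def read_cell(level: int):
    """Assumed decoding: return (lower_bit, upper_bit) for levels L0..L3."""
    lower = 1 if level in (0, 3) else 0
    upper = 1 if level in (0, 1) else 0
    return lower, upper


for lower in (1, 0):
    for upper in (1, 0):
        level = program_upper(program_lower(ERASED, lower), upper)
        print(f"lower={lower} upper={upper} -> L{level}")
```

In this model the lower bit can be stored with a single coarse shift across R1, whereas the upper bit must place the charge precisely between neighboring reference points, which is consistent with the longer upper-page programming time noted above.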
In a typical write command, the host system writes several file system allocation units to the solid state drive. For purposes of this discussion, each file system allocation unit is considered to correspond to a page in a block of the NAND flash integrated circuit. The NAND flash controller features several channels, each of which is connected to a number of NAND flash integrated circuits, only one of which can be selected at any time via a chip enable signal and wherein the number of NAND flash integrated circuits that can be addressed by the flash controller equals the number of Chip Enable (CE) signal lines (unless bank-switching mechanisms expand the capacity of the array).
All channels enabled on a controller comprise a group and the highest performance for reading and writing data is achieved whenever an entire group is active, that is, all channels read or write simultaneously to one of the NAND flash integrated circuits.
One scenario often encountered in solid state drives using a plurality of channels for parallel access of multiple NAND flash integrated circuits is that the size of the data committed from the host does not align with the group boundaries. As a result, within the amount of time allowed for maintaining data within a volatile memory-based write combine buffer, not all channels are utilized and, by extension, not all pages within the write target blocks of the group are written to.
A single channel device is represented in FIG. 2. Within such a single channel device, comprising a single channel NAND flash controller with a single NAND flash integrated circuit, an initial write access of data will select one block of the array as the write target block and then write all pages within this block in sequential page order. The data written by the file system to logical page addresses (LPA AA-DD) are stored in a buffer which performs the logical to physical mapping (page 0/plane 0; P0/0) and commits the data to the lowest available page address in the write target block, which can be a single plane block or, as shown in the illustration, be configured as a dual plane block. In an empty or fresh block the starting physical page address will be page 0 and plane 0 (P0/0). When all pages in the write target block are filled, a new, empty block is selected as the next write target. It is important to note that a new write target block is only selected if no pages are left in the previous block. This means that at any given time, within the array of flash blocks, all blocks are either full or empty and only one single block is in the process of filling up. This scheme applies to host writes as well as to internal data movement for the purpose of refresh or space reclamation and garbage collection.
Arguably, there are exceptions to this simplified rule in that, for example, some blocks are used to store metadata about the actual user data; however, in the context of the present invention, those special blocks are not relevant.
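The sequential fill rule for a single channel device can be sketched in Python as follows. The class `SingleChannelAllocator` is hypothetical, models single-plane blocks only, and uses an arbitrarily small block size for the demonstration.

```python
class SingleChannelAllocator:
    """Hands out physical pages in sequential order from one write target block."""

    def __init__(self, num_blocks: int, pages_per_block: int):
        self.pages_per_block = pages_per_block
        self.free_blocks = list(range(num_blocks))  # empty blocks
        self.full_blocks = []                       # completely written blocks
        self.target = self.free_blocks.pop(0)       # the one block being filled
        self.next_page = 0

    def allocate(self):
        """Return (block, page) for the next write."""
        if self.next_page == self.pages_per_block:
            # A new write target is selected only when the current one is full.
            if not self.free_blocks:
                raise RuntimeError("no empty block left")
            self.full_blocks.append(self.target)
            self.target = self.free_blocks.pop(0)
            self.next_page = 0
        page = self.next_page
        self.next_page += 1
        return self.target, page


alloc = SingleChannelAllocator(num_blocks=3, pages_per_block=4)
placements = [alloc.allocate() for _ in range(6)]
print(placements)
```

After six writes, block 0 is full, block 1 is partially filled, and block 2 is still empty, matching the rule that only one block is in the process of filling up at any given time.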
An exemplary four-channel device is represented in FIG. 3. If multiple channels of flash are used in parallel, the workload is split across all channels in that the first page is written to the first NAND flash integrated circuit on channel 0 (channel 0, chip enable 0; CE0), the next page to the first integrated circuit on channel 1 (channel 1, CE0), and so on until the entire group comprising all channels has been written to. At this point the cycle starts over and the next page is written to the next CE (CE1) on channel 0. However, the start of a new group write cycle has to wait until the last channel in the group has finished the write execution and a “DONE” interrupt has been generated. During this time, even if, for example, seven out of eight channels have completed the command, those seven channels have to wait for the still-running command to finish, which means that they will sit idle.
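The striping order described above can be sketched in Python as a mapping from the running page index of a write to a channel and chip enable. The function `stripe` and the choice of two chip enables per channel are assumptions for illustration only.

```python
def stripe(index: int, channels: int, chip_enables: int):
    """Map the index-th page of a write to (channel, chip enable).

    Pages rotate across all channels first; after each full group,
    the next chip enable on every channel is selected.
    """
    channel = index % channels
    ce = (index // channels) % chip_enables
    return channel, ce


# Four channels as in the exemplary device, two chip enables assumed.
for i in range(9):
    channel, ce = stripe(i, channels=4, chip_enables=2)
    print(f"page {i}: channel {channel}, CE{ce}")
```

Pages 0 through 3 go to CE0 on channels 0 through 3, page 4 starts the next cycle on CE1 of channel 0, and page 8 returns to CE0 of channel 0.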
When random or small data sets are written to the flash array, then the individual pages of the respective write target blocks are simply written in a round robin scheme wherein only sub-maximal numbers of channels are active at any given moment. For example, in an eight-channel configuration, if three channels have been written to already, five additional page writes are necessary in order to bring all write targets back to the same page offset. As long as the subsequent writes constitute relatively small data followed by idle periods, there is a high chance that the write commands will fit into the “outstanding” channels, meaning that they will all be programmed in either LSB or MSB mode.
The situation is more complicated if relatively small amounts of data with partial utilization of the available channels in a group are followed by large amounts of sequential writes. For example, in FIG. 3, a chunk of data comprising 34 file system allocation units (0-33) is committed to the flash array. Each file system allocation unit corresponds to a page of the NAND flash memory integrated circuit. Since the end of the data chunk does not align with the group boundaries, two pages are written to CE0 of channel 0, which leaves six free pages with the same physical page address in the group. This physical page address is lower than the highest used physical page number in channel 0.
In this case, the sequential writes will start at the first available channel at page 34 and then wrap around to page 41 for full utilization of all channels regardless of whether the next available pages on the write target blocks are LSB or MSB pages. As a result, each write command will contain a mixture of LSB and MSB pages but will be forced to operate at MSB speed because the “DONE” interrupt can only be issued after all channels have completed the write cycle.
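The misalignment scenario can be sketched in Python under several simplifying assumptions that are not taken from the figure: an eight-channel group (consistent with two used and six free pages at the end of a 34-unit write), one write target block per channel, the simple pairing in which even page offsets are LSB pages and odd offsets are MSB pages, and exemplary programming times of 500 and 2,100 μsec. All function names are hypothetical.

```python
CHANNELS = 8                   # assumption: eight-channel group
LSB_US, MSB_US = 500, 2100     # assumed exemplary programming times


def page_type(offset: int) -> str:
    """Assumed simple pairing: even page offsets are LSB, odd offsets are MSB."""
    return "LSB" if offset % 2 == 0 else "MSB"


def next_offsets(units_written: int, channels: int = CHANNELS):
    """Next free page offset on each channel after a round-robin write."""
    full_rows, remainder = divmod(units_written, channels)
    return [full_rows + 1 if ch < remainder else full_rows
            for ch in range(channels)]


def next_group_write(units_written: int, channels: int = CHANNELS):
    """Page types and group latency of the next write that uses all channels."""
    types = [page_type(o) for o in next_offsets(units_written, channels)]
    latency = max(MSB_US if t == "MSB" else LSB_US for t in types)
    return types, latency


types, latency = next_group_write(34)
print(types, latency)
aligned_types, aligned_latency = next_group_write(32)
print(aligned_types, aligned_latency)
```

After 34 units, channels 0 and 1 are one page ahead of the other six, so the next full group write mixes two MSB pages with six LSB pages and completes at MSB speed. After an aligned write of 32 units, all eight channels would program LSB pages and complete at LSB speed.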
In view of the above, it can be appreciated that there are certain problems, shortcomings or disadvantages associated with the prior art, and that it would be desirable if an improved method were available for writing data to non-volatile solid state memory-based mass storage devices that was capable of at least partly overcoming or avoiding these problems, shortcomings or disadvantages.