Mass storage systems have for many years been based on hard disk drives (HDDs), in which spinning magnetic discs are accessed using read/write heads mounted on an actuator arm that is moved in an arc across the discs. Several discs, or platters, are mounted on a common spindle, and a read/write head is provided for each disc surface, also on a common arm. To access a particular piece of data, the drive controller may need to move the actuator arm to the correct position on the disc and may then have to wait for up to one drive rotation to occur. The discs spin in unison and the heads also move in unison. Thus, while parallel access is possible, the surface of each disc is not independently accessible.
In spite of the limitations of non-independent access, improvements in the efficiency of data access have been made. A drive controller can maintain a queue of commands and decide for itself the order in which to execute them so as to minimize the movement of the actuator arm and the number of disc rotations that occur, and thereby minimize the access time to the data. Such a scheme is described in a joint whitepaper by Intel Corporation and Seagate Technology entitled “Serial ATA Native Command Queueing, An Exciting New Performance Feature For Serial ATA” published in July 2003. Native Command Queueing (NCQ) therefore re-orders the commands and returns the data not in the order in which the commands were originally queued, but in an order that minimizes the access time for the commands taken as a group. NCQ was incorporated into the ATA/ATAPI-7 standard in 2003.
FIG. 1 shows a schematic representation of a hard disk in the prior art. The disk platter 100 is segmented into concentric rings called tracks (110, 120, 130, 140), to which data may be written and from which data may be read. The disk is accessed through the rotation 200 of the platter and an actuator arm 190, with a read/write head at the tip 195. By way of example, data is shown located on the tracks, with data sector 150 on track 110, data sector 160 on track 120, data sector 170 on track 130 and data sector 180 on track 140. If commands to these sectors were issued in this order, the rotation of the platter and the positioning of the read/write head would cause the head to follow the path shown by the dotted line 197.
FIG. 2 shows a hard disk in the prior art with the same placement of data, but where NCQ (Native Command Queueing) is employed, so that the commands are first queued and then issued in a sequence which reduces the overall time spent moving from one sector to another. In this case, the path 190 taken by the head shows that sector 150 is transferred first, then sector 170, then sector 160, and finally sector 180. The path taken (and the time taken to traverse the path) in this case is much shorter than that shown in FIG. 1, and so the data for the sectors may be transferred at a greater overall data rate, as the time spent moving from sector to sector is shortened.
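The effect of re-ordering can be illustrated with a minimal sketch. It is not the algorithm of any actual drive: the timing constants, the sector positions, the greedy "shortest access time first" policy and all names (Command, access_time, reorder_greedy) are assumptions chosen only so that the example reproduces the order of FIG. 2. Transfer time is ignored, and positions are expressed as a track index and a fraction of a revolution.

```python
from dataclasses import dataclass

# Assumed, illustrative timings; real drives differ.
SEEK_MS_PER_TRACK = 1.0   # time to move the arm across one track
ROTATION_MS = 8.0         # time for one full platter revolution


@dataclass(frozen=True)
class Command:
    tag: int      # here the figure's sector numeral is reused as the tag
    track: int    # target track index (0 = first track)
    angle: float  # angular position of the sector, as a fraction of a turn


def access_time(track, angle, cmd):
    """Seek to cmd.track, then wait for cmd.angle to rotate under the head."""
    seek = abs(cmd.track - track) * SEEK_MS_PER_TRACK
    # The platter keeps turning while the arm is moving.
    angle_after_seek = (angle + seek / ROTATION_MS) % 1.0
    wait = ((cmd.angle - angle_after_seek) % 1.0) * ROTATION_MS
    return seek + wait


def total_time(order, track=0, angle=0.0):
    """Total positioning time when commands are executed in the given order."""
    elapsed = 0.0
    for cmd in order:
        elapsed += access_time(track, angle, cmd)
        track, angle = cmd.track, cmd.angle
    return elapsed


def reorder_greedy(queue, track=0, angle=0.0):
    """Repeatedly pick the queued command that can be reached soonest."""
    pending, order = list(queue), []
    while pending:
        nxt = min(pending, key=lambda c: access_time(track, angle, c))
        pending.remove(nxt)
        order.append(nxt)
        track, angle = nxt.track, nxt.angle
    return order


# Hypothetical sector positions, queued in the order of FIG. 1.
QUEUE = [
    Command(150, track=0, angle=0.125),
    Command(160, track=1, angle=0.0),
    Command(170, track=2, angle=0.5),
    Command(180, track=3, angle=0.25),
]

if __name__ == "__main__":
    reordered = reorder_greedy(QUEUE)
    print([c.tag for c in reordered])  # [150, 170, 160, 180]
    print(total_time(QUEUE))           # 18.0 (queued order)
    print(total_time(reordered))       # 10.0 (re-ordered)
```

With these assumed positions the queued order costs 18 ms of positioning time, while the re-ordered sequence costs 10 ms. A greedy choice is used only because it is short to state; it is not guaranteed to find the best possible order.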
In recent years, the use of non-volatile flash memory has become more and more prevalent. One use has been to build emulations of hard disk drives, known as Solid State Drives (SSDs), or as Solid State Disks. To allow SSDs to be used as a direct replacement for HDDs, they must offer the same interface and adhere to the same standards as HDDs, including Serial Advanced Technology Attachment (SATA) and ATA/ATAPI, the latter of which includes support for NCQ.
A typical SSD comprises a memory controller and a number of memory devices. The memory controller provides an interface to the host system on one side and multiple interfaces to memory devices on the other. The memory device interfaces can operate independently in parallel and are termed ‘channels.’ On each memory channel there may be attached multiple devices arranged on a parallel data bus, which are termed ‘banks.’ While only one data transfer, read or write, may be active on a given channel at any one time, memory operations which do not require data transfer may proceed in parallel. Therefore, one memory bank may be executing an erase operation, which does not require any data transfer, while a second bank executes a program operation, which requires only a short data transfer period compared to its longer programming period, and a third bank executes a read operation during the erase and programming periods of the other banks.
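The overlap of bank operations on one channel can be sketched as a small timing model. All durations below are assumed round numbers in microseconds, not figures for any real flash device, and the names (Op, erase, program, read, makespan_us) are hypothetical. The model is deliberately simple: the shared bus is reserved in list order, so a later operation never uses a bus gap left by an earlier one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    bank: int
    xfer_in: int   # bus time before the internal phase (e.g. program data)
    busy: int      # internal phase; the channel bus is not used
    xfer_out: int  # bus time after the internal phase (e.g. read data)


# Assumed, illustrative durations in microseconds.
def erase(bank):
    return Op(bank, xfer_in=0, busy=2000, xfer_out=0)


def program(bank):
    return Op(bank, xfer_in=50, busy=600, xfer_out=0)


def read(bank):
    return Op(bank, xfer_in=0, busy=50, xfer_out=50)


def makespan_us(ops):
    """Finish time when banks work in parallel but share one data bus."""
    bus_free = 0    # time at which the channel bus is next available
    bank_free = {}  # per-bank time at which the bank is next idle
    finish = 0
    for op in ops:
        t = bank_free.get(op.bank, 0)
        if op.xfer_in:
            t = max(t, bus_free) + op.xfer_in  # wait for the bus, transfer
            bus_free = t
        t += op.busy                           # internal phase, bus is free
        if op.xfer_out:
            t = max(t, bus_free) + op.xfer_out
            bus_free = t
        bank_free[op.bank] = t
        finish = max(finish, t)
    return finish


def serial_us(ops):
    """Finish time if every operation had to wait for the previous one."""
    return sum(op.xfer_in + op.busy + op.xfer_out for op in ops)


if __name__ == "__main__":
    ops = [erase(0), program(1), read(2)]
    print(makespan_us(ops))  # 2000: program and read hide behind the erase
    print(serial_us(ops))    # 2750
```

Under these assumptions the program and read operations complete entirely within the erase period of the first bank, so the three operations take no longer than the erase alone.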
In order to provide high speed data transfer, the controller will typically arrange that a single data access request from the host will be satisfied by multiple data accesses to multiple banks on multiple channels. This is similar to the concept of data striping used in RAID 0 systems, where a data transfer is split across multiple storage devices, allowing each storage device to transfer data in parallel, thereby effectively multiplying the data transfer rate by the number of storage devices used in parallel.
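As a rough sketch of such striping, the following splits one host request into sub-operations on different channels and banks. The geometry (4 channels, 4 banks per channel, 8 sectors per stripe unit) and the static round-robin mapping are assumptions made for illustration; an actual controller would typically map addresses dynamically, and the function names are hypothetical.

```python
# Assumed, illustrative geometry.
NUM_CHANNELS = 4
BANKS_PER_CHANNEL = 4
STRIPE_SECTORS = 8  # sectors held by one stripe unit


def locate(lba):
    """Map a logical block address to a (channel, bank) pair, round-robin."""
    unit = lba // STRIPE_SECTORS
    channel = unit % NUM_CHANNELS
    bank = (unit // NUM_CHANNELS) % BANKS_PER_CHANNEL
    return channel, bank


def split_request(start_lba, count):
    """Split a host request into (channel, bank, first_lba, sectors) parts."""
    subs = []
    lba, end = start_lba, start_lba + count
    while lba < end:
        # A sub-operation never crosses a stripe-unit boundary.
        unit_end = (lba // STRIPE_SECTORS + 1) * STRIPE_SECTORS
        n = min(end, unit_end) - lba
        channel, bank = locate(lba)
        subs.append((channel, bank, lba, n))
        lba += n
    return subs


if __name__ == "__main__":
    for sub in split_request(start_lba=4, count=32):
        print(sub)
    # (0, 0, 4, 4)
    # (1, 0, 8, 8)
    # (2, 0, 16, 8)
    # (3, 0, 24, 8)
    # (0, 1, 32, 4)
```

In this example a 32-sector request touches all four channels, so up to four transfers can proceed at once; the first and last parts are short because the request does not start on a stripe-unit boundary.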
FIG. 3 shows a simplified schematic of a basic SSD where a flash memory controller 300 accesses a number of flash memory devices 320-323 via a single data channel 325. At any one time, only a single data transfer 326 may take place between the controller and a memory device, while in parallel any number of operations which do not involve data transfer (327a, 327b and 327c) may take place within the memory devices.
Having only a single data transfer active at any one time means that the I/O performance of such an SSD is severely restricted, certainly not enough to drive a high speed drive interface standard, such as SATA or SAS, to capacity. In order to improve the efficiency and data transfer rates, it is more common to drive multiple data channels in parallel, as shown in FIG. 3B. Here, the flash memory controller 300 connects to memory devices 320-323, 330-333, 340-343 and 350-353. Although only 4 separate channels are shown here, controllers with 8, 16 and 32 separate parallel flash channels have been implemented. Operating in much the same way as hard disks in a RAID configuration, independent simultaneous data transfers to 4, 8, 16 or 32 flash devices are possible, thereby increasing data transfer throughputs over what can be achieved to a single device on a single flash channel.
Unlike a hard disk drive, there is no particular benefit in locating data in different areas of memory (different memory devices) as the access time to any device is the same as any other. However, it may be beneficial, for example when updating or re-writing data, to perform an erase operation on a different device while transferring data to/from a device as erasing on one device and data transfer to/from another may be performed in parallel, as the former does not involve data transfer on the data channel.
NCQ was designed primarily to cope with the access latencies of HDDs. Data access to SSDs also involves latencies, but these result from the use of solid state memory devices rather than rotating magnetic discs. Therefore, systems and methods which may have been developed to optimize data access to HDDs are no longer relevant to optimizing data access to SSDs.
There are numerous difficulties in employing an NCQ implementation in an SSD. Consider first an HDD implementation. When multiple commands are presented in a queue in an NCQ implementation with an HDD, an algorithm may attempt to order the commands based on the starting logical block address (LBA) of each command. A fixed mapping exists between the LBA and the platter, and the physical location on that platter, used for the data. Based on the current position of the actuator arm and the current rotational position, it can be calculated to a fair degree of certainty how long the actuator arm may take to move to a different position. From the starting position, the data transfer speed may also be calculated, as may the ending position of the read/write head after the transfer. Therefore, the complete time taken to execute several commands in sequence may also be determined with a fair degree of certainty.
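A fixed mapping of this kind can be illustrated with the classic cylinder/head/sector (CHS) conversion. This is an idealized sketch: the geometry constants are assumed values, the function names are hypothetical, and drives that vary the number of sectors per track across the platter do not follow so simple a formula.

```python
# Assumed, illustrative geometry.
HEADS_PER_CYLINDER = 16  # one head per platter surface
SECTORS_PER_TRACK = 63


def lba_to_chs(lba):
    """Return (cylinder, head, sector); sectors are numbered from 1."""
    cylinder, rem = divmod(lba, HEADS_PER_CYLINDER * SECTORS_PER_TRACK)
    head, sector0 = divmod(rem, SECTORS_PER_TRACK)
    return cylinder, head, sector0 + 1


def chs_to_lba(cylinder, head, sector):
    """Inverse mapping: the same LBA always lands on the same location."""
    return (cylinder * HEADS_PER_CYLINDER + head) * SECTORS_PER_TRACK + (sector - 1)


if __name__ == "__main__":
    print(lba_to_chs(0))     # (0, 0, 1)
    print(lba_to_chs(63))    # (0, 1, 1): next surface, same arm position
    print(lba_to_chs(1008))  # (1, 0, 1): the arm must move one cylinder
```

Because the mapping is fixed, the arm movement and rotational wait for any LBA can be estimated in advance from the current head position, which is what makes ordering by LBA workable on an HDD.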
By contrast, in an SSD, there may be a mapping from an LBA to a physical position within a memory device, but in an SSD memory controller with multiple memory channels, the access time to a given memory bank on a given memory channel is largely independent of the bank or channel position, as electrically all the channels and banks are equivalent. Rather, the access time to a bank on a given channel is more likely to be affected by other operations already proceeding on other banks on that memory channel, and the data either already present in the memory or being presented for writing.
While some memory operations, such as a read data operation, may be fairly consistent and determinate in terms of elapsed time, other operations such as erase and programming may be quite variable in elapsed time. These variations may be due to the previous contents of the memory locations concerned when erasure (which is the first stage of writing data) is performed, or may depend on the data itself when programming (which is the second stage of writing data). This is because erasure requires the removal of charge from the memory cells and programming requires the addition of charge to the memory cells. Both operations may be conducted in a series of discrete steps, and the number of steps will depend on the initial charge state of the cell in the case of erasure, or on the final charge state in the case of programming.
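The dependence of elapsed time on charge state can be sketched with a toy model in which charge is added or removed one unit per step. The step durations, the integer charge levels and the function names are all assumptions for illustration; they are not taken from any device datasheet.

```python
# Assumed, illustrative step durations in microseconds.
PROGRAM_STEP_US = 25  # one program pulse plus verify
ERASE_STEP_US = 500   # one erase pulse plus verify


def program_time_us(target_levels):
    """Cells start erased (level 0) and are pulsed together.

    Pulsing continues until the cell with the highest target level is done,
    so the elapsed time depends on the data being written.
    """
    return max(target_levels, default=0) * PROGRAM_STEP_US


def erase_time_us(initial_levels):
    """Pulsing continues until the most highly charged cell is empty,
    so the elapsed time depends on the previous contents."""
    return max(initial_levels, default=0) * ERASE_STEP_US


if __name__ == "__main__":
    print(program_time_us([1, 3, 2]))  # 75
    print(program_time_us([1, 1, 1]))  # 25: same operation, different data
    print(erase_time_us([7, 2]))       # 3500
    print(erase_time_us([1, 1]))       # 500
```

Even in this simplified model, two commands of the same type and size can differ in duration several times over, which is the variability that makes completion times hard to predict.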
When multiple memory operations proceed in parallel to multiple banks on the same channel, the processing of an operation on one bank can interfere with the processing and timing of another operation on a different bank. Also, given that a single data operation from the host may be split into several sub-operations on different channels, the processing and timing of these sub-operations on each channel may vary according to other sub-operations proceeding or queued in parallel on the same channel.
Therefore, it becomes very difficult to model with any degree of certainty and accuracy the interactions of many sub-operations proceeding on multiple channels and banks, with the result that it might seem to be impossible to achieve any performance improvements by using NCQ in an SSD. The most efficient queueing and submission of data commands is the prime objective of NCQ when using an HDD, as once the command is submitted the writing or reading of data is subject to a determinate and predictable process to the point of completion. But with an SSD the command may be split into multiple sub-commands, each possibly directed to a different device or die within a device, which may be subject to different unpredictable delays and variations, with the result that the completion of each sub-command may occur in an apparently random and unpredictable manner.
SSDs behave very differently from HDDs, and existing hardware interfaces and protocols designed originally for HDDs, such as SATA and ATA/ATAPI, do not exploit the new characteristics of SSDs. In recognition of this, a new protocol called NVM Express (NVM standing for Non Volatile Memory) was developed, based on the PCI (Peripheral Component Interconnect) Express standard; the two are commonly referred to as NVMe and PCIe respectively. The NVMe protocol specification defines provision for up to 65,535 separate queues, each with up to 64K queue entries, which can exploit the parallelism that exists in SSDs.
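A single such queue can be pictured as a circular buffer with a tail advanced by the host and a head advanced by the controller. The sketch below is a simplified software model under that assumption only: the class and method names are hypothetical, and doorbell registers, completion queues and command formats are omitted.

```python
MAX_QUEUE_ENTRIES = 65536  # 64K entries per queue, the limit cited above


class SubmissionQueue:
    """Simplified circular submission queue (illustrative, not a driver)."""

    def __init__(self, size):
        if not 2 <= size <= MAX_QUEUE_ENTRIES:
            raise ValueError("queue size out of range")
        self.size = size
        self.slots = [None] * size
        self.head = 0  # next entry the controller will fetch
        self.tail = 0  # next free slot the host will fill

    def is_empty(self):
        return self.head == self.tail

    def is_full(self):
        # One slot is kept unused so that full and empty can be told apart.
        return (self.tail + 1) % self.size == self.head

    def submit(self, command):
        """Host side: place a command in the queue; False if it is full."""
        if self.is_full():
            return False
        self.slots[self.tail] = command
        self.tail = (self.tail + 1) % self.size
        return True

    def fetch(self):
        """Controller side: take the oldest command, or None if empty."""
        if self.is_empty():
            return None
        command = self.slots[self.head]
        self.head = (self.head + 1) % self.size
        return command


if __name__ == "__main__":
    # Several independent queues, e.g. one per host processor core (assumed).
    queues = [SubmissionQueue(1024) for _ in range(4)]
    queues[0].submit("read lba=0")
    queues[3].submit("write lba=4096")
    print(queues[0].fetch())  # read lba=0
    print(queues[3].fetch())  # write lba=4096
```

Because each queue is independent, commands placed in different queues do not have to be serialized against one another, in contrast to the single command queue of NCQ.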
There is therefore a need for systems and methods which can optimize the data access to solid state memory devices in order to reduce the access latencies of a set of queued commands, particularly in NCQ or NVMe queueing implementations.