A typical data storage system includes a host system, a storage controller that is in communication with the host system, and a storage array that is in communication with the storage controller. The storage array typically includes an array of physical disks (PDs), which may be hard disk drives (HDDs), solid state disks (SSDs) or similar persistent storage units. A storage array can be configured to allow large amounts of data to be stored and accessed efficiently. FIG. 1 illustrates a block diagram of a typical data storage system 2. The system 2 includes a host system 3, a storage controller 4, a Peripheral Component Interconnect Express (PCIe) bus 5, and an array of PDs 9. The host system 3 includes a system central processing unit (CPU) 11 and a system memory device 12. The storage controller 4 includes a controller CPU 6, a controller dynamic random access memory (DRAM) device 7, and an I/O interface device 8. The I/O interface device 8 is configured to perform data transfer in compliance with known data transfer protocol standards, such as the Serial Attached SCSI (SAS) standard, the Serial Advanced Technology Attachment (SATA) standard, or the Nonvolatile Memory Express (NVMe) standard. The I/O interface device 8 controls the transfer of data to and from the PDs 9.
The storage controller 4 communicates via the bus 5 with the system CPU 11 and with the system memory device 12. The system memory device 12 stores data and software programs for execution by the system CPU 11. During a typical write action, the system CPU 11 runs a memory driver software stack 14 that stores commands and data in the system memory device 12. The commands and data are subsequently transferred via the bus 5 to the storage controller 4 and written by the controller CPU 6 to the controller DRAM device 7. The storage controller 4 may include a direct memory access (DMA) device 13 for transferring commands and data from the system memory device 12 into the controller DRAM device 7.
When the controller CPU 6 completes a command or a string of commands, it posts a hardware interrupt to the host system 3 to notify the host system 3 that the command or string of commands has been completed. In storage systems that use the Nonvolatile Memory Express (NVMe) PCIe Host Controller Interface specification, interrupts are posted to the host system 3 by the storage controller 4 to inform the host system 3 that the storage controller 4 has completed a number of commands and has posted the associated completion entries in the corresponding Completion Queues (CQs) located in the system memory device 12.
The storage controller 4 may be configured to post interrupts in multiple modes. Among these modes, a mode known as Message Signaled Interrupts (MSI) or its extension MSI-X, the latter of which is defined in a standard known as PCI 3.0, is recommended to enable higher performance, lower latency, and lower CPU utilization for processing interrupts. MSI-X allows a device to generate multiple separate interrupts, which are managed by a host on a per-vector basis. Each interrupt is associated with an interrupt vector, and each interrupt vector is associated with one or more CQs in the system memory device 12. For each interrupt vector, the host system 3 maintains a respective MSI-X mask that comprises bits that are set or cleared by the host system 3.
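By way of illustration only, the per-vector bookkeeping described above can be modeled in a few lines of Python. The names `InterruptVector`, `MsixTable`, and `vector_for_cq` are hypothetical and are not taken from any specification or driver; the sketch assumes a single mask bit per vector and a simple list of CQ identifiers per vector.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InterruptVector:
    """Hypothetical model of one MSI-X interrupt vector."""
    cq_ids: List[int]      # CQs in system memory served by this vector
    masked: bool = False   # per-vector mask bit, set or cleared by the host


class MsixTable:
    """Hypothetical host-side view of the per-vector MSI-X masks."""

    def __init__(self) -> None:
        self.vectors: Dict[int, InterruptVector] = {}

    def add_vector(self, vector_id: int, cq_ids: List[int]) -> None:
        self.vectors[vector_id] = InterruptVector(list(cq_ids))

    def set_mask(self, vector_id: int) -> None:
        self.vectors[vector_id].masked = True

    def clear_mask(self, vector_id: int) -> None:
        self.vectors[vector_id].masked = False

    def vector_for_cq(self, cq_id: int) -> int:
        # Find the vector that the given CQ is associated with.
        for vector_id, vector in self.vectors.items():
            if cq_id in vector.cq_ids:
                return vector_id
        raise KeyError(cq_id)


table = MsixTable()
table.add_vector(0, [1])      # vector 0 serves a single CQ
table.add_vector(1, [2, 3])   # vector 1 serves two CQs
table.set_mask(1)             # host masks vector 1 only
```

In this model, masking vector 1 affects interrupts for CQs 2 and 3 while leaving vector 0 free to signal completions in CQ 1.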
When the storage controller 4 completes a command, it posts a completion in an associated CQ of the system memory device 12 and posts an MSI-X interrupt to the host system 3 via a Memory Write Transaction Layer Packet (MWr TLP). The host system 3 maintains the MSI-X mask. When the MSI-X mask associated with an interrupt vector is set, this informs the storage controller 4 that it is not to send any interrupts associated with that particular interrupt vector to the host system 3. When the MSI-X mask associated with an interrupt vector is cleared, this informs the storage controller 4 that it is allowed to send an interrupt associated with that particular interrupt vector to the host system 3.
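The gating behavior described above can be sketched as follows. This is a minimal model under stated assumptions, not an actual controller implementation: the class `Controller`, its `complete` method, and the `sent` list (a stand-in for MWr TLPs delivered to the host) are hypothetical names introduced for illustration.

```python
class Controller:
    """Hypothetical model of a controller that honors the MSI-X mask."""

    def __init__(self):
        self.mask = {}   # vector_id -> True if the host has set the mask
        self.cq = {}     # vector_id -> completion entries posted so far
        self.sent = []   # vectors for which an interrupt was sent (stand-in for MWr TLPs)

    def complete(self, vector_id, command):
        # The completion entry is always posted to the CQ.
        self.cq.setdefault(vector_id, []).append(command)
        # The interrupt is sent only if the vector's mask is clear.
        if self.mask.get(vector_id, False):
            return False
        self.sent.append(vector_id)
        return True


ctrl = Controller()
first = ctrl.complete(0, "A")    # mask clear: interrupt is sent
ctrl.mask[0] = True              # host sets the mask for vector 0
second = ctrl.complete(0, "B")   # mask set: completion posted, no interrupt
```

Note that the completion for command B still lands in the CQ even though no interrupt accompanies it, which is what allows the host to find it without being notified.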
Due to the asynchronous nature of completion processing in data storage systems that use NVMe controllers, these systems are susceptible to spurious hardware interrupts. Most spurious interrupts occur when there is a race condition between the host system software configuring the interrupt-related registers and the storage controller sending out the interrupt. Three activities proceed asynchronously with respect to one another: the device's sending of MSI-X interrupts via MWr TLPs, the host's masking of the MSI-X interrupts via MWr TLPs, and the host's opportunistic processing of completions. Together, they create race conditions that allow interrupts generated by the device to become spurious. Spurious interrupts can cause software programs being executed by the host CPU to misbehave or operate inefficiently. For this reason, storage controller designers generally try to design storage controllers to minimize the occurrence of spurious interrupts. Spurious interrupts, however, are almost impossible to eliminate completely.
For example, a first scenario that can result in a spurious interrupt arises when the host CPU 11 is processing completions contained in a CQ in a best-efforts manner. For this scenario, it will be assumed that the host system 3 has sent commands A and B sequentially to the storage controller 4. When the storage controller 4 has completed processing of commands A and B, the storage controller 4 posts a completion of command A and a completion of command B to the host system 3, both of which are stored in the same CQ. When the storage controller 4 posts the completion of command A, the storage controller 4 composes a first interrupt and posts the first interrupt to the host system 3 to inform it that a completion for command A is in the CQ.
When the storage controller 4 posts the completion of command B to the CQ, it composes a second interrupt to inform the host system 3 that the controller 4 has posted a completion for command B in the CQ. However, if the MSI-X mask associated with that interrupt vector is already set, the controller 4 does not post the second interrupt to the host system 3, but maintains the interrupt as pending in the controller 4. In addition, because the host system 3 is processing completions in the CQ on a best-efforts basis, it is possible that the host system 3 will process the completions associated with both commands, even though the second interrupt was never posted to the host system 3. Once the host system 3 clears the corresponding MSI-X mask, the storage controller 4 may immediately send out the interrupt that it has been holding as pending. However, in this scenario, the host system 3 has already processed both of the completions, and therefore the interrupt associated with the completion of command B would be spurious.
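The first scenario can be replayed step by step in a small simulation. The sketch below is illustrative only: the class and function names are hypothetical, and it assumes that the host sets the mask on receiving the first interrupt and drains the entire CQ before clearing the mask.

```python
class Controller:
    """Hypothetical controller model with a single vector and a pending flag."""

    def __init__(self):
        self.masked = False
        self.pending = False
        self.cq = []
        self.interrupts_sent = 0

    def post_completion(self, command):
        self.cq.append(command)
        if self.masked:
            self.pending = True          # hold the interrupt in the controller
        else:
            self.interrupts_sent += 1    # send the interrupt to the host

    def set_mask(self):
        self.masked = True

    def clear_mask(self):
        self.masked = False
        if self.pending:                 # pending interrupt fires on unmask
            self.pending = False
            self.interrupts_sent += 1


def run_first_scenario():
    ctrl = Controller()
    processed = []
    spurious = 0

    ctrl.post_completion("A")   # first interrupt reaches the host
    ctrl.set_mask()             # assumption: host masks the vector in its handler
    ctrl.post_completion("B")   # second interrupt is held as pending

    # Best-efforts processing: the host drains everything it finds in the CQ.
    while ctrl.cq:
        processed.append(ctrl.cq.pop(0))

    sent_before = ctrl.interrupts_sent
    ctrl.clear_mask()
    # An interrupt that arrives when the CQ holds no work is spurious.
    if ctrl.interrupts_sent > sent_before and not ctrl.cq:
        spurious += 1
    return processed, ctrl.interrupts_sent, spurious


processed, interrupts_sent, spurious = run_first_scenario()
```

In this run the host processes both completions after the first interrupt, so the second interrupt, released when the mask is cleared, finds an empty CQ and is counted as spurious.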
A second scenario that can result in a spurious interrupt arises when the host system 3 sets the MSI-X mask, but an interrupt that the storage controller 4 began composing before it became aware that the mask was set is already travelling through the hardware pipeline and leaks out to the host system 3. This can happen if, for example, the storage system 2 has multiple storage controllers 4 that post interrupts in different ways. For example, one controller 4 may compose an interrupt before checking the associated MSI-X mask, whereas another controller 4 may check the associated MSI-X mask before composing an interrupt. This is more likely to be a problem in cases where one of the controllers 4 incurs significant latency in composing an interrupt due to associated tasks such as, for example, fetches and arbitration. In these types of situations, it is possible that the MSI-X mask bit was set after the controller 4 began composing an interrupt, but before the controller 4 posted the interrupt to the host system 3. Consequently, the interrupt becomes spurious. In some cases, the controller 4 may be configured to confirm that the MSI-X mask bit is not set as a last step prior to sending out an interrupt. While this may reduce the chance of sending out a spurious interrupt, it can be a waste of power and bandwidth if the MSI-X mask bit is already set when the confirmation process is performed, because the work of composing the interrupt has already been done.
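The timing dependence in the second scenario can be illustrated with a simple tick-based model. The function `interrupt_outcome`, its parameters, and the outcome labels are hypothetical; the sketch assumes that composition starts at tick 0, takes `compose_latency` ticks, and that the mask becomes visible to the controller at tick `mask_set_at`.

```python
def interrupt_outcome(order, mask_set_at, compose_latency):
    """Classify what happens to one interrupt under a given check ordering.

    order: "check_then_compose" or "compose_then_check" (hypothetical labels)
    mask_set_at: tick at which the controller sees the mask as set
    compose_latency: ticks spent composing (fetches, arbitration, etc.)
    """
    def masked(tick):
        return tick >= mask_set_at

    send_tick = compose_latency  # composition starts at tick 0

    if order == "check_then_compose":
        if masked(0):
            return "suppressed"              # never composed, nothing wasted
        # The mask may be set while the interrupt is being composed.
        return "spurious" if masked(send_tick) else "delivered"

    if order == "compose_then_check":
        if masked(send_tick):
            return "suppressed_after_compose"  # composition effort was wasted
        return "delivered"

    raise ValueError("unknown order: " + order)


# Mask is set at tick 2 while composition takes 5 ticks.
early_check = interrupt_outcome("check_then_compose", 2, 5)
late_check = interrupt_outcome("compose_then_check", 2, 5)
```

With the mask set mid-composition, the controller that checks first leaks a spurious interrupt, while the controller that checks last suppresses it at the cost of composition work that is thrown away.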
Accordingly, a need exists for a system and method for reducing spurious interrupts in a data storage system.