Many applications may be running on the host, and each application may require access to data on the hard drives. The data flow between a host application and the hard drives is referred to as an input/output (IO) operation. Storage controllers typically control the flow of data between a host and storage devices such as hard drives. Storage controllers perform data processing operations such as cyclic redundancy check (CRC) calculations for data integrity, encryption for security, parity calculation for RAID applications, etc. These data processing operations are usually done by dedicated hardware engines within the storage controller device.
The processing rate of the hardware engines determines the overall system IO processing rate. Simple bandwidth analysis reveals that the engines form the bottleneck. Consider an 8-lane PCIe Gen 3 link as the interface between the host and storage controller:
PCIe bandwidth = 8 * (8 Gbps) = 64 Gbps = 8.0 GBps.
Assuming that 95% of the PCIe bandwidth is used for actual data, the available bandwidth is:
Available PCIe bandwidth = 0.95 * 8.0 GBps = 7.6 GBps.
Assuming 16 SAS 2.0 lanes as the interface between the storage controller and the hard disks:
SAS bandwidth = 16 * (6 Gbps) = 96 Gbps = 12 GBps.
Assuming that 85% of the SAS bandwidth is used for actual data, the available bandwidth is:
Available SAS bandwidth = 0.85 * 12 GBps = 10.2 GBps.
Now consider using a hardware engine for calculating a CRC Data Integrity Field (DIF). If the engine has a 64-bit data bus and is operating at 300 MHz, then:
Maximum processing rate per engine = (8 B) * 300 MHz = 2.4 GBps.
Thus the hardware engine is the performance bottleneck for a storage controller which works with high speed interfaces.
One way to address this performance bottleneck is to use multiple instances of the hardware engine such that the required processing bandwidth can be distributed across the multiple instances. Depending on the interface bandwidth, the system can be scaled to meet different performance requirements. For example, in an IO processing system which uses high speed interfaces, such as PCIe (PCIe Gen 3, 8 Gbps) for the host interface and SAS (SAS 2.0, 6 Gbps) for the disk interface as described above, where the IO processing rate is limited by the speed of the data processing engine, multiple data processing engines can be integrated into the system to match the processing throughput with the interface throughput. In the above example, meeting the 7.6 GBps PCIe bandwidth with engines that each process 2.4 GBps requires 7.6/2.4, or approximately 3.2, engines, so at least four instances of the hardware DIF engine would be required.
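The arithmetic above can be reproduced with a short Python sketch. The function and variable names are illustrative only; the lane counts, line rates, efficiency figures, and engine parameters are the ones assumed in the example above.

```python
import math

def usable_gbytes_per_s(lanes, gbps_per_lane, efficiency):
    """Usable bandwidth in GBps: raw Gbps / 8 bits per byte, scaled by efficiency."""
    return lanes * gbps_per_lane / 8 * efficiency

# Figures taken from the example in the text.
pcie_gbytes = usable_gbytes_per_s(lanes=8, gbps_per_lane=8, efficiency=0.95)
sas_gbytes = usable_gbytes_per_s(lanes=16, gbps_per_lane=6, efficiency=0.85)

# One DIF engine: 64-bit (8-byte) data bus clocked at 300 MHz.
engine_gbytes = 8 * 300e6 / 1e9

# Smallest whole number of engines whose combined rate covers the PCIe link.
engines_needed = math.ceil(pcie_gbytes / engine_gbytes)

print(f"PCIe usable:   {pcie_gbytes:.1f} GBps")
print(f"SAS usable:    {sas_gbytes:.1f} GBps")
print(f"Per engine:    {engine_gbytes:.1f} GBps")
print(f"Engines needed for PCIe: {engines_needed}")
```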
However, the use of multiple engines requires extra mechanisms to distribute the data processing tasks across the engines. In command based IO processing systems, this presents additional complexities of preserving IO coherency while distributing the processing across multiple engines. In such systems, the data flow is split into multiple small frames of data and separate commands are created by the IO processor (IOP) describing how each data frame needs to be processed. For example, consider an IO operation where 64 KB of raw data need to be transferred from host to disk and an 8 byte CRC DIF needs to be inserted after every 4 KB of data. For such an IO operation, the IOP, which controls the storage controller device, may initiate multiple DMA transfers, each transfer moving 1 KB of data from host memory into on-chip memory. The IOP will then create commands for the DIF engine to process each of the 1 KB data blocks. These commands are loaded into a command queue.
Since each 1 KB block represents a fraction of one full sector (4 KB) on which the CRC is to be calculated, there needs to be a global structure per IO operation called “IO context” which holds intermediate CRC results obtained after every 1 KB of data. The partial result at the end of the first 1 KB needs to be updated in the IO context before the second block can start processing. The CRC for the second block is calculated starting with the partial CRC obtained from the first block. This means that the commands of the same IO operation need to be processed in sequence. This also implies that the command scheduler should not schedule two commands of the same IO operation onto different DIF engine instances at the same time.
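The dependency on the IO context can be illustrated with a minimal Python sketch. It uses CRC-32 from the standard `zlib` module purely as a stand-in for the DIF CRC (an actual DIF engine would use a different polynomial and field layout), and the `io_context` dictionary and `process_block` function are hypothetical names introduced only for this illustration.

```python
import zlib

SECTOR_SIZE = 4096  # one full sector, as in the example above
BLOCK_SIZE = 1024   # size of each DMA transfer / engine command

def process_block(io_context, block):
    """Model one engine command: resume the CRC from the saved partial result,
    then write the new partial result back into the IO context."""
    io_context["partial_crc"] = zlib.crc32(block, io_context["partial_crc"])
    io_context["bytes_done"] += len(block)

# Arbitrary test data standing in for one 4 KB sector of host data.
sector = bytes(i % 251 for i in range(SECTOR_SIZE))

# Per-IO context holding the intermediate CRC between commands.
io_context = {"partial_crc": 0, "bytes_done": 0}

# Commands of the same IO must run in order, each starting from the
# partial CRC left behind by the previous one.
for offset in range(0, SECTOR_SIZE, BLOCK_SIZE):
    process_block(io_context, sector[offset:offset + BLOCK_SIZE])

chained_crc = io_context["partial_crc"]
whole_crc = zlib.crc32(sector)  # CRC computed over the full sector at once
print(chained_crc == whole_crc)
```

Because each call consumes the partial result written by the previous call, a second command of the same IO cannot start until the first has updated the context, which is the sequencing constraint described above.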
Since multiple applications are running in parallel on the host, there will typically be multiple IO operations requesting the same hardware operation. Thus the commands for different IO operations will be randomly interleaved in the command queue. In pure FIFO scheduling, the commands are popped out of the command queue and scheduled to free engines in order. This works well if all commands are independent of each other. However, in operations like CRC DIF computation, there are inherent dependencies between successive commands of the same IO flow. Hence, two commands belonging to the same IO operation cannot be scheduled onto different engines at the same time. This is shown in FIGS. 1A-1D. FIG. 1A shows a multi-engine system with four engines (E0-E3), and a command queue holding five commands from four separate IO flows (IO1-IO4), also referred to herein as input streams or input queues. The engines are non-pipelined. FIG. 1B shows a head of queue command (IO1_C1) scheduled on engine E0. In FIG. 1C, a next command (IO2_C1) is scheduled on engine E1. The next command in the command queue is IO2_C2, which cannot be scheduled until the processing of IO2_C1 has completed. As shown in FIG. 1D, the head of the line is blocked, and engines E2 and E3 remain idle until command IO2_C1 is processed. Thus, if the command at the head of the command queue cannot be scheduled because of an IO dependency, then all other commands in the command FIFO will be blocked (“head of line blocking”). This leaves engines underutilized and wastes processing bandwidth.
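The scenario of FIGS. 1A-1D can be approximated with a small Python model of a strict FIFO scheduler. This is a sketch under stated assumptions: the function `fifo_schedule` is hypothetical, no command completes during the scheduling pass, and the last two queue entries (IO3_C1 and IO4_C1) are assumed, since the text names only the first three commands.

```python
from collections import deque

def fifo_schedule(queue, num_engines):
    """One scheduling pass of a strict FIFO scheduler.

    Only the head of the queue may be dispatched. If the head belongs to an
    IO that is already running on an engine, scheduling stops, even though
    later commands in the queue could run on the remaining free engines.
    """
    pending = deque(queue)
    engines = [None] * num_engines  # None marks an idle engine
    busy_ios = set()
    for engine_index in range(num_engines):
        if not pending:
            break
        io_id, cmd_id = pending[0]
        if io_id in busy_ios:
            break  # head-of-line blocking
        pending.popleft()
        engines[engine_index] = (io_id, cmd_id)
        busy_ios.add(io_id)
    return engines, list(pending)

# Five commands from four IO flows; the last two entries are assumed.
queue = [("IO1", "C1"), ("IO2", "C1"), ("IO2", "C2"), ("IO3", "C1"), ("IO4", "C1")]
engines, blocked = fifo_schedule(queue, num_engines=4)

for index, command in enumerate(engines):
    print(f"E{index}: {command if command else 'idle'}")
print("still queued:", blocked)
```

In this model E0 and E1 receive IO1_C1 and IO2_C1, while E2 and E3 stay idle because IO2_C2 at the head of the queue cannot be dispatched, even though the assumed IO3 and IO4 commands behind it have no such dependency.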
Data processing engines also typically have internal pipeline stages to improve performance. A simple example of a data processing engine 100 with two pipeline stages is illustrated in FIG. 2. The processing engine 100 is command driven, and includes a command buffer stage 102, and a command execution stage 104, within a data processing block 106. A command fetch block (not shown) can fetch commands from the command memory 108 and feed the command into the engine 100. The commands are processed in the command execution stage 104 of the engine 100. While a command is being processed in the execution stage 104, the next command is buffered in the command buffer stage 102.
In addition to the pipeline stages inside the processing engine, there may be pipeline stages outside the engine. A command pre-fetch stage 110, and a command output stage 112, which can respectively buffer input commands and output commands, are shown.
The command memory ports may be shared by multiple masters, and the access latency of the memory may vary based on the total number of requests that are active. In order to decouple the engine 100 from the variable latency of the command memory 108, additional pipeline stages may be added on the engine command interface. For example, the command pre-fetch stage 110 can be used to pre-fetch the command from the command memory 108 so that the engine 100 does not wait on the memory latency. The command output stage 112 can be used to hold the completed command from the engine 100 until it is written into the output command memory 114.
A loopback path is generally provided for the IO context from the command execution stage 104 and the command output stage 112 to the command pre-fetch stage 110. If the command in the command pre-fetch stage 110 belongs to the same IO as the command in the command execution stage 104, then the command pre-fetch stage 110 must wait until the processing completes in the command execution stage 104. After the command in the command execution stage 104 completes, the IO context is updated and ready for use by the command in the command pre-fetch stage 110. The IO context can be internally looped back from the command execution stage 104 to the command pre-fetch stage 110 without having to be written back to the command memory 108. Similarly, if the commands in the command pre-fetch stage 110 and the command output stage 112 are of the same IO, the IO context can be looped back from the command output stage 112 to the command pre-fetch stage 110. The pipeline architecture of the engines introduces additional complexities for scheduling commands.
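The loopback decision can be summarized in a few lines of Python. This is only an illustrative sketch: the function name, the string labels, and the use of `None` for an empty stage are hypothetical, and checking the execution stage before the output stage is an assumption (the execution stage holds the more recent command), not something the description above specifies.

```python
def context_source(prefetch_io, execution_io, output_io):
    """Return where the pre-fetch stage should take its IO context from.

    Each argument is the IO identifier of the command currently held in that
    stage; execution_io and output_io may be None when the stage is empty.
    prefetch_io is expected to be a real identifier, not None.
    """
    if prefetch_io == execution_io:
        # Same IO is still executing: wait for it to finish, then loop back.
        return "loopback_from_execution_stage"
    if prefetch_io == output_io:
        # Same IO has completed but is not yet written out: loop back.
        return "loopback_from_output_stage"
    # No dependency inside the pipeline: use the context from command memory.
    return "command_memory"

print(context_source("IO2", "IO2", "IO1"))
print(context_source("IO2", "IO3", "IO2"))
print(context_source("IO2", "IO3", None))
```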
It is, therefore, desirable to provide an improved method of scheduling commands in a multi-engine system.