Parallel-task processors, which have multiple hardware engines and/or can handle multiple threads per engine, are known for processing tasks in parallel. Because it may be difficult or impossible in a given system for a single engine to process data fast enough to support a throughput requirement, parallel processing with multiple engines may be employed to meet a throughput performance target. Thus, the combined processing rate of the hardware engines determines the overall system processing rate, which is commonly a bottleneck in the system's throughput.
FIG. 1 shows an exemplary storage system known in the art. The system 100 comprises an 8-lane PCIe (Peripheral Component Interconnect Express) Gen. 3 link 101 as the interface between a host 110 and a storage controller 111. The bandwidth of the PCIe link is 8.0 GB/s (8 lanes*8 Gb/s per lane=64 Gb/s=8.0 GB/s). Assuming that 95% of the PCIe bandwidth is used for actual data, and the other 5% is used for overhead, the available bandwidth for PCIe data on the link is 7.6 GB/s (0.95*8.0 GB/s=7.6 GB/s). The system 100 also comprises a 16-lane SAS 2.0 (Serial Attached Small Computer System Interface) link 102 as the interface between the storage controller 111 and the storage devices 112. The SAS interface bandwidth is 12 GB/s (16 lanes*6 Gb/s per lane=96 Gb/s=12 GB/s). Assuming that 85% of the SAS bandwidth is used for actual data, and the other 15% is used for overhead, the available bandwidth for SAS data on the link is 10.2 GB/s (0.85*12 GB/s=10.2 GB/s).
Therefore, in this exemplary storage system, the minimum required throughput is 7.6 GB/s, the data bandwidth of the slower (PCIe) link. It is difficult to get a single hardware engine to process data fast enough to handle 7.6 GB/s of traffic.
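The bandwidth arithmetic above can be reproduced with a short sketch. The function name `usable_bandwidth_gbytes` and the variable names are illustrative assumptions, not part of the described system; the lane counts, per-lane rates, and payload fractions are the figures given for FIG. 1.

```python
def usable_bandwidth_gbytes(lanes, gbits_per_lane, payload_fraction):
    """Return the data bandwidth of a link in GB/s.

    lanes * gbits_per_lane is the raw rate in Gb/s; dividing by 8
    converts bits to bytes, and payload_fraction removes overhead.
    """
    raw_gbytes = lanes * gbits_per_lane / 8
    return raw_gbytes * payload_fraction


# Figures from the exemplary system of FIG. 1.
pcie_data = usable_bandwidth_gbytes(lanes=8, gbits_per_lane=8.0, payload_fraction=0.95)
sas_data = usable_bandwidth_gbytes(lanes=16, gbits_per_lane=6.0, payload_fraction=0.85)

# The slower of the two links sets the rate the controller must sustain.
required = min(pcie_data, sas_data)

print(f"PCIe data: {pcie_data:.1f} GB/s")
print(f"SAS data:  {sas_data:.1f} GB/s")
print(f"Required:  {required:.1f} GB/s")
```

Under these assumptions the sketch yields 7.6 GB/s for PCIe, 10.2 GB/s for SAS, and a required throughput of 7.6 GB/s.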
A known solution to this performance bottleneck is the use of multiple instances of the hardware engine such that the required processing bandwidth can be distributed across the multiple instances. Depending on the interface bandwidth, the system can be scaled to meet different performance requirements. For example, in an IO (input/output) processing system that uses high-speed interfaces, such as PCIe (Gen. 3, 8 Gb/s per lane) for the host interface and SAS (SAS 2.0, 6 Gb/s per lane) for the disk interface as described above, where the IO processing rate is limited by the speed of the data processing engine, multiple data processing engines can be integrated into the system to match the processing throughput with the interface throughput.
The storage controller of the storage system example above may use encryption hardware to encrypt data from the host before it is written to the storage devices. Typical encryption hardware engines have a throughput of approximately 1.5 GB/s. Therefore, at least 6 instances of the encryption hardware engine are required to meet the 7.6 GB/s PCIe bandwidth (7.6 GB/s/1.5 GB/s is approximately 5.07, which rounds up to 6).
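The engine count follows from rounding the ratio of required throughput to per-engine throughput up to the next whole engine. A minimal sketch, in which the function name `engines_needed` is an illustrative assumption:

```python
import math


def engines_needed(required_gbytes, per_engine_gbytes):
    """Smallest whole number of engines whose combined rate meets the target."""
    return math.ceil(required_gbytes / per_engine_gbytes)


# 7.6 GB/s of PCIe data, roughly 1.5 GB/s per encryption engine.
count = engines_needed(7.6, 1.5)
print(count)
```

This assumes the load divides evenly across engines, which the scheduling constraints described later may prevent in practice.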
In command-based IO processing systems, to maintain IO coherency in a data flow, the storage controller has an IO processor (IOP) that splits the IO data into small frames and creates a separate IO command for each IO data frame. Each IO command describes how the respective IO data should be processed by the system.
For example, consider an IO operation where 64 KB of raw data is transferred from host to disk and encryption is performed on every 4 KB sector of data. For such an IO operation, the IOP may initiate multiple data transfers, each transfer moving 1 KB of data from host memory into on-chip memory. The IOP will then create commands for the encryption engine(s) of the storage controller to process each of the 1 KB data blocks. These commands are loaded into a command queue.
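The command generation in this example can be sketched as follows. The `EncryptCommand` fields and the `build_command_queue` function are hypothetical names chosen for illustration; the actual command format of an IOP is not specified here. Only the sizes (64 KB operation, 4 KB sector, 1 KB block) come from the example.

```python
from collections import deque
from dataclasses import dataclass

KB = 1024
SECTOR_SIZE = 4 * KB  # encryption sector size from the example
BLOCK_SIZE = 1 * KB   # size of each transfer from host memory


@dataclass(frozen=True)
class EncryptCommand:
    """Hypothetical command describing one 1 KB block to encrypt."""
    sector: int           # index of the 4 KB sector the block belongs to
    block_in_sector: int  # position of the block within that sector (0..3)
    offset: int           # byte offset of the block within the IO operation
    length: int           # number of bytes in the block


def build_command_queue(io_size):
    """Split an IO operation into per-block commands, in address order."""
    queue = deque()
    for offset in range(0, io_size, BLOCK_SIZE):
        queue.append(EncryptCommand(
            sector=offset // SECTOR_SIZE,
            block_in_sector=(offset % SECTOR_SIZE) // BLOCK_SIZE,
            offset=offset,
            length=min(BLOCK_SIZE, io_size - offset),
        ))
    return queue


queue = build_command_queue(64 * KB)
print(len(queue))                         # number of commands
print(len({cmd.sector for cmd in queue})) # number of distinct sectors
```

For the 64 KB operation this produces 64 commands spread over 16 sectors, with four consecutive commands per sector.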
Since each 1 KB block represents a fraction of one full encryption data sector (4 KB), intermediate Initial Value (IV) results are obtained after processing each 1 KB block of data. These IVs are stored in a global data structure, called an IO context, for each 4 KB sector of IO data. The partial result at the end of the first 1 KB block needs to be updated in the IO context before the second 1 KB block can start processing. The encryption for the second 1 KB block is calculated starting with the IV obtained from processing the first 1 KB block. This means that the IO data blocks of the same IO data sector need to be processed in sequence. This also implies that two IO data blocks of the same IO data sector cannot be processed in parallel on two different encryption engine instances at the same time.
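The ordering dependency can be illustrated with a toy model. The sketch below is not an encryption algorithm: `process_block` uses a SHA-256 hash purely as a stand-in for whatever chaining the real engine performs, and the dictionary `io_context` is an assumed simplification of the IO context data structure. It shows only that each block consumes the IV left by the previous block of the same sector.

```python
import hashlib


def process_block(iv, block):
    """Toy stand-in for the engine: derive the next intermediate IV.

    A real engine would also emit ciphertext; only the IV chaining
    is modelled here.
    """
    return hashlib.sha256(iv + block).digest()[:16]


def process_sector(io_context, sector, blocks, initial_iv):
    """Process the blocks of one sector strictly in the order given."""
    io_context[sector] = initial_iv
    for block in blocks:
        # The IV stored by the previous block must be read before this
        # block starts, and updated before the next block starts.
        io_context[sector] = process_block(io_context[sector], block)
    return io_context[sector]


iv0 = bytes(16)
blocks = [bytes([i]) * 1024 for i in range(4)]  # four 1 KB blocks of one sector

in_order = process_sector({}, 0, blocks, iv0)
swapped = process_sector({}, 0, [blocks[1], blocks[0], blocks[2], blocks[3]], iv0)

# Reordering blocks within a sector changes the final IV, so blocks of the
# same sector cannot be dispatched to different engines at the same time.
print(in_order != swapped)

# Different sectors keep separate IO context entries and do not depend
# on each other, so they may be processed in parallel.
ctx = {}
process_sector(ctx, 0, blocks, iv0)
process_sector(ctx, 1, blocks, iv0)
print(sorted(ctx))
```

In this model, parallelism is available across sectors but not within a sector, which is the scheduling constraint described above.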
Processing data in separate operations in parallel (either in different threads of an engine or in different hardware engines) creates scheduling requirements such as the ones described above. Additional scheduling problems may arise based on various inefficiencies of the processor associated with these scheduling requirements. It is, therefore, desirable to mitigate or obviate these inefficiencies and their deleterious effects.