Hardware-based acceleration systems offload all or part of an application's processing requirements from the CPU to another device, such as a field-programmable gate array (FPGA), graphics processing unit (GPU), or digital signal processor (DSP). These special “acceleration devices” share some common characteristics, such as a high number of cores, fast memory, and a high degree of parallelization, that enable particular workloads to be executed on them at a much higher speed than on general-purpose CPUs. However, these devices have no standalone operating system (“OS”) and hence are under the control of the OS running on the CPU, which controls the input/output (“I/O”) subsystems, including accelerator devices, file systems, and memory transfers.
As commonly implemented, the accelerator devices receive the data to be processed from the CPU or its memory, typically via a direct memory access (“DMA”) operation. After the data has been processed, it is returned to the CPU or its memory. This method of offloading data processing is relatively efficient when the data is small (e.g., less than 1 GB) and/or already resides in the memory of the CPU. However, data is often stored on secondary storage devices, such as hard disks or SSDs, especially when the data to be processed by the accelerator is relatively large. In that case, the data must first be read from secondary storage into the CPU's memory and only then sent to the accelerator device, so every byte is moved twice, which may be very inefficient for large amounts of data.
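The cost of the staged data path described above can be illustrated with a rough back-of-the-envelope model. The following Python sketch is illustrative only: the function names and the bandwidth figures are hypothetical assumptions, not measurements, and the model assumes that the two hops run strictly one after the other with no overlap or pipelining.

```python
GB = 10**9  # decimal gigabyte, in bytes


def _check(size_bytes, *bandwidths):
    # Reject inputs that would make the model meaningless.
    if size_bytes < 0:
        raise ValueError("size_bytes must be non-negative")
    if any(bw <= 0 for bw in bandwidths):
        raise ValueError("bandwidths must be positive")


def resident_transfer_seconds(size_bytes, dma_bw):
    """Data already in CPU memory: a single DMA hop to the accelerator."""
    _check(size_bytes, dma_bw)
    return size_bytes / dma_bw


def staged_transfer_seconds(size_bytes, storage_bw, dma_bw):
    """Data on secondary storage: read into CPU memory, then DMA to the
    accelerator. The hops are assumed sequential (no overlap)."""
    _check(size_bytes, storage_bw, dma_bw)
    return size_bytes / storage_bw + size_bytes / dma_bw


# Hypothetical figures chosen only for illustration.
size = 100 * GB        # data set to be processed
storage_bw = 2 * GB    # assumed secondary-storage read rate, bytes/s
dma_bw = 10 * GB       # assumed CPU-memory-to-accelerator DMA rate, bytes/s

resident = resident_transfer_seconds(size, dma_bw)
staged = staged_transfer_seconds(size, storage_bw, dma_bw)
print(f"resident in CPU memory: {resident:.0f} s")
print(f"staged through CPU memory: {staged:.0f} s")
```

Under these assumed figures, the staged path is dominated by the read from secondary storage into CPU memory. The model ignores other costs of staging, such as CPU memory capacity limits and the return transfer of results.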