The present disclosure relates to systems and methods for distributed data processing. In particular, the present disclosure relates to methods for offloading data processing tasks using in-storage code execution.
Many applications require data processing of some kind, for example, scanning a dataset for a pattern, sorting data, building an index, or compacting data. These include “Big Data” applications involving MapReduce tasks. A host device executing such an application may be required to read the whole dataset from the storage device for processing and then write the resulting dataset back to storage after the processing is completed. Such activity can generate substantial traffic on the bus or network between the host and the storage device, unnecessarily burden the host, and consume a significant amount of power.
One current approach to overcoming the above problems is to offload the data processing tasks from the host device to the storage device. In this approach, the host compiles the source code for the data processing task and sends the resulting compiled binary code to the storage device, where it is executed. However, this approach suffers from a number of drawbacks. First, the data processing task can be offloaded to only a limited range of storage devices, namely those having specific hardware architectures that are compatible with the compiled binary code. Second, the binary code sent from the host to the storage device may be vulnerable to security issues, requiring the use of complicated and resource-intensive measures (e.g., tunneling, containers) on the storage device to protect the integrity of the data processing task. Third, the data processing task may not run optimally on the storage device because the compiled binary code cannot account for the real-time operating conditions that may occur on the storage device during execution.
Thus, there is a need for a method and system that, for example, offloads data processing tasks onto storage systems in a flexible, secure, and resource-efficient way that provides optimal performance.