Data analytics and “Big Data” processing have become increasingly important in recent years. These workloads require processing enormous volumes of data. One approach to processing such volumes of data is to distribute the processing tasks across large numbers of servers and process the workload in parallel. For example, the Apache Hadoop software framework enables tasks to be distributed across large numbers of commodity servers and workloads to be processed in parallel using MapReduce. While Hadoop and MapReduce provide excellent scalability, when implemented at large scale they require a tremendous amount of inter-server communication, and they do not use processor and memory resources efficiently.
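The MapReduce pattern referenced above can be illustrated with a small single-process sketch; the function names (map_phase, shuffle, reduce_phase) and the word-count job are illustrative conventions of the model, not part of Hadoop's actual API:

```python
from collections import defaultdict

def map_phase(shards):
    # Map step: each "server" independently emits (word, 1) pairs
    # for the shard of documents assigned to it.
    for doc in shards:
        for word in doc.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle step: group intermediate pairs by key. At cluster scale,
    # this grouping is the source of the heavy inter-server
    # communication noted above.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce step: combine the grouped values for each key.
    return {key: sum(values) for key, values in groups.items()}

shards = ["big data big compute", "big data analytics"]
counts = reduce_phase(shuffle(map_phase(shards)))
# counts == {'big': 3, 'data': 2, 'compute': 1, 'analytics': 1}
```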
For some compute- and memory-bandwidth-intensive workloads, such as those used for data analytics and “Big Data,” it is difficult to achieve the required level of performance with processor cores alone. To address this, so-called “accelerators” have been developed. Accelerators were initially implemented as components coupled to CPUs (central processing units) and managed as IO (input-output) devices with their own address spaces, which requires significant levels of IO communication to transfer data between the accelerator address space and applications running in the system memory address space. More recently, CPUs employing System on a Chip (SoC) architectures with embedded accelerators have been introduced.
Accelerators have steadily improved in capability, with one of the most significant recent trends being “shared virtual memory” (SVM) capable accelerators. The traditional accelerator had to be managed as an input-output (IO) device in its own address space; this was accomplished with expensive kernel-mode drivers (KMDs) that required applications to cross back and forth between user space and kernel space, pinning pages in memory or copying user buffers to/from special buffers managed by the OS/kernel-mode driver. With SVM, the accelerator or IO device can directly work on the address space of any user application thread, as it shares the same virtual-to-physical address translation capabilities as the CPU thread. This is a key improvement in accelerator efficiency (from the point of view of data movement); it enables user-mode submissions directly to the accelerators (via a “user-mode driver” or UMD) and results in simpler programming models and easier adoption.
However, for applications that need low-latency processing (especially small-buffer processing), SVM also poses a challenge. When an accelerator is given a job to work on, the job descriptor identifies the input and output data buffers in virtual memory space that the accelerator is to access. These buffers are allocated by the user application and thus, depending on their sizes, may generally comprise many different physical memory pages. The accelerator must translate the virtual addresses (VAs) to physical addresses (PAs) to work on the job, and this address translation adds latency overhead to traditional accelerator designs.
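The translation step described above can be modeled minimally. The page table contents and page size here are assumptions chosen for illustration; the point is that a user buffer spanning consecutive virtual pages may land in scattered physical frames, and each access requires a translation:

```python
PAGE_SIZE = 4096  # 4 KiB pages (a common but assumed page size)

# Hypothetical per-process page table: virtual page number -> physical
# frame number. Three consecutive virtual pages map to non-contiguous
# physical frames, as a user-allocated buffer generally would.
page_table = {0x10: 0x5A, 0x11: 0x23, 0x12: 0x7F}

def translate(virtual_address):
    # Mimics the VA->PA lookup an SVM-capable accelerator must perform
    # for every page of every buffer it touches; each lookup (or page
    # walk on a TLB miss) adds latency, which is why small-buffer,
    # low-latency jobs are especially sensitive to this overhead.
    vpn, offset = divmod(virtual_address, PAGE_SIZE)
    frame = page_table[vpn]  # an unmapped page would fault here
    return frame * PAGE_SIZE + offset

# Two addresses in the same buffer, one page apart virtually, resolve
# to physical addresses in unrelated frames.
pa_first = translate(0x10 * PAGE_SIZE + 5)
pa_second = translate(0x11 * PAGE_SIZE + 5)
```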