Data analytics and “Big Data” processing have become increasingly important in recent years. Data analytics and Big Data workloads require processing huge amounts of data. One approach for processing such huge amounts of data is to distribute the processing tasks across large numbers of servers and process the workload in parallel. For example, the Apache Hadoop software framework enables tasks to be distributed across large numbers of commodity servers and process workloads using MapReduce. While Hadoop and MapReduce provide excellent scalability, they require a tremendous amount of inter-server communication (when implemented at large scale), and do not efficiently use processor and memory resources.
Some compute and memory-bandwidth intensive workloads such as used for data analytics and Big Data are hard to get the required level of performance with processor cores. To address this, so-called “accelerators” have been developed. Accelerators were initially implemented as components that were coupled to CPUs (central processing units) and managed as an IO (input-output) device with its own address space, which requires significant levels of IO communication to transfer data between the accelerator address space and applications running in system memory address space. Recently, CPUs employing System on a Chip (SoC) architectures with embedded accelerators have been introduced.