Computing applications that perform large amounts of input and output (I/O) to and from external devices are becoming increasingly common. Such applications typically require the transfer of large quantities of data between memory, one or more central processing units (e.g., processor cores), and/or external devices. For example, the Peripheral Component Interconnect Express (PCIe) interface utilizes direct memory access (DMA), in which PCIe devices may directly access main memory. These DMA accesses contend directly with the processor cores for memory access. As the number of DMA accesses grows, the solutions normally employed to counter sluggish performance (e.g., adding more processing power or memory) become ineffectual, because the performance problems relate not to a lack of memory capacity or processing power, but rather to those components spending more time contending for access to the memory bus and less time actually processing the data.
For example, a common computer architecture is the symmetric multiprocessing (SMP) architecture, which utilizes multiple processors or multiple processor cores (for multi-core processors, the SMP architecture treats each core as a separate processor). All the processors access a shared main memory across a bus and are controlled by a single instance of an operating system. Since the memory is shared amongst all components, contention for access to the memory is common and represents a major challenge to the scalability of this architecture. As already noted, simply adding more or faster processor cores or faster memory yields diminishing performance gains: contention for the bus and for memory quickly becomes a major bottleneck, and CPU cycles which might otherwise be used for computing may be idled while waiting for the bus.
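The diminishing returns described above can be illustrated with a deliberately simplified bottleneck model. This is only a sketch under stated assumptions, not a description of any particular system: every task is assumed to need a fixed amount of compute time plus a fixed amount of exclusive time on the shared bus, and DMA traffic is assumed to consume a constant fraction of bus capacity. The function name `smp_speedup` and all of its parameters are hypothetical and chosen for illustration.

```python
def smp_speedup(cores, compute_time, bus_time, dma_bus_share=0.0):
    """Upper bound on speedup over a single core (with no DMA traffic)
    when all cores share one memory bus.

    Toy model (assumption): each task needs `compute_time` units on a core
    plus `bus_time` units of exclusive bus access, and DMA devices consume
    a constant `dma_bus_share` fraction of the bus.
    """
    if cores < 1 or compute_time < 0 or bus_time <= 0:
        raise ValueError("cores >= 1, compute_time >= 0, bus_time > 0 required")
    if not 0.0 <= dma_bus_share < 1.0:
        raise ValueError("dma_bus_share must be in [0, 1)")

    per_task = compute_time + bus_time
    # Tasks per unit time the cores could complete if the bus were never busy.
    core_limit = cores / per_task
    # Tasks per unit time the bus can serve after DMA takes its share.
    bus_limit = (1.0 - dma_bus_share) / bus_time
    # Baseline: one core, no DMA traffic.
    baseline = 1.0 / per_task
    # Throughput is capped by whichever resource saturates first.
    return min(core_limit, bus_limit) / baseline


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, smp_speedup(n, compute_time=3.0, bus_time=1.0),
              smp_speedup(n, compute_time=3.0, bus_time=1.0, dma_bus_share=0.5))
```

In this model, once the bus saturates, adding cores no longer raises throughput, and any bus capacity taken by DMA lowers the ceiling further. Real systems have caches, multiple memory channels, and arbitration policies that this sketch ignores.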
One attempt at solving this problem is the Non-Uniform Memory Access (NUMA) architecture, which dedicates different memory banks to different processors. One downside of this architecture is that moving data (e.g., a job or task) from one processor to another is expensive, and thus load balancing jobs between cores may be time consuming and difficult. This may lead to reduced performance, as additional CPU clock cycles must be expended in moving these jobs. Moreover, this architecture does little to address I/O-heavy applications.