The present disclosure generally relates to heterogeneous computing and computing appliances, and to a computational framework for fine-grained multi-threaded message passing that exploits data parallelism in irregular algorithms. Specifically, the present disclosure relates to a fine-grained multi-threaded message passing apparatus that can efficiently exploit data parallelism in irregular algorithms and that can be paired with, and used as an appliance for, medium to high-end general-purpose server systems.
Though systems such as Cray's MTA multi-threaded architecture are designed to execute irregular algorithms more efficiently than traditional computer architectures, these systems tend to target large-scale supercomputing and have hard-to-use programming abstractions. The present disclosure provides an apparatus that can be used with general-purpose server systems and that offers an easy-to-use programming abstraction, while still providing the fine-grained multi-threaded message passing that efficiently exploits data parallelism.
Memory-bound and irregular algorithms may not fully and efficiently exploit the advantages of conventional cache memory-based architectures. Examples of such algorithms include graph processing algorithms, semantic web processing (graph DBMS), and network packet processing. Furthermore, the cache memory and other overheads associated with general-purpose processors and server systems contribute to significant energy waste.
With single-core clock frequency remaining stagnant because power constraints have limited scaling, it has become increasingly clear that irregular algorithms will be better served by parallel multicore processing environments. Programs need to be rewritten to run in parallel on multicore architectures to meet performance objectives. However, there is as yet no efficient and popular parallel programming abstraction that a programmer can use productively to express all kinds of program parallelism. Furthermore, it is not clear that traditional shared-memory homogeneous multicores can continue to scale exponentially over the coming decades while maintaining the current power-performance budget. Recent trends suggest that asymmetric and heterogeneous multicores with application-specific customizations, and even fixed-function accelerators, will be required to meet power-performance goals.
Irregular algorithms tend to have large amounts of irregular data parallelism that is nevertheless difficult for conventional compilers and microprocessors to exploit.