1. Field of the Disclosure
This disclosure relates generally to parallel processing, and more particularly to systems and methods for performing message-driven prefetching at the network interface for distributed applications that employ a function shipping model.
2. Description of the Related Art
Existing computer systems implement prefetching in several different ways and for a variety of purposes. For example, one commonly used method for prefetching (which is sometimes referred to as “software prefetching”) is to include a “prefetch” instruction in the processor instruction-set architecture (ISA). In systems that employ software prefetching, these prefetch instructions can be placed in the application code to bring data from memory into the cache hierarchy before the processor requires it. These prefetch instructions can be inserted into the application code either explicitly by the programmer or automatically by the compiler. In cases in which the data access pattern is not known beforehand, the application and/or the compiler can employ prediction algorithms to issue prefetches on a speculative basis.
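As an illustration of software prefetching, the following is a minimal sketch, assuming a GCC- or Clang-compatible compiler, that uses the `__builtin_prefetch` compiler builtin to request data a fixed number of loop iterations before it is used. The function name `sum_indirect_prefetch` and the `PREFETCH_DISTANCE` value are hypothetical choices for illustration, not part of this disclosure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical prefetch distance: how many iterations ahead to prefetch.
   A suitable value depends on memory latency and the cost of each iteration. */
#define PREFETCH_DISTANCE 16

/* Sum data[idx[i]] for i in [0, n). The indirect access pattern is not
   predictable from the loop index alone, so the code explicitly prefetches
   the element that will be needed PREFETCH_DISTANCE iterations later. */
int64_t sum_indirect_prefetch(const int64_t *data, const size_t *idx, size_t n)
{
    assert(n == 0 || (data != NULL && idx != NULL));
    int64_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DISTANCE < n) {
            /* GCC/Clang builtin: rw = 0 (prefetch for a read),
               locality = 3 (keep the line in all cache levels). */
            __builtin_prefetch(&data[idx[i + PREFETCH_DISTANCE]], 0, 3);
        }
        sum += data[idx[i]];
    }
    return sum;
}
```

The bounds check on `i + PREFETCH_DISTANCE` keeps the read of `idx` in range; the prefetch itself is only a hint and does not change the computed result.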
Another existing approach for prefetching, which is sometimes referred to as “hardware prefetching,” does not require the explicit use of prefetch instructions. Instead, systems that employ hardware prefetching typically rely on special hardware predictors that monitor memory requests issued by the processor to infer future accesses. These systems typically employ learning algorithms to predict future memory accesses based on the recent history of accesses.
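One common form of history-based prediction is stride detection. The following is a minimal software model of a single-stream stride predictor, offered as a sketch under simplifying assumptions (one access stream, a prediction only after the same nonzero stride is seen twice in a row); real hardware predictors track many streams in parallel and differ in detail. The type and function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified model of a single-stream stride predictor (hypothetical names). */
typedef struct {
    uint64_t last_addr;   /* address of the most recent access */
    int64_t  last_stride; /* difference between the last two addresses */
    bool     has_addr;    /* true once at least one access has been observed */
    bool     has_stride;  /* true once at least two accesses have been observed */
} stride_predictor;

void stride_init(stride_predictor *p)
{
    assert(p != NULL);
    p->last_addr = 0;
    p->last_stride = 0;
    p->has_addr = false;
    p->has_stride = false;
}

/* Record an observed access. If the stride from the previous access matches
   the stride seen before it, store the predicted next address in
   *prefetch_addr and return true; otherwise return false. */
bool stride_observe(stride_predictor *p, uint64_t addr, uint64_t *prefetch_addr)
{
    assert(p != NULL && prefetch_addr != NULL);
    bool predict = false;
    if (p->has_addr) {
        int64_t stride = (int64_t)(addr - p->last_addr);
        if (p->has_stride && stride == p->last_stride && stride != 0) {
            *prefetch_addr = addr + (uint64_t)stride;
            predict = true;
        }
        p->last_stride = stride;
        p->has_stride = true;
    }
    p->last_addr = addr;
    p->has_addr = true;
    return predict;
}
```

For example, after observing accesses at 1000, 1064, and 1128, the model predicts 1192. A random access pattern never confirms a stride, so this kind of predictor issues no useful prefetches for it.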
Some existing network interfaces support fast user-level messaging mechanisms (e.g., for communication between the processes or threads of parallel programs). In a large cluster environment, such mechanisms remove operating system and message copy overhead, and can improve latency and network bandwidth usage, especially for small messages. In applications that employ a function shipping model of parallel processing, the inter-process/thread messages are typically requests that contain a reference (or pointer) to one or more data items, along with other processing specifications. In other words, applications that employ function shipping often access local data structures based on the content of a received message: upon receiving a message, the user application accesses local data that is directly or indirectly referenced by the contents of the message. When these applications operate on large problems (e.g., computations on extremely large data sets, which are sometimes referred to as “big-data computations” or “data-intensive applications”), the random access patterns they typically exhibit cause these accesses to almost always miss in the cache hierarchy, resulting in processor stalls. In such cases, the benefits of fast messaging with zero-copy optimizations are typically not fully realized, because the bottleneck shifts to the local data accesses that are triggered by the receipt of the messages.
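The receiver-side access pattern described above can be sketched as follows. This is a simplified illustration, assuming a hypothetical request format in which each message carries an index into a local table, an operation, and an operand. It shows only the conventional handler, in which every table access is triggered by message content, and is not the message-driven prefetching technique of this disclosure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical processing specifications carried by a request. */
typedef enum { OP_ADD = 0, OP_MAX = 1 } op_kind;

/* Hypothetical function-shipping request: a reference to a local data item
   (by index) plus a description of the work to perform on it. */
typedef struct {
    uint64_t index;   /* reference to an item in the receiver's local table */
    op_kind  op;      /* operation to apply */
    int64_t  operand; /* argument for the operation */
} request_msg;

/* Apply each received request to the local table. With a very large table and
   effectively random indices, each table[m->index] access is likely to miss in
   the cache hierarchy, so handler time is dominated by these loads rather than
   by message delivery. */
void handle_requests(int64_t *table, size_t table_len,
                     const request_msg *msgs, size_t n_msgs)
{
    assert(n_msgs == 0 || (table != NULL && msgs != NULL));
    for (size_t i = 0; i < n_msgs; i++) {
        const request_msg *m = &msgs[i];
        assert(m->index < table_len);
        int64_t *slot = &table[m->index]; /* local data referenced by the message */
        switch (m->op) {
        case OP_ADD:
            *slot += m->operand;
            break;
        case OP_MAX:
            if (m->operand > *slot) *slot = m->operand;
            break;
        }
    }
}
```

Because the address of each load is known only after the message arrives, neither a compiler-inserted prefetch in a fixed loop nor a stride-based predictor can reliably anticipate it.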