Computing technology has advanced at a remarkable pace, with each subsequent generation of computing system increasing in performance, functionality, and storage capacity, often at reduced cost. However, despite these advances, many scientific and business applications still demand massive computing power, which can only be met by high performance computing systems. One particular type of computing system architecture that is often used in high performance applications is a parallel computing system.
One type of parallel computing system includes a host element that sends data to or receives data from a plurality of accelerator, or “target”, elements. For example, the host element generally includes a processor, portion thereof, or processing node that determines whether to send data, and what data to send, to the target elements, each of which is also generally a processor, portion thereof, or processing node. These parallel computing systems often provide benefits through acceleration, which is the act of off-loading computationally intensive functions to the target elements. However, acceleration only provides a benefit if the data processed by a target element can be moved to and from that target element efficiently. Moreover, target elements often have environmental constraints. Both of these factors complicate the design of conventional applications, which must take into account the size of the data to move to and from the target elements, as well as any environmental constraints. This, in turn, often adds to the development and execution costs for conventional applications, and prevents the applications from being reused on other platforms.
Moreover, any stored data required by an application is typically moved to local memory of the host element to later be used by a target element. It is thus often desirable to overlap the retrieval of new data with the processing of previously retrieved data to avoid I/O delays. However, depending on the computational complexity of a given application, it is generally difficult to perform such overlap. For example, computational requirements and data access patterns of the application, host element, or target elements are subject to change. As such, a data retrieval strategy that is optimal at one point may be sub-optimal at a second point. Moreover, environments of different parallel computing systems vary in the pipelines available to retrieve data, the memory available to store retrieved data, the number of target elements, and/or other resources that may be used to execute the application. As such, generic mechanisms to retrieve data may overload one type of parallel computing system while underutilizing another type of parallel computing system. In turn, this may lead to additional latencies or wasted resources.
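By way of illustration only, the overlap of data retrieval with processing described above may be sketched as a simple double-buffering loop. The sketch below is a minimal example under stated assumptions and does not represent any particular conventional system: the names fetch_chunk, process_chunk, and run_overlapped are hypothetical, the retrieval and processing steps are trivial stand-ins, and a single background thread takes the place of the retrieval pipeline.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_chunk(index):
    """Hypothetical stand-in for retrieving stored data into host memory."""
    return list(range(index * 4, index * 4 + 4))


def process_chunk(chunk):
    """Hypothetical stand-in for work off-loaded to a target element."""
    return sum(x * x for x in chunk)


def run_overlapped(num_chunks):
    """Process chunks in order while prefetching the next chunk in the background."""
    results = []
    if num_chunks <= 0:
        return results
    # One worker thread plays the role of the data retrieval pipeline (assumption).
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_chunk, 0)
        for i in range(num_chunks):
            chunk = pending.result()  # wait for the chunk retrieved earlier
            if i + 1 < num_chunks:
                # Start retrieving the next chunk before processing this one.
                pending = pool.submit(fetch_chunk, i + 1)
            results.append(process_chunk(chunk))
    return results


if __name__ == "__main__":
    print(run_overlapped(3))
```

The sketch fixes the prefetch depth at one chunk. A fixed depth of this kind is an example of a generic retrieval mechanism: it may leave a system with ample memory and retrieval bandwidth underutilized, while a larger fixed depth may overload a more constrained system.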
Consequently, there is a continuing need to more efficiently and accurately configure applications across a parallel computing system. Moreover, there is a continuing need to more efficiently and accurately overlap data retrieval and application execution in a parallel computing system.