High-performance computing (HPC) applications typically execute calculations on computing clusters that include many individual computing nodes connected by a high-speed network fabric. Typical computing clusters may include hundreds or thousands of individual nodes. Each node may include several processors, processor cores, or other parallel computing resources. A typical computing job therefore may be executed by a large number of individual processes distributed within each computing node and across the entire computing cluster.
In HPC workloads, a sequence of library functions from one or more libraries may be called. The result of a call to a library function is typically consumed immediately by the next call to another library function, after which the result is dead (i.e., never used again). Such temporary results are generally large arrays that incur significant space overhead. Additionally, library functions are typically provided as stand-alone binary code, generally with a defined interface, such as an application programming interface (API), through which the behavior of the library functions may be invoked. Typically, the interface enables an application compiler to call the functions of the library individually. As such, the sequence of library function calls may not be effectively optimized across the boundaries of the individual library functions.
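As an illustration of the temporary-result problem described above, the following minimal sketch contrasts two chained library calls with a single fused pass. The functions `scale`, `add`, `unfused`, and `fused` are hypothetical names introduced only for this example and do not belong to any particular library; plain Python lists stand in for the large arrays of a real HPC workload.

```python
from typing import List


def scale(xs: List[float], a: float) -> List[float]:
    """Hypothetical library function: returns a new array holding a * xs."""
    return [a * x for x in xs]


def add(xs: List[float], ys: List[float]) -> List[float]:
    """Hypothetical library function: returns a new array holding xs + ys."""
    return [x + y for x, y in zip(xs, ys)]


def unfused(a: float, xs: List[float], ys: List[float]) -> List[float]:
    # Each call is made individually through the library interface.
    tmp = scale(xs, a)   # allocates a full-size temporary array
    return add(tmp, ys)  # tmp is consumed here and is dead afterwards


def fused(a: float, xs: List[float], ys: List[float]) -> List[float]:
    # Same result in one pass with no intermediate array. This form is
    # only reachable if the optimizer can see across both function bodies.
    return [a * x + y for x, y in zip(xs, ys)]
```

Both versions compute the same values, but `unfused` materializes an intermediate array as large as its input, whereas `fused` does not. When `scale` and `add` are available only as separately compiled binary code, a compiler that sees just the two calls generally cannot derive the fused form on its own.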