Network operators and communication service providers typically rely on complex, large-scale computing environments, such as high-performance computing (HPC) and cloud computing environments. Due to the complexities associated with such large-scale environments, communication performance issues in HPC systems can be difficult to detect and correct. The problem is compounded for performance anomalies (e.g., incast), which can result from the dynamic behavior of an application running on the system or from the behavior of the system itself. For example, partitioned global address space (PGAS) applications that perform global communication at high message rates can incur network congestion that may be exacerbated by system noise, resulting in performance degradation.
Performance measurement techniques may rely on various metrics (e.g., timing sections of code) combined with hypothesis testing. For example, if code interacts with the network, such timing can reveal attributes of network behavior. However, HPC system networks achieve higher performance in part by decoupling software actions (e.g., requesting that a message be sent) from network hardware actions (e.g., actually sending the message). This decoupling can significantly reduce the utility of such software-based approaches. A further complication is that higher performance is typically achieved with higher message rates, which increase the overhead of approaches based on per-message code measurement.
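The decoupling described above can be illustrated with a minimal sketch. The following is a hypothetical model, not an actual HPC network interface: a background thread stands in for the network hardware, draining a queue of posted sends, while the names (`nic_queue`, `nic_worker`) are illustrative assumptions. A timer wrapped around the software-side post captures only the time to enqueue the request, revealing almost nothing about the much longer simulated wire transfer.

```python
import queue
import threading
import time

# Illustrative model only: a NIC decouples the software action (posting a
# send) from the hardware action (transmitting). A background thread plays
# the role of the hardware, draining posted sends asynchronously.
nic_queue = queue.Queue()
completed = []

def nic_worker():
    # Simulated network hardware: processes posted sends after the fact.
    while True:
        msg = nic_queue.get()
        if msg is None:
            break
        time.sleep(0.001)  # pretend the wire transfer takes 1 ms
        completed.append(msg)
        nic_queue.task_done()

worker = threading.Thread(target=nic_worker)
worker.start()

# "Per-message code measurement": time only the software-side post.
start = time.perf_counter()
nic_queue.put(b"payload")  # software action: request that a message be sent
post_time = time.perf_counter() - start

nic_queue.join()   # wait for the simulated hardware to finish the transfer
nic_queue.put(None)
worker.join()

# The timed post completes in microseconds, far short of the 1 ms
# transfer, so the measurement reveals little about network behavior.
print(f"posted in {post_time * 1e6:.1f} us; {len(completed)} message(s) delivered")
```

Scaling this up makes the second complication visible as well: at millions of messages per second, even the two `perf_counter()` calls per message add measurable overhead to the application being measured.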
Performance measurement may also rely on network probing technologies (e.g., ping, traceroute, etc.). These approaches were developed for large-scale networks with relatively low message rates at each endpoint and, often, with relatively large messages. HPC networks typically have dramatically higher message rates and sub-microsecond latencies, so applying these prior technologies in a straightforward way can dramatically increase overhead. This, in turn, requires specialized measurement techniques that avoid introducing system noise while capturing high-fidelity measurements. In some cases, a lack of network hardware support can lead to higher rates of expensive software processing; in other cases, the lack of hardware support may mean the data is simply not available. For example, HPC networks often provide no mechanism for examining the path taken by a specific message packet.
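The overhead mismatch can be made concrete with a sketch of the request/echo pattern that ping-style probing follows. This is an assumed minimal reconstruction over UDP loopback, not any particular tool's implementation: each probe costs multiple system calls plus timestamping, overhead that already dwarfs a sub-microsecond HPC fabric hop even before any real network is involved.

```python
import socket
import threading
import time

# Minimal ping-style round-trip probe over UDP loopback (illustrative
# sketch of the request/echo pattern used by commodity probing tools).

def echo_server(sock):
    # Echo a single datagram back to its sender, then exit.
    data, addr = sock.recvfrom(64)
    sock.sendto(data, addr)

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))            # OS-assigned port
port = server.getsockname()[1]
threading.Thread(target=echo_server, args=(server,)).start()

client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
start = time.perf_counter()
client.sendto(b"probe", ("127.0.0.1", port))   # syscall 1: send probe
reply, _ = client.recvfrom(64)                 # syscall 2: await echo
rtt = time.perf_counter() - start

client.close()
server.close()

# Even on loopback the round trip typically takes tens of microseconds,
# orders of magnitude above a sub-microsecond HPC network hop, so the
# probe's own overhead would dominate the quantity it tries to measure.
print(f"loopback RTT: {rtt * 1e6:.1f} us, reply={reply!r}")
```

Issuing such probes frequently enough to track a fabric operating at millions of messages per second would itself inject the kind of system noise the measurement is meant to detect.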