Network operators and communication service providers typically rely on complex, large-scale computing environments, such as high-performance computing (HPC) and cloud computing environments. Because of the complexity of these environments, communication performance issues in HPC systems can be difficult to detect and correct. The problem is more difficult still for performance anomalies such as incast, in which many nodes transmit to a single receiver at once; such anomalies can result from the dynamic behavior of an application running on the system or from the behavior of the system itself.
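The incast anomaly mentioned above can be illustrated with a minimal queueing sketch. The function and parameters below are purely hypothetical and not drawn from any particular system: many senders burst packets to one receiver whose link drains at a fixed rate, so the receiver-side queue depth (and therefore latency) grows with the number of simultaneous senders.

```python
# Hypothetical toy model of incast congestion (all names and numbers are
# illustrative, not from any real fabric): many senders transmit
# synchronized bursts to one receiver whose link drains at a fixed rate.

def peak_queue_depth(num_senders, burst_pkts, drain_rate, ticks):
    """Simulate one synchronized burst per tick and return the peak
    receiver queue depth observed over the simulation."""
    queue = 0
    peak = 0
    for _ in range(ticks):
        queue += num_senders * burst_pkts   # all senders burst at once
        peak = max(peak, queue)             # record the post-burst depth
        queue = max(0, queue - drain_rate)  # receiver link drains packets
    return peak

# With few senders the receiver keeps up; with many, the queue grows
# without bound over the simulated interval.
print(peak_queue_depth(4, 2, 10, 100))
print(peak_queue_depth(64, 2, 10, 100))
```

The point of the sketch is that the same per-sender load becomes pathological purely as a function of fan-in, which is why incast is tied to application communication patterns rather than to any single faulty component.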
For example, partitioned global address space (PGAS) applications that perform global communication at high message rates can incur network congestion, which may be exacerbated by system noise and result in performance degradation. HPC systems therefore depend on efficient use of the inter-node fabric; for many applications, performance is limited by fabric performance as much as by processor, memory, or mass storage performance. Further, because of the complexity of HPC system architectures, fabric performance may be difficult to measure and traffic patterns difficult to understand, making it hard to identify the root cause of performance problems within the fabric.