Demands by individuals, researchers, and enterprises for increased compute performance and storage capacity of computing devices have led to the development of various computing technologies to address those demands. For example, compute-intensive applications, such as enterprise cloud-based applications (e.g., software as a service (SaaS) applications), data mining applications, data-driven modeling applications, scientific computation problem solving applications, etc., typically rely on complex, large-scale computing environments (e.g., high-performance computing (HPC) environments, cloud computing environments, etc.) to execute the applications and to store voluminous amounts of data. Such large-scale computing environments can include tens of hundreds (e.g., enterprise systems) to tens of thousands (e.g., HPC systems) of multi-processor/multi-core network nodes connected via high-speed interconnects (e.g., fabric interconnects in a unified fabric).
To carry out such processor-intensive computations, various computing technologies, such as parallel computing and distributed computing, have been implemented to distribute workloads across different network computing devices. In support of such distributed workload operations, multiprocessor hardware architecture (e.g., multiple multi-core processors that share memory) has been developed to facilitate multiprocessing (i.e., coordinated, simultaneous processing by more than one processor) across local and remote shared memory systems using various parallel computer memory design architectures, such as non-uniform memory access (NUMA) and other distributed memory architectures.
Accordingly, memory requests from multiple interconnected network nodes can occupy the same shared buffer (e.g., super queues, table of requests, etc.) as local memory requests of a particular network node. However, such shared buffers are limited in size (e.g., containing tens of entries), which can result in other memory requests being queued until data returns from the memory subsystems for those memory requests presently in the shared buffer. As such, entries of the shared buffers tend to be occupied by those memory requests targeting memory that provides high-latency access (e.g., memory requests received from remote network nodes) or that is being over-utilized. As a result, other requests (e.g., local memory requests) targeting faster or non-congested memory (i.e., memory requests that would be served faster) can become starved in the core because no shared buffer entries are available to execute said memory requests.
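The starvation effect described above can be illustrated with a minimal simulation sketch. The model below is an assumption made purely for illustration and does not represent any particular hardware: the names `Request` and `simulate`, the four-entry buffer, the in-order admission policy, and the latency values (2 cycles for local memory, 100 cycles for remote memory) are all hypothetical. Each request occupies one shared buffer entry from admission until its data returns, and the sketch reports how long each kind of request waits for a free entry.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Request:
    kind: str      # "local" or "remote" (illustrative labels)
    arrival: int   # cycle at which the request is issued
    latency: int   # cycles until the memory subsystem returns data


def simulate(requests, buffer_entries):
    """Return the average cycles each kind of request waits for a buffer entry."""
    pending = deque(sorted(requests, key=lambda r: r.arrival))
    in_flight = []  # completion cycles of requests currently holding an entry
    waits = {}
    cycle = 0
    while pending:
        # Release entries whose data has returned by this cycle.
        in_flight = [done for done in in_flight if done > cycle]
        # Admit requests in arrival order while an entry is free.
        while (pending and pending[0].arrival <= cycle
               and len(in_flight) < buffer_entries):
            req = pending.popleft()
            in_flight.append(cycle + req.latency)
            waits.setdefault(req.kind, []).append(cycle - req.arrival)
        cycle += 1
    return {kind: sum(w) / len(w) for kind, w in waits.items()}


# Hypothetical workloads: eight fast local requests, with and without four
# slow remote requests that arrive first and fill the four-entry buffer.
local_only = [Request("local", arrival=i, latency=2) for i in range(1, 9)]
mixed = [Request("remote", arrival=0, latency=100) for _ in range(4)] + local_only

baseline = simulate(local_only, buffer_entries=4)
contended = simulate(mixed, buffer_entries=4)
print("local wait without remote traffic:", baseline["local"])
print("local wait with remote traffic:   ", contended["local"])
```

Under these assumed parameters, the local requests never wait when they run alone, whereas with the remote requests present they are held back until the high-latency entries drain, even though the local memory itself is uncongested.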