Field of the Disclosure
The present disclosure relates generally to processing systems and, more particularly, to caching information in processing systems.
Description of the Related Art
Processing systems such as accelerated processing units (APUs) can include multiple sockets. Each socket implements one or more central processing units (CPUs) and one or more accelerators, such as graphics processing units (GPUs), which are collectively referred to as processing units. Depending on the platform configuration, the sockets are associated with corresponding memories such as dynamic random access memories (DRAMs), high-bandwidth memories (HBMs), or other memory types. The processing units also include caches for caching data associated with the execution of scheduled tasks. The sockets are interconnected by corresponding bus interfaces. The APUs can support non-uniform memory access (NUMA) so that each processing unit can access memories associated with the same (local) socket or with one or more different (remote) sockets. The latency for a processing unit to access the DRAM or HBM associated with its local socket is lower than the latency for the processing unit to access memory associated with a remote socket.
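As an illustration only, the local/remote distinction can be modeled as a mapping from processing units and memories to sockets. The class and field names below are hypothetical, and the latency values are caller-supplied placeholders rather than measurements; the only property taken from the description is that local access is faster than remote access.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class NumaTopology:
    """Hypothetical NUMA model: which socket owns each unit and each memory."""
    unit_socket: dict[str, int]    # processing unit name -> socket index
    memory_socket: dict[str, int]  # memory name -> socket index
    local_latency_ns: float        # placeholder value, not a measurement
    remote_latency_ns: float       # placeholder value, not a measurement

    def __post_init__(self) -> None:
        # The description only requires local access to be faster than remote.
        if not self.local_latency_ns < self.remote_latency_ns:
            raise ValueError("local latency must be lower than remote latency")

    def is_local(self, unit: str, memory: str) -> bool:
        # An access is local when the unit and the memory share a socket.
        return self.unit_socket[unit] == self.memory_socket[memory]

    def access_latency(self, unit: str, memory: str) -> float:
        # Remote accesses pay the extra cost of crossing the socket interface.
        if self.is_local(unit, memory):
            return self.local_latency_ns
        return self.remote_latency_ns
```

For example, with `gpu1` on socket 1, an access to `hbm1` on socket 1 returns the local latency, while an access to `dram0` on socket 0 returns the remote latency.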
Maintaining cache coherency between caches associated with processing units on different sockets incurs a significant performance and latency cost because of the large number of cache probes that must be transmitted over the relatively low-bandwidth socket interfaces. For example, GPU operations are typically memory-intensive, and each memory transaction initiated by a GPU requires transmitting corresponding probes over the socket interfaces to the other sockets to maintain cache coherency between the sockets. Each memory transaction therefore incurs a latency cost equal to the time required to receive probe responses from the caches on the other sockets. Many operating systems are “NUMA-aware” and can therefore avoid most memory transactions that would cross the socket interfaces by storing data for processes running on a GPU in the local memory associated with the GPU's socket. However, some virtual memory resources for the GPU may still be stored in the physical memory associated with a different socket, e.g., if the local memory resources are exhausted, if the memory is used for inter-thread communication, or if the application that initiated the process is not NUMA-aware.
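A minimal sketch of the two effects described above, under simplifying assumptions: coherence is modeled as a broadcast in which every memory transaction sends one probe to each remote socket, and a NUMA-aware allocator places a page in local memory until that memory is exhausted, then spills to a remote socket. The names `SocketMemory`, `place_page`, and `broadcast_probes` are hypothetical and do not describe any particular operating system or coherence protocol.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class SocketMemory:
    """Hypothetical per-socket memory pool, tracked in whole pages."""
    socket_id: int
    capacity_pages: int
    used_pages: int = 0

    def has_room(self) -> bool:
        return self.used_pages < self.capacity_pages


def place_page(sockets: list[SocketMemory], local_id: int) -> int:
    """Return the socket whose memory receives a new page.

    Assumes sockets[i].socket_id == i for every socket.
    """
    # NUMA-aware policy: prefer memory on the requesting unit's own socket.
    local = sockets[local_id]
    if local.has_room():
        local.used_pages += 1
        return local.socket_id
    # Local memory exhausted: spill to the first remote socket with room,
    # so later accesses to this page cross the socket interface.
    for mem in sockets:
        if mem.socket_id != local_id and mem.has_room():
            mem.used_pages += 1
            return mem.socket_id
    raise MemoryError("no free pages on any socket")


def broadcast_probes(num_transactions: int, num_sockets: int) -> int:
    # Simplified coherence model: each transaction probes every remote socket once.
    return num_transactions * (num_sockets - 1)
```

With two sockets of two pages each, three page requests from socket 0 land on sockets 0, 0, and 1; the third page is the "local memory exhausted" case from the text. The other cases listed there (inter-thread communication, applications that are not NUMA-aware) are not modeled by this sketch.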