Modern processors often include caches to improve the performance of accessing external memory by providing copies of instructions and/or data in smaller, faster memories with shorter access latencies. In addition, such caches may provide support for fast virtual to physical address translations using a device such as a translation lookaside buffer (TLB) to cache virtual to physical address translations, for example from the system page tables in a paged virtual memory system. When a TLB matches a virtual address to one of the translations stored in the TLB, we may refer to such an event as a TLB hit, and the retrieved physical address can be used to access memory in a cache or in main memory more quickly. When a TLB fails to match a virtual address to one of the translations stored in the TLB, we may refer to such an event as a TLB miss or a page miss, and the translation proceeds by looking up the corresponding page tables in memory.
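By way of illustration only, the TLB hit and TLB miss behavior described above may be sketched in software as follows. The 4 KiB page size, the fully associative organization with least-recently-used replacement, and the names `TLB`, `translate`, `PAGE_SHIFT` and `PAGE_MASK` are assumptions made for this sketch and are not taken from any particular processor; the page tables are modeled as a simple dictionary.

```python
from collections import OrderedDict

PAGE_SHIFT = 12                      # assumption: 4 KiB pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


class TLB:
    """Small fully associative TLB with LRU replacement (illustrative only)."""

    def __init__(self, page_table, capacity=4):
        # page_table stands in for the system page tables:
        # virtual page number -> physical frame number.
        self.page_table = page_table
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        if vpn in self.entries:
            # TLB hit: the cached translation is used directly.
            self.hits += 1
            self.entries.move_to_end(vpn)
            pfn = self.entries[vpn]
        else:
            # TLB miss: fall back to the page tables in memory.
            # A KeyError here models an unmapped page.
            self.misses += 1
            pfn = self.page_table[vpn]
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)   # evict least recently used
            self.entries[vpn] = pfn
        return (pfn << PAGE_SHIFT) | offset
```

In this sketch, the first access to a page takes the slower page-table path and fills the TLB, while a second access to the same page is served from the cached translation.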
A typical multi-core processor or multi-core processing system may have various homogeneous or heterogeneous processing devices, for example, some number of homogeneous processor cores together with one or more heterogeneous graphics processors or video processors, etc. These homogeneous or heterogeneous processing devices may share virtual memory address spaces. Typically, the homogeneous processor cores may have cache hierarchies and TLBs as described above to cache virtual to physical address translations from the system page tables in a paged virtual memory system. In modern processors, the homogeneous processor cores may also include a page-miss handler (PMH) in hardware to perform page walks through system page tables more quickly. When one of these homogeneous processor cores tries to access a page in virtual memory that is not mapped to anything in physical memory, the memory access will cause a page fault, and the operating system will be invoked to “fix” the problem.
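By way of illustration only, a page walk that ends in a page fault, followed by the operating system fixing the problem, may be sketched as follows. The two-level table layout with 10 index bits per level, the nested-dictionary representation, and the names `PageFault`, `page_walk`, `access` and `demand_map` are hypothetical choices for this sketch; an actual PMH performs the walk in hardware.

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB pages
INDEX_BITS = 10                      # assumption: two levels, 10 bits each
INDEX_MASK = (1 << INDEX_BITS) - 1
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


class PageFault(Exception):
    """Raised when a virtual page is not mapped to physical memory."""

    def __init__(self, vaddr):
        super().__init__(f"page fault at {vaddr:#x}")
        self.vaddr = vaddr


def split(vaddr):
    """Return (directory index, table index) for a virtual address."""
    return ((vaddr >> (PAGE_SHIFT + INDEX_BITS)) & INDEX_MASK,
            (vaddr >> PAGE_SHIFT) & INDEX_MASK)


def page_walk(page_directory, vaddr):
    """Walk the two-level table; fault if either level is not present."""
    dir_index, table_index = split(vaddr)
    table = page_directory.get(dir_index)
    if table is None or table_index not in table:
        raise PageFault(vaddr)
    return (table[table_index] << PAGE_SHIFT) | (vaddr & OFFSET_MASK)


def demand_map(first_free_frame):
    """Build a toy OS fault handler that maps the next free frame."""
    next_frame = [first_free_frame]

    def handler(page_directory, vaddr):
        dir_index, table_index = split(vaddr)
        page_directory.setdefault(dir_index, {})[table_index] = next_frame[0]
        next_frame[0] += 1

    return handler


def access(page_directory, vaddr, os_fault_handler):
    """Translate vaddr, invoking the OS handler once on a page fault."""
    try:
        return page_walk(page_directory, vaddr)
    except PageFault as fault:
        os_fault_handler(page_directory, fault.vaddr)
        return page_walk(page_directory, vaddr)   # retry after the fix
```

The retry in `access` reflects the usual arrangement in which the faulting access is restarted once the operating system has mapped the page.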
The heterogeneous processing devices, such as graphics processors, may also have mechanisms such as TLBs for performing virtual to physical address translations, but typically there is some software that runs on the homogeneous processor cores and may communicate with drivers for the heterogeneous processing devices to set up page tables, etc. for some limited sharing of the paged virtual memory system with the heterogeneous processing devices. But unlike in the homogeneous processor cores, these mechanisms may have limited capabilities, lacking the ability to handle page faults, for example. Therefore, the pages that are shared with these heterogeneous processing devices may need to be locked down in memory, preventing the operating system from paging them in from, and out to, disk as demand dictates. Thus, the number of applications that can be supported may be limited, or the amount of physical memory required may be greater than it would otherwise have been. Further, the software that runs on the homogeneous processor cores to set up page tables, etc. for sharing paged virtual memory with the heterogeneous processing devices incurs a performance overhead, which may reduce the effectiveness of using the heterogeneous processing devices.
When software running on the homogeneous processor cores (e.g. the operating system) makes changes to the page tables, the corresponding translations that are cached in various TLBs throughout the system must be invalidated. This is known as a TLB shoot-down. One simple approach is to broadcast an invalidation message to all devices in the system, but this may cause unwanted communication congestion. Moreover, page tables set up for the heterogeneous processing devices may not be the same page tables used by the homogeneous processor cores, and the mechanisms used in the heterogeneous processing devices may have limited capabilities to invalidate their TLB entries and to recover new page table information on their own.
In order to avoid unwanted congestion, operating systems have used a form of lazy invalidation scheme. Since the operating system runs on the homogeneous processor cores and can track which processes, applications or threads are active on the various homogeneous processor cores, it can “shoot down,” or trigger an invalidation of, TLB entries only as necessary, i.e. only in the TLBs of cores that are actively running a process that uses that TLB entry, and/or in the TLBs of cores that may want to activate a process that will use that TLB entry. Thus, unnecessary activity and overhead for TLB shoot-downs can be avoided. But since the operating system does not run on the heterogeneous processing devices, it cannot track which processes are active on them and which of their TLB entries may need to be invalidated, even if we assume that the heterogeneous processing devices are capable of recovering the new page table information and/or generating page faults. Therefore, a broadcast would need to be sent to all of the heterogeneous processing devices potentially sharing the process virtual address space.
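By way of illustration only, the contrast between a lazy, targeted shoot-down on the processor cores and a broadcast to the heterogeneous processing devices may be sketched as follows. The `Core` class, its `pending` set of deferred invalidations, and the functions `targeted_shootdown` and `broadcast_shootdown` are hypothetical names for this sketch; they do not correspond to the interface of any particular operating system.

```python
class Core:
    """A processing unit with a private TLB keyed by (pid, vpn)."""

    def __init__(self, core_id):
        self.core_id = core_id
        self.active_pid = None       # process the OS has scheduled here
        self.tlb = {}                # (pid, vpn) -> pfn
        self.pending = set()         # invalidations deferred by the OS

    def invalidate(self, pid, vpn):
        self.tlb.pop((pid, vpn), None)

    def schedule(self, pid):
        """Activate pid, first applying any invalidations deferred for it."""
        for key in [k for k in self.pending if k[0] == pid]:
            self.invalidate(*key)
            self.pending.discard(key)
        self.active_pid = pid


def targeted_shootdown(cores, pid, vpn):
    """Interrupt only the cores running pid; defer the rest (lazy scheme).

    Returns the ids of the cores that were actually signalled.
    """
    signalled = []
    for core in cores:
        if core.active_pid == pid:
            core.invalidate(pid, vpn)
            signalled.append(core.core_id)
        else:
            core.pending.add((pid, vpn))   # applied when pid is activated
    return signalled


def broadcast_shootdown(devices, pid, vpn):
    """Signal every device, since the OS cannot see what each one runs.

    Returns the number of invalidation messages sent.
    """
    for device in devices:
        device.invalidate(pid, vpn)
    return len(devices)
```

In the targeted case only the cores running the affected process receive a message, whereas the broadcast case sends one message per device regardless of whether that device holds the stale translation.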
As the number of processing cores and other heterogeneous processing devices accessing caches or using virtual memory increases, there may be good reason to expect problems such as additional memory congestion and conflicts when virtual memory space is shared by more of these devices. To date, potential solutions to such reasonably expected problems have not been adequately explored.