Graphics processing units (GPUs) have emerged as a first-class computing platform. The massive data parallelism of GPUs was first leveraged by highly-structured parallel tasks such as matrix multiplications. However, GPUs have more recently found use across a broader range of application domains such as deep learning, graph analytics, data analytics, computer-aided-design, oil and gas exploration, medical imaging, and computational finance. Memory accesses from many of these emerging applications demonstrate a larger degree of irregularity—accesses are less structured, and are often data dependent. Consequently, they show low spatial locality (e.g., if a memory location is accessed at a particular time, then it is unlikely that the memory location and/or nearby memory locations will be accessed in the near future).
It has been demonstrated that such irregular memory accesses may slow down an irregular GPU application by up to 3.7-4 times due to address translation overheads alone. Further, the negative impact of access irregularity may be greater on the address translation than on the data access itself. For example, compared to one memory access on a cache miss, a miss in the translation lookaside buffer (TLB) triggers a page table walk that may take up to four sequential memory accesses in the prevalent x86-64 or ARM computer architectures.
Typically, a hardware page table walker (PTW) walks (accesses) the in-memory page table to find the desired translation. To match a GPU's need for large memory bandwidth, hardware designers often provision multiple independent PTWs. However, TLB misses from irregular GPU applications, and their corresponding page table walk requests, often happen in bursts. Consequently, they add significant queuing delay in the critical path of execution. A cache access cannot begin until the corresponding address translation completes, since modern GPUs tend to employ physically-tagged caches.
FIG. 1 is a schematic block diagram illustrating an example of a traditional apparatus that includes a traditional system architecture 100. The traditional system architecture 100 includes a GPU 102, a central processing unit (CPU) 120, an input output memory management unit (IOMMU) 132, and device or system memory 130. GPUs, such as GPU 102, are designed for massive data-parallel processing that operates on hundreds to thousands of data elements concurrently. The hardware resources of GPU 102 are typically organized in a hierarchy to effectively manage the massive concurrency.
The top part of FIG. 1 depicts the architecture of a typical GPU 102. Compute Units (CUs) 104 are the basic computational blocks of the GPU 102, and there are typically 8 to 64 CUs 104 in a GPU 102. Each CU 104 includes multiple Single-Instruction-Multiple-Data (SIMD) units 108, and each SIMD unit 108 has multiple (e.g., 16) lanes of execution. GPU 102 threads are scheduled to SIMD units 108 in a bundle called a wavefront (or warp), which is typically composed of 32 or 64 threads. These wavefronts execute the same instruction with different data on a SIMD unit 108. A Single-Instruction-Multiple-Thread (SIMT) execution model is mapped to the SIMD units 108 by using execution masks in the case where GPU 102 threads in the same wavefront follow different execution paths.
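As a rough illustration of how execution masks allow threads of one wavefront to follow different execution paths, the following Python sketch emulates a divergent branch on a list of per-lane values. The function name `run_divergent_branch` and the branch itself (double even values, increment odd values) are hypothetical and chosen only for illustration; real hardware applies the mask per lane inside the SIMD unit rather than in software.

```python
def run_divergent_branch(values):
    """Emulate `x = x * 2 if x is even else x + 1` across the lanes of one wavefront.

    Both paths are issued to every lane in turn; the execution mask decides
    which lanes actually commit a result on each pass.
    """
    # One mask bit per lane: True where the lane takes the "even" path.
    mask = [v % 2 == 0 for v in values]

    # Pass 1: only lanes with the mask bit set execute the "even" path.
    result = [v * 2 if m else v for v, m in zip(values, mask)]

    # Pass 2: the mask is inverted and the remaining lanes execute the "odd" path.
    result = [r if m else r + 1 for r, m in zip(result, mask)]
    return result


lanes = [1, 2, 3, 4]
print(run_divergent_branch(lanes))
```

Because both paths are issued one after the other, a divergent wavefront in this sketch takes two passes where a uniform one would take a single pass.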
The memory resources of the GPU 102 are also arranged in a hierarchy. For example, each CU 104 has a private L1 data cache 107 and a scratchpad 110 that are shared across the SIMD units 108 within the CU 104. When several data elements accessed by a SIMD instruction reside in the same cache line, a hardware coalescer 113 in the GPU 102 combines these requests into a single cache access to gain efficiency. Further, a large L2 cache 112 is shared across all CUs 104 in a GPU 102. Also, a global data store 114 is scratchpad memory for the GPU 102, and is used by GPU applications to improve performance.
General-purpose GPU (GPGPU) programming languages, such as OpenCL and CUDA, expose to the programmer a hierarchy of execution groups that map directly to hardware resources. A work-item or thread executes on a SIMD lane, while the 32 to 64 work-items that make up a wavefront execute in a lock-step fashion on a SIMD unit 108 of a CU 104. Several such wavefronts form a work-group that executes on a single CU 104. Work-items in a work-group can efficiently synchronize and share data through the on-chip scratchpad memory 110. Additionally, multiple work-groups make up a GPU 102 program or kernel.
GPUs 102 have adopted several key programmability-enhancing features as they mature to become first-class computational units. One such feature is a shared virtual memory (SVM) across the CPU 120 and the GPU 102. For example, full compliance with industry promoted standards like the Heterogeneous System Architecture (HSA) requires GPUs 102 to support SVM. To enable shared virtual memory in modern commercial GPUs 102, a shared page table between CPU 120 and GPU 102 via the IO Memory Management Unit (IOMMU) hardware 132 may be used.
The bottom part of FIG. 1 depicts the key hardware components and/or entities of the SVM in a traditional system architecture 100. Conceptually the key enabler for SVM is the GPU's 102 ability to walk the same page table (e.g., the x86-64 page table) as the CPU 120 via the IOMMU 132. By sharing the same page table, a process running on a GPU 102 may observe the same virtual-to-physical address mapping as a process running on a CPU 120 and consequently, share the same virtual memory.
In other words, by sharing the same virtual memory, virtual memory addresses that are misses in the GPU's 102 TLBs 106 may be mapped to physical memory addresses using the IOMMU 132 of the CPU 120. FIG. 2 illustrates an example of an in-memory page table 208 and virtual addresses 202, 204, and 206. In particular, FIG. 2 will be used to describe how a hardware page table walker 134 in an IOMMU 132 translates three example virtual addresses 202, 204, 206 by walking an in-memory page table 208. For example, if the system architecture 100 is an x86-64-based implementation, a processor's page table is structured as a 512-ary radix tree with four levels, and thus, requires four memory accesses to complete a page walk. Each of the virtual memory addresses 202, 204, 206 is first broken into a virtual page number 220 and a page offset 222. By walking the page table 208, the virtual page number 220 is translated to a physical page frame number (PPN), which is eventually concatenated with the page offset 222 to generate the final physical address.
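The split into page number and offset, and the final concatenation, can be sketched in a few lines of Python. This is a minimal sketch assuming 4 KB pages; the flat dictionary `vpn_to_ppn` is a hypothetical stand-in for the result of a page walk and is not the radix-tree page table 208 itself.

```python
PAGE_SHIFT = 12                      # 4 KB pages: the low 12 bits are the page offset
OFFSET_MASK = (1 << PAGE_SHIFT) - 1


def translate(va, vpn_to_ppn):
    """Translate a virtual address using a (hypothetical) flat VPN-to-PPN mapping."""
    vpn = va >> PAGE_SHIFT           # virtual page number
    offset = va & OFFSET_MASK        # the offset is unchanged by translation
    ppn = vpn_to_ppn[vpn]            # raises KeyError if the page is unmapped
    return (ppn << PAGE_SHIFT) | offset


# Illustrative values only: VPN 0x12345 maps to PPN 0xABCDE.
print(hex(translate(0x12345678, {0x12345: 0xABCDE})))
```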
In one example, a page walk starts by looking up the 4 KB page containing the root node 210 of a page table (also called level L4 nodes). The root node 210 has 512 pointers to nodes at the next level of the tree (referred to as the L3 nodes). The top nine bits of a 48-bit wide virtual address 202, 204, 206 (bits 47 to 39), known as the L4 index, are used to index into the root node 210 to find the pointer to the appropriate L3 node (e.g., L3 node 212). An L3 node, such as L3 node 212, also contains 512 pointers to nodes in the next level (referred to as the L2 level nodes). The next nine bits of the VA 202, 204, 206 (bits 38-30), known as the L3 index, are used to index into the L3 node 212 to reach an appropriate node in the L2 level (e.g., L2 node 214). Similarly, bits 29 to 21, known as the L2 index, are used to index into the L2 node 214 to reach a node in the leaf level (referred to as the L1 level). An L1 node (e.g., L1 nodes 216, 218) contains 512 physical page frame numbers, each corresponding to a 4 KB virtual address page. Bits 20 to 12 of the VA, known as the L1 index, are used to index into the L1 nodes 216, 218 to find the desired physical page frame number. Herein, levels L4 to L2 are referred to as upper levels of the page table 208.
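The index extraction and the four-step walk described above can be sketched as follows. This is an illustrative model only: each page table node is represented as a Python dictionary keyed by its 9-bit index, and the names `split_virtual_address` and `walk_page_table` are assumptions made for this sketch rather than part of any hardware interface.

```python
INDEX_MASK = 0x1FF   # nine bits per level, i.e. 512 entries per node
OFFSET_MASK = 0xFFF  # twelve bits of page offset for a 4 KB page


def split_virtual_address(va):
    """Return (L4 index, L3 index, L2 index, L1 index, page offset) of a 48-bit address."""
    if not 0 <= va < (1 << 48):
        raise ValueError("expected a 48-bit virtual address")
    return (
        (va >> 39) & INDEX_MASK,  # bits 47 to 39
        (va >> 30) & INDEX_MASK,  # bits 38 to 30
        (va >> 21) & INDEX_MASK,  # bits 29 to 21
        (va >> 12) & INDEX_MASK,  # bits 20 to 12
        va & OFFSET_MASK,         # bits 11 to 0
    )


def walk_page_table(root, va):
    """Walk a dict-based four-level table; return (physical address, memory accesses)."""
    *indices, offset = split_virtual_address(va)
    node = root
    accesses = 0
    for index in indices:         # L4, L3, L2, then the L1 leaf
        node = node[index]        # one memory access per level
        accesses += 1
    ppn = node                    # the leaf lookup yields the page frame number
    return (ppn << 12) | offset, accesses


# Toy table with a single mapping; all values are illustrative.
root = {1: {2: {3: {4: 0x55}}}}
va = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0x5
print(split_virtual_address(va))
print(walk_page_table(root, va))
```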
In some examples, the first two virtual addresses (VA0 202 and VA1 204) share the same entries for all upper levels of the page table (e.g., L4 node 210, L3 node 212, and L2 node 214). Similarly, the third virtual address (VA2 206) shares entries in the first two upper levels (e.g., L4 node 210 and L3 node 212) of the page table 208. Traditional hardware page walk caches (PWCs) may exploit this redundancy. For example, traditional PWCs (not displayed in FIG. 1) are located within the IOMMU 132 and store recently-used page table entries from the upper levels of a page table. Hits in PWCs reduce the number of memory accesses needed for a walk by up to three memory accesses, depending upon which upper level (L4, L3 or L2) produces the hit. For example, a hit for all of the upper levels (L4, L3 and L2) needs just one memory request to complete the walk, which accesses only the leaf node (L1). In contrast, a hit for only the root level requires three memory accesses. In the worst case, a complete miss in the PWCs requires four memory accesses to complete a page walk.
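The relationship between the depth of a PWC hit and the number of remaining memory accesses can be illustrated with a small model. The sketch below assumes an unbounded PWC represented as a set of upper-level index prefixes; real PWCs have limited capacity and a replacement policy, neither of which is modeled here. The function name and the example index tuples are hypothetical.

```python
def memory_accesses_for_walk(indices, pwc):
    """Return the memory accesses needed for one walk and update the PWC.

    indices: (L4 index, L3 index, L2 index, L1 index) of the virtual address.
    pwc:     set of upper-level index prefixes cached by earlier walks.
    """
    upper = tuple(indices[:3])
    hit_depth = 0
    for depth in (3, 2, 1):                  # prefer the deepest cached upper level
        if upper[:depth] in pwc:
            hit_depth = depth
            break
    for depth in (1, 2, 3):                  # cache the upper-level entries just used
        pwc.add(upper[:depth])
    return 4 - hit_depth                     # every level skipped saves one access


pwc = set()
va0 = (0, 1, 2, 3)   # illustrative indices
va1 = (0, 1, 2, 4)   # shares L4, L3 and L2 entries with va0
va2 = (0, 1, 5, 6)   # shares only L4 and L3 entries with va0
print([memory_accesses_for_walk(v, pwc) for v in (va0, va1, va2)])
```

In this model the first walk misses completely and needs four accesses, the second hits on all upper levels and needs one, and the third hits on the L4 and L3 entries and needs two.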
An IOMMU 132 typically houses multiple independent page table walkers 134 (e.g., 8-16) to concurrently service several page table walk requests. Multiple walkers 134 are important since GPUs 102 demand high memory bandwidth and, consequently, often send many concurrent walk requests. The IOMMU 132 itself has two levels of TLBs and logic 138 to cache recently used address translations, but they are relatively small and designed to primarily serve devices that do not have their own TLBs (e.g., a Network Interface Controller). Page walk requests typically queue up in the IOMMU's page walk request buffer 136 (e.g., IOMMU buffer) before beginning the walk. When a walker 134 becomes free (e.g., after it finishes servicing a walk), it starts servicing a new request from the IOMMU buffer 136, taking requests in the order in which they arrived.
GPUs 102 typically have a sophisticated TLB hierarchy 106, 118 to reduce the number of page walks. A TLB, such as a TLB from the hierarchy 106, 118, caches recently-used address translation entries to avoid accessing in-memory page tables on every memory access. When multiple data elements accessed by a SIMD instruction reside on the same page, only a single virtual-to-physical address translation is needed. The hardware coalescer 113 exploits this locality to look up the TLB hierarchy 106, 118 only once for such same-page accesses. Each CU 104 has a private L1 TLB 106. A miss in an L1 TLB 106 triggers a lookup in a larger L2 TLB 118 that is shared across all the CUs 104 in the GPU 102. A translation request that misses in both levels 106, 118 is forwarded to the IOMMU 132.
In some variations, an address translation request is generated when executing a SIMD memory instruction (load/store). The hardware coalescer 113 merges multiple requests to the same page (e.g., 4 KB) generated by the same SIMD memory instruction. The hardware coalescer 113 looks up the virtual address translation request in the GPU's L1 TLB 106 and then, if the request misses in the L1 TLB 106, in the GPU's shared L2 TLB 118. On a miss in the GPU's L2 TLB 118, the GPU 102 sends the request to the IOMMU 132. The IOMMU 132 first looks up the request in the IOMMU's TLBs and logic 138.
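The effect of merging same-page requests can be sketched by counting the distinct 4 KB pages touched by the lanes of one SIMD memory instruction. The function name `coalesce_translation_requests` and the two access patterns below are illustrative assumptions, not measurements from any application.

```python
PAGE_SIZE = 4096  # 4 KB pages


def coalesce_translation_requests(lane_addresses):
    """Return the distinct virtual page numbers touched by one SIMD memory instruction.

    Each distinct page needs one translation request; pages are returned in
    first-seen order.
    """
    return list(dict.fromkeys(addr // PAGE_SIZE for addr in lane_addresses))


base = 0x10000
# Regular pattern: 64 lanes read consecutive 4-byte elements (256 bytes in total).
regular = [base + 4 * lane for lane in range(64)]
# Irregular pattern: each lane touches a different page (8 KB stride, hypothetical).
irregular = [base + 8192 * lane for lane in range(64)]

print(len(coalesce_translation_requests(regular)))
print(len(coalesce_translation_requests(irregular)))
```

Under these assumptions the regular pattern needs a single translation request, while the irregular pattern needs one request per lane.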
On a miss in the IOMMU's TLBs 138, the IOMMU 132 queues up the request as a page walk request in the IOMMU buffer 136. When an IOMMU's page table walker 134 becomes free, it selects a pending request from the IOMMU buffer 136 in first-come-first-serve order. The page table walker 134 first performs a lookup in the PWC, and then completes the walk of the page table, generating one to four memory accesses. On finishing a walk, the desired translation is returned to the IOMMU 132, the GPU TLBs 106, 118, and the CUs 104.
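The first-come-first-serve servicing of buffered walk requests can be modeled with a short simulation. This is a simplified sketch: arrival times and walk latencies are in arbitrary, hypothetical time units, requests are assumed to be listed in arrival order, and the function name `simulate_fcfs_walkers` is chosen for illustration only.

```python
import heapq


def simulate_fcfs_walkers(arrivals, walk_latencies, num_walkers):
    """Return the queuing delay of each walk request under first-come-first-serve.

    arrivals:       request arrival times, in non-decreasing order.
    walk_latencies: time each walk occupies a walker once it starts.
    num_walkers:    number of independent page table walkers.
    """
    free_at = [0] * num_walkers          # time at which each walker becomes free
    heapq.heapify(free_at)
    delays = []
    for arrival, latency in zip(arrivals, walk_latencies):
        earliest_free = heapq.heappop(free_at)
        start = max(arrival, earliest_free)
        delays.append(start - arrival)   # time spent waiting in the buffer
        heapq.heappush(free_at, start + latency)
    return delays


# Four requests arriving in a burst versus spread out, with two walkers.
print(simulate_fcfs_walkers([0, 0, 0, 0], [100] * 4, num_walkers=2))
print(simulate_fcfs_walkers([0, 100, 200, 300], [100] * 4, num_walkers=2))
```

In this model the burst leaves half of the requests waiting for a full walk latency, whereas the same requests spread over time incur no queuing delay.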
CPUs, such as the traditional CPU 120, include one or more cores (e.g., core 0 122 and core 1 124) for executing instructions and/or providing address translations between virtual and physical memory addresses. In some examples, the cores 122 and 124 include one or more TLBs, one or more PTWs, and a private cache (L1). The cores 122 and 124 may share a shared cache (e.g., L2 and/or L3 cache) 126. The functionalities of the components and/or entities of the CPU 120 (e.g., TLBs, PTWs, L1 and L2 caches, and the cores 122 and 124) are similar to the functionalities of the corresponding components in the GPU 102 described above. Further, the CPU 120 includes a memory controller 128 that is connected to the IOMMU 132 and memory 130. Memory 130 is system memory. The memory controller 128 performs memory accesses for the cores 122, 124, and/or the IOMMU 132. For example, the memory controller 128 accesses the memory 130 and/or the shared cache 126 based on requests from the IOMMU 132.
As explained above, irregular memory accesses cause memory access divergence in the GPU's execution model for translating between virtual and physical addresses. For example, although different work-items within a wavefront execute the same instruction, they access data from distinct page entries. This renders the hardware coalescer 113 ineffective, and consequently, many concurrent TLB accesses are generated by execution of a single SIMD load/store instruction. Furthermore, many of these requests often miss in the TLB owing to low access locality of irregular applications (e.g., subsequent requests are unlikely to access the same memory location as a previous request in the near future). Eventually, these address translation requests queue up in the IOMMU buffer 136 to be serviced by the page table walkers 134. The significant queuing latency at the PTWs 134 ultimately slows down applications. Further, due to the many distinct page entries, the PWCs are also rendered less effective. Additionally, with caching (e.g., using PWCs), even if the required page table entry (PTE) is available in the PWCs of the IOMMU 132, a separate PTW 134 still needs to perform that walk. Often the unavailability of free PTWs 134 leads to significant queuing delay in servicing walk requests. Furthermore, besides the performance implications, a PWC for PTEs needs to be kept coherent since the latest PTEs may reside in a cache of the CPU 120 (e.g., a core's private L1 cache). Extending coherence to a non-core component, like the IOMMU 132, incurs significant design and energy overheads. Accordingly, there exists a need for one or more improved methods and devices to address one or more of the above-noted drawbacks.