Field of the Invention
The present invention is generally directed to computing systems. More particularly, the present invention is directed to computer registers in the context of memory address translations.
Background Art
The desire to use a graphics processing unit (GPU) for general computation has become much more pronounced recently due to the GPU's exemplary performance per unit power and/or cost. The computational capabilities for GPUs, generally, have grown at a rate exceeding that of the corresponding central processing unit (CPU) platforms. This growth, coupled with the explosion of the mobile computing market (e.g., notebooks, mobile smart phones, tablets, etc.) and its necessary supporting server/enterprise systems, has been used to provide a specified quality of desired user experience. Consequently, the combined use of CPUs and GPUs for executing workloads with data parallel content is becoming a volume technology.
However, GPUs have traditionally operated in a constrained programming environment, available primarily for the acceleration of graphics. These constraints arose from the fact that GPUs did not have as rich a programming ecosystem as CPUs. Their use, therefore, has been mostly limited to two dimensional (2D) and three dimensional (3D) graphics and a few leading edge multimedia applications, which are already accustomed to dealing with graphics and video application programming interfaces (APIs).
With the advent of multi-vendor supported OpenCL® and DirectCompute®, standard APIs and supporting tools, the limitations of the GPUs in traditional applications has been extended beyond traditional graphics. Although OpenCL and DirectCompute are a promising start, there are many hurdles remaining to creating an environment and ecosystem that allows the combination of a CPU and a GPU to be used as fluidly as the CPU for most programming tasks.
Existing computing systems often include multiple processing devices. For example, some computing systems include both a CPU and a GPU on separate chips (e.g., the CPU might be located on a motherboard and the GPU might be located on a graphics card) or in a single chip package. Both of these arrangements, however, still include significant challenges associated with (i) efficient scheduling, (ii) providing quality of service (QoS) guarantees between processes, (iii) programming model, (iv) compiling to multiple target instruction set architectures (ISAs), and (v) separate memory systems—all while minimizing power consumption.
For example, the discrete chip arrangement forces system and software architects to utilize chip to chip interfaces for each processor to access memory. While these external interfaces (e.g., chip to chip) negatively affect memory latency and power consumption for cooperating heterogeneous processors, the separate memory systems (i.e., separate address spaces) and driver managed shared memory create overhead that becomes unacceptable for fine grain offload.
In another example, the GPU, along with other peripherals (e.g., input/output (I/O) devices) may need to access information stored in memory of a computer system. For enhanced performance, the computer system can provide virtual memory capabilities for the I/O device. Accordingly, multiple I/O devices may request information based on corresponding virtual addresses, and the computer system translates the virtual addresses to a physical addresses corresponding to memory. An input/output memory management unit (IOMMU) may provide address translation services between the multiple I/O devices and memory.
The computing system can also provide multiple virtualized systems, including virtualized guest operating systems (OSes) managed by a hypervisor. In order to provide access to the I/O devices, the computer system can virtualize I/O devices for each guest OS. That is, the hypervisor manipulates memory by coordinating conversions from virtual memory addresses to physical memory addresses for each of the I/O devices for each of the virtualized guest OSes. This process is performed so that each virtualized system can access the I/O devices as though each guest OS was the only OS accessing the I/O devices.
Thus, the hypervisor can become a bottleneck as it executes software routines to accommodate all of the requests for address translations. Since each of these translations is associated with accessing the IOMMU, the software based operation of the hypervisor represents significant overhead. This overhead can degrade performance. Additional overhead is associated with organizing and manipulating memory structures in memory for efficient and concurrent access to memory translations to accommodate multiple I/O devices.