Graphics processing units (GPUs) are typically designed to facilitate fast and efficient execution of common graphics processing operations, for example geometric processing functions such as dot, cross and matrix product calculations on vector inputs. Since GPUs are optimised for such operations, they can typically complete these tasks much faster than a central processing unit (CPU), even where the CPU includes SIMD (single instruction multiple data) hardware.
In a typical system-on-chip (SoC) environment, a CPU and a GPU may be coupled together via a bus infrastructure, with shared memory being utilised as a mechanism for the CPU to setup batches of operations to be performed by the GPU. Such a known arrangement is shown in FIG. 1, where a CPU 10 is coupled with a GPU 20 via a bus network 30, with shared memory 40 also being coupled to the bus network 30. It will be appreciated that the bus network 30 may incorporate one or more separate buses, and the shared memory 40 may or may not include one or more levels of cache.
The manner in which the CPU can setup a batch of operations for execution by the GPU is shown schematically by the arrows numbered 1 through 4 in FIG. 1, with the sequence of steps being illustrated in more detail by the flow diagram of FIG. 2. In particular, as indicated by arrow 1, and discussed at step 100 in FIG. 2, the CPU first stores one or more data structures to the shared memory 40. As will be understood by those skilled in the art, each data structure will have a predetermined format understood by both the CPU and the GPU, and the actual data provided within the data structure may identify not only data values on which the GPU is to operate, but may also identify instructions defining the graphics processing operations to be performed by the GPU. It will also be understood that whilst the instructions and data values may be specified directly in the data structure, the data structure may also include one or more pointers identifying memory addresses at which certain instructions and/or data values may be found.
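Purely by way of illustration, a data structure of the kind stored at step 100 might be laid out as follows. The field names, widths and the use of shared-memory addresses as pointers are all assumptions for the sketch; an actual CPU/GPU pair would agree on its own predetermined format.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical batch descriptor written by the CPU into shared memory.
 * Instructions and data values may be embedded directly, or referenced
 * indirectly via pointers (here, 64-bit shared-memory addresses). */
typedef struct {
    uint32_t op_count;    /* number of GPU operations in the batch      */
    uint32_t flags;       /* e.g. bit 0: result data is expected        */
    uint64_t instr_ptr;   /* shared-memory address of the instructions  */
    uint64_t input_ptr;   /* shared-memory address of the input data    */
    uint64_t output_ptr;  /* shared-memory address for the results      */
} gpu_batch_desc;
```

Because both sides interpret the same bytes, the layout (including any padding) must be fixed by convention; on a typical 64-bit ABI the two 32-bit fields pack together so that the first pointer sits at byte offset 8.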
As shown by the arrow 2 in FIG. 1, and illustrated by step 105 in FIG. 2, the CPU, in addition to storing one or more data structures in the shared memory, will also typically write various control information into one or more memory mapped control registers 25 within the GPU 20. Since the control registers 25 are memory mapped, they can be accessed directly by the CPU over the bus network 30 by the CPU issuing access requests specifying the relevant memory addresses. Via this route, certain basic control parameters of the GPU can be set under the control of the CPU 10. Typically, one of the control registers 25 will have a value stored therein identifying at least one data structure in shared memory to be accessed by the GPU in order to begin processing of the batch of graphics processing operations.
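Since the control registers 25 are memory mapped, the programming performed at step 105 reduces to ordinary stores to fixed addresses. A minimal sketch is given below; the register offsets and the "write 1 to start" convention are invented for illustration, and on real hardware a memory barrier would typically also be required between storing the descriptor and the final start write.

```c
#include <stdint.h>

/* Hypothetical layout of the GPU's memory-mapped control block;
 * register indices are illustrative only. */
enum {
    GPU_REG_DESC_ADDR = 0,  /* shared-memory address of the descriptor */
    GPU_REG_CONTROL   = 1,  /* writing 1 starts the batch              */
};

/* Write one 64-bit control register.  The volatile qualifier prevents
 * the compiler from reordering or eliding the store. */
static inline void gpu_reg_write(volatile uint64_t *regs,
                                 unsigned reg, uint64_t value)
{
    regs[reg] = value;
}

/* Point the GPU at a descriptor in shared memory, then start the batch. */
void gpu_kick(volatile uint64_t *regs, uint64_t desc_shared_addr)
{
    gpu_reg_write(regs, GPU_REG_DESC_ADDR, desc_shared_addr);
    gpu_reg_write(regs, GPU_REG_CONTROL, 1);
}
```

In a real driver, `regs` would be obtained by mapping the GPU's register block into the CPU's address space, so that these stores travel over the bus network 30 as access requests to the relevant memory addresses.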
Once the memory mapped control registers have been set, and the relevant data structure(s) have been stored in the shared memory 40, the GPU will then begin its operation, using the information in the memory mapped control registers to locate and retrieve the relevant data structure(s) from shared memory 40. As shown by arrow 3 in FIG. 1, and illustrated by step 110 in FIG. 2, this will cause the GPU to perform the required graphics processing operations as defined by the data structure(s), and typically the results will be stored back to shared memory 40 starting at a predetermined address.
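The GPU's work at step 110 — consume the descriptor, perform each operation, store the results — can be modelled in software as follows. The descriptor fields are hypothetical, and the 4-component dot product is chosen merely as a representative graphics primitive of the kind mentioned earlier.

```c
#include <stddef.h>

/* Hypothetical view of the descriptor as the GPU interprets it
 * (field names invented for illustration). */
typedef struct {
    size_t       n;    /* number of vector pairs in the batch        */
    const float *a;    /* input vectors, 4 floats each               */
    const float *b;
    float       *out;  /* shared-memory region for the results       */
} batch_desc;

/* Model of the GPU consuming a batch: each operation here is a
 * 4-component dot product, a typical geometric primitive. */
void gpu_run_batch(const batch_desc *d)
{
    for (size_t i = 0; i < d->n; i++) {
        const float *a = d->a + 4 * i;
        const float *b = d->b + 4 * i;
        d->out[i] = a[0]*b[0] + a[1]*b[1] + a[2]*b[2] + a[3]*b[3];
    }
}
```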
When the GPU 20 completes performance of the batch of operations specified by the data structure(s), it will issue an interrupt to the CPU over the IRQ path 50, as shown by the arrow 4 in FIG. 1 and illustrated by step 115 in FIG. 2. On receipt of the interrupt, the CPU 10 will typically execute an interrupt service routine (ISR) in order to retrieve the result data from shared memory 40, whereafter that result data can be used by the CPU during the performance of subsequent operations by the CPU.
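The CPU side of step 115 can likewise be caricatured: an interrupt service routine that, on receipt of the interrupt over the IRQ path 50, copies the result data out of shared memory for subsequent use. The `done` flag, result layout and acknowledgement convention below are assumptions for the sketch, not features disclosed above.

```c
#include <stdint.h>
#include <string.h>

#define RESULT_WORDS 4

/* Hypothetical region of shared memory written by the GPU before it
 * raises the interrupt over the IRQ line. */
typedef struct {
    volatile uint32_t done;                 /* set by the GPU on completion */
    uint32_t          result[RESULT_WORDS]; /* result data for the batch    */
} gpu_result_area;

/* Destination into which the ISR copies results for later CPU use. */
static uint32_t cpu_local_results[RESULT_WORDS];

/* Interrupt service routine executed by the CPU on receipt of the
 * GPU's interrupt: retrieve the result data from shared memory 40. */
void gpu_isr(gpu_result_area *shared)
{
    if (shared->done) {
        memcpy(cpu_local_results, (const void *)shared->result,
               sizeof cpu_local_results);
        shared->done = 0;  /* acknowledge, ready for the next batch */
    }
}
```

Note that the CPU code which depends on these results cannot make progress until this routine has run, which is precisely the latency dependence discussed below.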
For common graphics processing operations, the GPU 20 can typically achieve a much higher throughput than would be the case if those operations were instead performed on the CPU 10, and hence the use of the GPU can significantly increase performance of the overall system. However, with reference to the above description of FIGS. 1 and 2, it will be appreciated that there is a significant setup time involved in setting up the operations to be performed by the GPU, due to the need to communicate by constructing data structures in shared memory, along with the need to program the necessary memory mapped control registers 25 of the GPU 20. This high latency is not generally considered an issue for normal graphics operations, which can be formed into sufficiently large batches that the high latencies involved are compensated for by the throughput benefit achieved by offloading that work from the CPU to the GPU.
However, there are other operations currently performed by the CPU that could potentially be performed efficiently by the GPU, but where the high latency involved in setting up the GPU to perform the operations makes it impractical to use the GPU. For example, it is common during the execution of graphics and gaming code on the CPU for relatively small pieces of code to be repeated multiple times within the inner loops of program code, examples being physics based animation, artificial intelligence code for path finding in 3D worlds, or determining visible objects for artificial intelligence constructs. The execution of such code is typically time critical. Whilst the operations or groups of operations defined by such code could in principle be accelerated by the use of the GPU, they tend to comprise relatively small code sections (in terms of the number of GPU operations that would be required once the code has been mapped to the GPU) and involve relatively small amounts of data (for example one or two matrices and a number of vectors). Typically, it is difficult to arrange for these operations to be performed in sufficiently large batches to overcome the latencies involved in writing out data structures to shared memory, having the GPU perform the necessary operations followed by the issuance of an interrupt, and then having the CPU respond to the interrupt in order to read the relevant results.
Such factors tend to prohibit the CPU from taking advantage of the GPU's processing capabilities for the above types of operations, particularly since the CPU is often unable in such instances to compensate for the high latency introduced by using the GPU (the CPU code following the offloaded operation, or group of operations, will typically be heavily dependent on the result of the offloaded operations).
However, for the types of graphics processing operations that the GPU is traditionally used for, it is observed that the available hardware resources of the GPU are not fully utilised all of the time, and hence the GPU is likely to have spare processing capacity.
Accordingly, it would be desirable to provide an improved technique for communication between the CPU and the GPU, which allows the GPU to continue to perform existing graphics processing operations, but also facilitates the offloading of other, less latency tolerant, operations to the GPU.