The disclosure relates generally to methods and apparatus for regulating load imbalances among processing cores, such as processing cores within a Graphics Processing Unit (GPU), Central Processing Unit (CPU), or other processing cores. In general, processing workloads (e.g. one or more software threads needing to be executed) processed by processing devices, such as GPUs, may present load imbalance, whereby a first processing core may be busy executing assigned software threads (e.g. sequences of programmed instructions) while a second processing core may be idle. In such a situation, the overall processing power of the processing device(s) is not fully utilized, as the second processing core is not processing available work (e.g. software threads that may be waiting to be executed by the first processing core). Furthermore, software threads to be executed on the first processing core may need to wait for one or more other software threads that are or will be executing on the first processing core to complete before being executed on the first processing core. Accordingly, shorter running software threads may have to wait for longer running software threads to complete before being executed on the first processing core, instead of utilizing the computational resources and power available on the second processing core. Thus, to achieve higher levels of a processing device's performance, such as within a GPU, it is desirable to distribute processing work among various processing cores efficiently such that no processing core is idle (e.g. not executing instructions) while another core has a backlog of work to be processed.
Current solutions provide dynamic ways of rebalancing processing workloads to achieve higher levels of GPU performance. For example, the work donation method attempts to reduce workload imbalances among execution units, such as Single Instruction Multiple Data (SIMD) units, within a processing core (i.e. intra-core). The execution units typically execute in lockstep, and each is associated with one or more workgroups. A workgroup includes one or more wavefronts, whereby a wavefront is a collection of software threads that execute on the same execution unit in lockstep. A workgroup may include multiple wavefronts associated with different execution units, such that software threads from one wavefront execute on one execution unit, and software threads from another wavefront execute on another execution unit within the same processing core.
The work donation process allows a workgroup associated with a particular processing core to donate unprocessed workloads (e.g. a software thread waiting or needing to be executed) to another workgroup associated with the same processing core, such that unprocessed workloads may be transferred from one execution unit to another. For example, workgroups may hold unprocessed workloads in the form of workgroup queue elements (e.g. pointers to instructions and tasks needing to be executed) within workgroup queues. To donate workgroup queue elements from one workgroup queue to another, a donation queue may be used such that workgroup queue elements are donated from one workgroup queue to the donation queue. Another workgroup queue may then obtain those donated queue elements from the donation queue, resulting in a transfer of unprocessed workloads from one workgroup queue to another. As such, the work donation process attempts to alleviate load imbalances among the various execution units within a processing core using workgroup queues.
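The donation path described above can be sketched as follows. This is a minimal, single-threaded illustration only, not an implementation taken from any particular device. The names `QueueElement`, `Core`, `donate`, and `obtain` are hypothetical, and an integer stands in for a pointer to a task needing to be executed.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical queue element: stands in for a pointer to a pending task.
using QueueElement = int;

// Hypothetical model of one processing core: one queue per workgroup plus a
// single donation queue shared by the workgroups on that core.
struct Core {
    std::vector<std::deque<QueueElement>> workgroup_queues;
    std::deque<QueueElement> donation_queue;
};

// Donate up to `count` unprocessed elements from the back of workgroup
// `donor`'s queue into the core's donation queue. Returns the number donated.
std::size_t donate(Core& core, std::size_t donor, std::size_t count) {
    assert(donor < core.workgroup_queues.size());
    auto& queue = core.workgroup_queues[donor];
    std::size_t moved = 0;
    while (moved < count && !queue.empty()) {
        core.donation_queue.push_back(queue.back());
        queue.pop_back();
        ++moved;
    }
    return moved;
}

// Move up to `count` donated elements from the front of the donation queue
// into workgroup `receiver`'s queue. Returns the number obtained.
std::size_t obtain(Core& core, std::size_t receiver, std::size_t count) {
    assert(receiver < core.workgroup_queues.size());
    auto& queue = core.workgroup_queues[receiver];
    std::size_t moved = 0;
    while (moved < count && !core.donation_queue.empty()) {
        queue.push_back(core.donation_queue.front());
        core.donation_queue.pop_front();
        ++moved;
    }
    return moved;
}
```

In this sketch a busy workgroup calls `donate` and an idle workgroup on the same core later calls `obtain`, so unprocessed work moves between workgroup queues only by way of the donation queue. Synchronization among software threads is deliberately omitted here.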
To carry out the work donation process, operations such as reads and writes to memory may be required, for example, when implementing workgroup queues in memory. These operations may be performed within a processing core scope, i.e., where reads and writes to memory accessible to software threads executing on a particular core are synchronized among those software threads. For example, workgroup queues associated with a particular processing core may be available in memory, such as L1 cache memory, that is accessible to software threads executing on that processing core, but not to software threads executing on other processing cores. As an example, the L1 cache memory may store, or contain the latest state of, the workgroup queues associated with a particular processing core. As such, to maintain data integrity, reads and writes to the same area of L1 cache memory must be synchronized across all software threads executing on that processing core that may access the same data in the L1 cache memory.
A different process, work stealing, attempts to reduce workload imbalances among different processing cores of a particular device by allowing donation queue elements to be stolen from a donation queue associated with one processing core and placed in a donation queue associated with a different processing core. For example, a processing device may include multiple processing cores, where each processing core may have one or more donation queues associated with workgroups executing on that particular processing core (e.g. software threads executing on that processing core). The processing cores are able to submit queue elements to, and obtain queue elements from, their respective donation queues. As such, the work stealing mechanism allows for the transfer of unprocessed workloads from a donation queue associated with a workgroup executing on one processing core to a donation queue associated with a workgroup executing on a different processing core of the processing device. Thus, each processing core may steal (e.g. obtain) unprocessed workloads from another processing core.
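The transfer between donation queues of different processing cores can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions: the names `CoreDonationState`, `pick_victim`, and `steal` are hypothetical, each core is given exactly one donation queue, and both the victim selection policy (the core with the longest donation queue) and the amount stolen (half, rounded up) are arbitrary choices made for the example rather than anything the description above prescribes.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical queue element: stands in for a pointer to a pending task.
using QueueElement = int;

// Hypothetical per-core state: a single donation queue per processing core.
struct CoreDonationState {
    std::deque<QueueElement> donation_queue;
};

// Assumed policy: choose the core (other than `thief`) whose donation queue
// is longest. Returns cores.size() if no other core has donated work.
std::size_t pick_victim(const std::vector<CoreDonationState>& cores,
                        std::size_t thief) {
    std::size_t victim = cores.size();
    std::size_t best = 0;
    for (std::size_t i = 0; i < cores.size(); ++i) {
        if (i == thief) continue;
        if (cores[i].donation_queue.size() > best) {
            best = cores[i].donation_queue.size();
            victim = i;
        }
    }
    return victim;
}

// Steal half of the victim's donated elements (rounded up) and append them
// to the thief's donation queue. Returns the number of elements stolen.
std::size_t steal(std::vector<CoreDonationState>& cores, std::size_t thief) {
    assert(thief < cores.size());
    const std::size_t victim = pick_victim(cores, thief);
    if (victim == cores.size()) return 0;  // nothing available to steal
    auto& from = cores[victim].donation_queue;
    auto& to = cores[thief].donation_queue;
    const std::size_t count = (from.size() + 1) / 2;
    for (std::size_t i = 0; i < count; ++i) {
        to.push_back(from.front());
        from.pop_front();
    }
    return count;
}
```

An idle core would call `steal` with its own index. Workgroups on that core could then obtain the stolen elements from the local donation queue in the usual way.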
To carry out the work stealing process, operations such as reads and writes to memory may be required, for example, when implementing donation queues in memory. These operations are typically performed within a device scope, i.e., where reads and writes to memory accessible to various software threads executing on different cores on the same device are synchronized among those software threads. For example, donation queues associated with a particular processing core may be stored in memory, such as L2 cache memory, that is accessible to software threads executing on various processing cores of a processing device. As such, to maintain data integrity, reads and writes to the same area of L2 cache memory must be synchronized across all software threads executing on all cores of the processing device that may access the same data in the L2 cache memory.
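The synchronization requirement can be illustrated with a host-side analogy. Standard C++ has no notion of GPU memory scopes, so this sketch models a scope by which threads share a lock: a queue whose lock is shared by threads of every core stands in for device scope, while a queue whose lock is shared only by threads of one core would stand in for processing core scope. The names `SynchronizedQueue` and `steal_one` are hypothetical, and a mutex is used purely for simplicity; it is an assumption of the sketch, not a statement about how any device implements scoped operations.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical queue element: stands in for a pointer to a pending task.
using QueueElement = int;

// Hypothetical queue whose reads and writes are serialized by one lock.
// Every thread that can reach the queue must take the lock first, which
// models synchronizing accesses across all threads within a given scope.
class SynchronizedQueue {
public:
    void push(QueueElement element) {
        std::lock_guard<std::mutex> guard(lock_);
        elements_.push_back(element);
    }

    // Removes the oldest element into `out`; returns false if the queue
    // is empty, in which case `out` is left unchanged.
    bool try_pop(QueueElement& out) {
        std::lock_guard<std::mutex> guard(lock_);
        if (elements_.empty()) return false;
        out = elements_.front();
        elements_.pop_front();
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> guard(lock_);
        return elements_.size();
    }

private:
    mutable std::mutex lock_;
    std::deque<QueueElement> elements_;
};

// Moves one element from `victim` to `thief`. The two locks are taken one
// after the other, never both at once, so concurrent steals in opposite
// directions cannot deadlock.
bool steal_one(SynchronizedQueue& victim, SynchronizedQueue& thief) {
    assert(&victim != &thief);
    QueueElement element;
    if (!victim.try_pop(element)) return false;
    thief.push(element);
    return true;
}
```

The more threads that share a given lock, the more often they wait on one another, which is one way to picture why the number of software threads within a scope affects the cost of each synchronized access.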
These methods of rebalancing unprocessed workloads, however, suffer from inefficiencies that prevent optimal processing device performance. For example, in situations with high software thread counts, work donation systems suffer from high software thread contention for data stored in local data storage, such as data stored in L1 cache memory that is accessible only to software threads executing on a particular processing core. Work stealing may suffer similarly in situations with high software thread counts, and may also suffer from the overhead costs associated with stealing unprocessed workloads from software threads running on different cores (i.e. remote software threads). Thus, there is a need for improved ways of reducing load imbalances in processing devices.