Images displayed on a computer or electronic device are often produced with an image processing accelerator that handles the extremely large computational load typically needed to provide 2D and 3D images efficiently. This load may include memory-intensive texture mapping, rendering of polygons, transformation of vertices into different coordinate systems, programmable shader support, and so forth, as well as video decoding processes such as motion compensation, inverse discrete cosine transform, and so on. Thus, these accelerators may handle millions of vertices or pixels in a frame. The accelerator may have, for example, a graphics processing unit (GPU) to perform the calculations. These systems, however, may still have a relatively high latency, where latency is the time from when a processor requests data from a memory until the data is received by the processor. During this latency, processor components may sit idle while waiting for data. Thus, unless the latency is compensated for, delays may occur while both the processor and the memory bandwidth are under-used, which degrades the performance of the device.
In order to reduce or hide the latency, GPUs may use many (tens to hundreds of) parallel threads performing the same tasks to handle the high number of calculations so that computations for many pixels can be performed simultaneously. A thread is a sequence of instructions that is executed independently of other sequences. For example, a processor that supports multi-threading may time-slice the execution of different threads on the same hardware. With this arrangement, when one thread is stalled due to a memory load, it may suspend operation, and a context switch may be triggered so that the processor works on another thread while waiting for the data for the first thread. This may hide the latency from a programmer since the arithmetic logic unit (ALU) may have less down time. Such a system, however, is often arranged with hardware to store thread states and includes registers assigned to each thread. Thus, where a GPU needs many ALUs, each with many threads, to hide latency, such a configuration may consume too much power or may be cost prohibitive for use in smaller devices such as smartphones or tablets.
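The latency hiding described above can be illustrated with a toy cycle-level model. The following is only a minimal sketch under simplifying assumptions that are not taken from the text: a single ALU, a context switch that costs zero cycles, a fixed memory latency, and threads that each alternate between a fixed number of compute cycles and one memory load. The function name `alu_utilization` and its parameters are hypothetical and chosen for illustration only.

```python
def alu_utilization(num_threads, compute_cycles, memory_latency,
                    total_cycles=10_000):
    """Fraction of cycles in which a single ALU does useful work.

    Toy model (all assumptions): each thread computes for
    `compute_cycles` cycles, then issues a load and stalls for
    `memory_latency` cycles. Switching to another thread is free.
    """
    ready_at = [0] * num_threads            # cycle at which each thread may run
    remaining = [compute_cycles] * num_threads  # compute cycles left before next load
    busy = 0                                 # cycles in which the ALU did work
    current = None                           # thread currently on the ALU

    for cycle in range(total_cycles):
        # Context switch: if no thread is running, or the running thread
        # is stalled, pick the first thread whose data has arrived.
        if current is None or ready_at[current] > cycle:
            current = next(
                (t for t in range(num_threads) if ready_at[t] <= cycle),
                None,
            )
        if current is None:
            continue                         # every thread is stalled: ALU idles

        busy += 1
        remaining[current] -= 1
        if remaining[current] == 0:
            # The thread issues a memory load and stalls until data returns.
            ready_at[current] = cycle + 1 + memory_latency
            remaining[current] = compute_cycles
            current = None

    return busy / total_cycles


if __name__ == "__main__":
    for n in (1, 2, 5, 8):
        print(n, "threads ->", alu_utilization(n, compute_cycles=4,
                                               memory_latency=16))
```

In this model, with 4 compute cycles per load and a 16-cycle latency, one thread keeps the ALU busy only 20% of the time, while five threads (roughly 1 + latency / compute cycles) keep it fully busy. Each of those resident threads needs its own stored state and registers, which is the hardware cost the paragraph above points to.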