Many proposals for dealing with memory latency hide it through context or fiber switching. Memory latency may be especially problematic in processing-intensive environments such as computer graphics, and may become more significant still when software rendering is employed rather than specialized graphics hardware.
In context or fiber switching schemes, a context or fiber that initiates a long-latency read is idled, and another context runs until the data arrives. However, in code where the ratio of reads to processing is relatively high, this may result in a large number of idled contexts. Since each context requires support for its own register set, the cost of supporting such a scheme is high. Streaming processing, on the other hand, can often be broken into stages that do not feed back to each other. In other words, the code looks as follows:
for (many iterations) {
    fetch ‘a’
    do processing with ‘a’
    fetch ‘b’
    do processing with ‘b’
    fetch ‘c’
    do processing with ‘c’
    etc.
}
In order to optimize the code execution, it is possible to unroll such a loop. Loop unrolling is a technique that attempts to increase a program's execution speed (typically at the expense of its size) by reducing (or eliminating) the “end of loop” tests that would otherwise be performed on every iteration. If the performance gained by eliminating the “end of loop” tests outweighs the performance lost to the increased size of the program, a net performance gain may be realized. Furthermore, additional gains may be realized through parallelism if the statements in the loop are independent of one another.
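As an illustration of the unrolling described above, the following minimal C sketch compares a rolled summation loop with a version unrolled by a factor of four. The function names, the choice of summation as the workload, and the factor of four are assumptions made only for this example; they do not come from any particular implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Rolled loop: one "end of loop" test is performed per element. */
long sum_rolled(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        total += data[i];
    }
    return total;
}

/* Unrolled by four (an arbitrary factor chosen for illustration):
 * one "end of loop" test is performed per four elements. The four
 * accumulators do not depend on one another, so the additions are
 * free to overlap. */
long sum_unrolled4(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long t0 = 0, t1 = 0, t2 = 0, t3 = 0;
    size_t i = 0;

    /* Main body: consumes complete groups of four elements. */
    for (; i + 4 <= n; i += 4) {
        t0 += data[i];
        t1 += data[i + 1];
        t2 += data[i + 2];
        t3 += data[i + 3];
    }

    /* Remainder loop: handles the n % 4 leftover elements. */
    for (; i < n; i++) {
        t0 += data[i];
    }
    return t0 + t1 + t2 + t3;
}
```

Both functions return the same result for any input. The unrolled version needs extra temporaries and a separate remainder loop, which reflects the register and code-size costs associated with unrolling.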
However, loop unrolling is associated with several disadvantages. The overhead of loop unrolling may include increased register usage to store temporary variables and increased code size (which may increase the number of cache misses).
In general, the level of unrolling needs to be high enough to account for the high end of the latency. Unfortunately, increasing the level of unrolling increases code complexity, increases the minimum number of iterations that can be executed efficiently, wastes more performance at the start and end of a group of processing, and wastes performance when the actual latency is lower than that accounted for by the unrolling.
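The relationship between latency and unrolling level can be sketched with a simple model. The model assumes that each iteration performs a fixed number of cycles of useful work and that a fetch must be issued enough iterations ahead for that work to cover the worst-case latency. The helper name and the cycle counts are hypothetical, and real hardware will not follow such a model exactly.

```c
#include <assert.h>

/* Hypothetical helper: under the simple model above, returns how many
 * iterations' worth of fetches must be in flight so that the worst-case
 * latency is covered by useful work. */
unsigned unroll_depth(unsigned max_latency_cycles,
                      unsigned work_cycles_per_iteration)
{
    assert(work_cycles_per_iteration > 0);

    /* Ceiling division: a partially covered iteration still needs a slot. */
    unsigned depth = (max_latency_cycles + work_cycles_per_iteration - 1)
                     / work_cycles_per_iteration;

    /* At least one iteration is always present. */
    return depth > 0 ? depth : 1;
}
```

Under these assumptions, a 400-cycle worst-case latency with 50 cycles of work per iteration gives a depth of 8. A group of fewer than 8 iterations would then never reach the fully overlapped steady state, and a fetch that actually completes in 100 cycles would still be scheduled as if it took 400.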
In addition, many combinations of unrolled loops are required for different combinations of processing. For example, suppose there are 4 possible state settings associated with ‘a’, 3 state settings associated with ‘b’ and 5 possible state settings associated with ‘c’. Then the number of combinations of processing ‘a’, ‘b’ and ‘c’ is 4 × 3 × 5 = 60. This means that 60 different pieces of code could be required to efficiently unroll the processing associated with any set of state settings. Moreover, all processing for ‘a’, ‘b’ and ‘c’ still resides within one thread, limiting scheduling and parallelism opportunities.
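The count above is a product of the per-stage setting counts, which the following short C sketch reproduces. For contrast only, it also computes the sum of the setting counts, which is the number of code pieces that would be needed under the hypothetical assumption that each stage could be coded independently of the others. Both function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Number of distinct unrolled loops when every combination of stage
 * settings needs its own piece of code: the product of the counts. */
unsigned unrolled_variants(const unsigned *settings, size_t stages)
{
    assert(settings != NULL || stages == 0);
    unsigned total = 1;
    for (size_t i = 0; i < stages; i++) {
        total *= settings[i];
    }
    return total;
}

/* Hypothetical comparison: number of code pieces if each stage could be
 * written separately, one piece per setting: the sum of the counts. */
unsigned per_stage_variants(const unsigned *settings, size_t stages)
{
    assert(settings != NULL || stages == 0);
    unsigned total = 0;
    for (size_t i = 0; i < stages; i++) {
        total += settings[i];
    }
    return total;
}
```

For the setting counts 4, 3 and 5 used in the example, `unrolled_variants` returns 60 and `per_stage_variants` returns 12.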
Another approach to alleviating memory latencies is to utilize a parallel processing architecture in a multiple processor (“MP”) environment. However, the communication between processors in an MP environment presents significant efficiency and overhead challenges. For example, to realize the potential of parallel computing, computational tasks must be distributed and coordinated between multiple processors. If two processors respectively read from and write to a shared memory area, read and write pointer information must be communicated between the two processors to prevent buffer overflow or underflow. Communication of this type of information requires additional communication and processing overhead for each processor involved in a parallel computation. Although significant computation gains may be achieved via parallelism, these gains are often mitigated by the processing overhead required to allow communication between multiple processors.
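The read and write pointer bookkeeping described above can be sketched as a small ring buffer in C. This is a single-threaded illustration only: the type name, function names and capacity are assumptions, and an actual multiprocessor version would additionally need atomic accesses or memory barriers, which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 4u  /* illustrative size; a power of two */

typedef struct {
    int    slots[RING_CAPACITY];
    size_t write_index;  /* advanced only by the producer */
    size_t read_index;   /* advanced only by the consumer */
} ring_t;

/* Number of entries written but not yet read. */
static size_t ring_count(const ring_t *r)
{
    return r->write_index - r->read_index;
}

/* Producer side: must know the read index to avoid overflow. */
bool ring_push(ring_t *r, int value)
{
    assert(r != NULL);
    if (ring_count(r) == RING_CAPACITY) {
        return false;  /* full: writing now would overwrite unread data */
    }
    r->slots[r->write_index % RING_CAPACITY] = value;
    r->write_index++;
    return true;
}

/* Consumer side: must know the write index to avoid underflow. */
bool ring_pop(ring_t *r, int *out)
{
    assert(r != NULL && out != NULL);
    if (ring_count(r) == 0) {
        return false;  /* empty: there is nothing valid to read */
    }
    *out = r->slots[r->read_index % RING_CAPACITY];
    r->read_index++;
    return true;
}
```

Every push consults the read index and every pop consults the write index. When the producer and consumer run on different processors, each of these checks implies communicating pointer information between them, which is the overhead noted above.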
In traditional scenarios, the transfer of data from a producer to a consumer is demand based. Known approaches for handling communications between processors require a significant amount of abstraction and software processing overhead. For example, the communication between processors may be handled by semaphores, in which case the entirety of flow control must be handled by software instructions.
It is known that separate processing stages are utilized in GPUs (“Graphics Processing Units”) with FIFO (“First In First Out”) data structures in between them. The typical stages are command processing, vertex fetch, vertex processing, pixel processing and frame buffer fragment processing. However, these stages have been defined by the hardware architecture of the GPU. This means that the processing stages and FIFOs are fixed. In some variations, processing within GPUs is done in fewer stages, referred to as shaders, which do more complex processing. In these cases GPUs use a large number of threads to overcome read latencies incurred when shaders read from memory. Other processing systems have similar issues. In addition, some processing systems force programmers to combine the code for many stages into a single block of code to eliminate communications between processors.
Thus, a flexible approach for addressing memory latencies is needed, one that is suitable for use in a processing-intensive environment such as computer graphics and that does not suffer from the limitations of a single-thread loop unrolling approach or of FIFOs tied directly to the hardware structure.