Many applications spend a large percentage of time in small kernels of instructions that are executed in tight loops. Loop buffers may optimize loop execution in processors that have a limited instruction cache and strict power limitations. Specifically, the code kernels that are executed in tight loops are loaded into a loop buffer that is usually much smaller than the instruction cache. Hence, the energy and overhead of fetching instructions are reduced when the loop buffer is utilized. This makes loop buffers useful in power-efficient architectures that target applications whose loop kernels have been identified. However, loop buffers are typically very small, which limits their ability to hold the expanded code that results from loop unrolling.
For example, the tables below illustrate different approaches for evaluating a loop. Table 1 includes an example code snippet that converts all bits in input_integer into bytes in the array output_array.
TABLE 1
C-Code snippet

for (i = 31; i >= 0; i--) {
        output_array[i] = input_integer & (1 << i);
}
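Table 1 leaves the declarations implicit. A minimal self-contained sketch of the same loop might look as follows, assuming a 32-bit unsigned input and 32-bit output elements. The function name extract_bits and both types are assumptions made for illustration and do not appear in the original snippet. The shift is performed on an unsigned value because shifting a signed 1 into bit 31 is undefined behavior in C.

```c
#include <stdint.h>

/* Hypothetical helper (name and types are assumed, not from Table 1).
 * Stores input_integer & (1 << i) in output_array[i] for i = 31..0,
 * so each element holds either 0 or the isolated bit value. */
void extract_bits(uint32_t input_integer, uint32_t output_array[32])
{
    for (int i = 31; i >= 0; i--) {
        /* Unsigned shift: (1 << 31) on a signed int would be undefined. */
        output_array[i] = input_integer & ((uint32_t)1 << i);
    }
}
```

Calling extract_bits(0x80000001u, out) would leave out[31] equal to 0x80000000, out[0] equal to 1, and every other element equal to 0.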
Table 2 below illustrates an example assembly snippet, herein referred to as the “original assembly snippet.” This assembly snippet may be generated by compiling the code snippet in Table 1. No loop unrolling or loop buffer is used, resulting in extra instructions to track the number of iterations and to execute the conditional branch.
TABLE 2
Original assembly snippet

mov $r0, 31                        ## $r0 = 31
ld $r1, &input_integer             ## $r1 = input_integer
L1:
    shll $r2, 1, $r0               ## $r2 = 1 << i
    and $r3, $r1, $r2              ## $r3 = input_integer & (1 << i)
    str $r3, output_array[$r0]     ## output_array[i] = $r3
    sub $r0, $r0, 1                ## i--;
    jge $r0, 0, L1                 ## if i >= 0 iterate through loop again
Tables 3 and 4 below illustrate alternative assembly snippets generated from the same code snippet. Table 3 exploits loop unrolling resulting in code-size expansion, and Table 4 exploits a loop buffer. By expanding the code as illustrated in Table 3, the instructions for tracking the number of iterations in the loop and evaluating the conditional branch may be eliminated. However, such code may be too large to be stored in a loop buffer.
TABLE 3
Assembly snippet that exploits loop unrolling

ld $r1, &input_integer             ## $r1 = input_integer
and $r3, $r1, #0x80000000          ## $r3 = input_integer & (1 << 31)
str $r3, output_array[#31]         ## output_array[31] = $r3
and $r3, $r1, #0x40000000          ## $r3 = input_integer & (1 << 30)
str $r3, output_array[#30]         ## output_array[30] = $r3
and $r3, $r1, #0x20000000          ## $r3 = input_integer & (1 << 29)
str $r3, output_array[#29]         ## output_array[29] = $r3
:
:
:
and $r3, $r1, #0x00000001          ## $r3 = input_integer & (1 << 0)
str $r3, output_array[#0]          ## output_array[0] = $r3
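Table 3 unrolls the loop completely. The same trade-off can be illustrated at source level with a shorter sketch that unrolls by a factor of four instead. This is a partial unroll chosen here for brevity, not the full unroll of Table 3, and the function name and types are assumptions. The loop-control work (decrement and branch) is paid once per four stores rather than once per store, at the cost of a loop body roughly four times as large.

```c
#include <stdint.h>

/* Hypothetical 4-way partial unroll (assumed name and types).
 * i takes the values 31, 27, ..., 3, so the loop runs 8 times
 * and each pass handles bits i, i-1, i-2 and i-3. */
void extract_bits_unrolled4(uint32_t input_integer, uint32_t output_array[32])
{
    for (int i = 31; i >= 3; i -= 4) {
        output_array[i]     = input_integer & ((uint32_t)1 << i);
        output_array[i - 1] = input_integer & ((uint32_t)1 << (i - 1));
        output_array[i - 2] = input_integer & ((uint32_t)1 << (i - 2));
        output_array[i - 3] = input_integer & ((uint32_t)1 << (i - 3));
    }
}
```

Increasing the unroll factor removes more loop-control instructions but enlarges the body further, which is the code-size pressure that makes unrolled code a poor fit for a small loop buffer.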
In Table 4, $r0 is the LOOPCOUNT register and is auto-decremented by the loop buffer hardware. The loop is completed when $r0 reaches 0. Compared to loop unrolling, this example contains significantly fewer instructions, which facilitates storage in a loop buffer when space is limited.
TABLE 4
Assembly snippet that exploits loop buffer

mov $r0, 32                        ## $r0 = number of iterations = 32
ld $r1, &input_integer             ## $r1 = input_integer
mov $r5, 31                        ## $r5 = i = 31
loopstart $r0
    shll $r2, 1, $r5               ## $r2 = 1 << i
    and $r3, $r1, $r2              ## $r3 = input_integer & (1 << i)
    str $r3, output_array[$r5]     ## output_array[i] = $r3
    sub $r5, $r5, 1                ## i--;
loopend
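The behavior of the loop in Table 4 can be approximated by a small software model. The sketch below is an assumption-laden illustration and not a description of any particular hardware: the function name, the variable names, and the exact point at which the counter is decremented and tested are all hypothetical. It counts only the four instructions in the loop body, since the LOOPCOUNT decrement and the loop-back decision are performed by the loop buffer hardware rather than by fetched instructions.

```c
#include <stdint.h>

/* Hypothetical software model of the Table 4 loop (all names assumed).
 * Returns the number of loop-body instructions that would be executed. */
unsigned run_loop_buffer_model(uint32_t input_integer, uint32_t output_array[32])
{
    unsigned loopcount = 32;            /* models $r0, the LOOPCOUNT register */
    int i = 31;                         /* models $r5 */
    unsigned body_instructions = 0;

    while (loopcount != 0) {
        uint32_t mask = (uint32_t)1 << i;       /* shll $r2, 1, $r5 */
        uint32_t bit = input_integer & mask;    /* and $r3, $r1, $r2 */
        output_array[i] = bit;                  /* str $r3, output_array[$r5] */
        i--;                                    /* sub $r5, $r5, 1 */
        body_instructions += 4;

        /* Assumed to be done by the loop buffer hardware at loopend,
         * so it is not counted as an executed instruction. */
        loopcount--;
    }
    return body_instructions;
}
```

Under these assumptions the model executes 4 body instructions on each of 32 passes, 128 in total, with no explicit counter or branch instruction in the body.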
Table 5 illustrates the number of instructions executed in each representation of the code. The number of instructions executed may serve as a rough proxy for execution time, with fewer instructions implying faster execution. On that basis, the loop unrolling assembly was approximately 150% faster than the original assembly snippet (a speedup of 2.5), while the assembly utilizing the loop buffer was approximately 25% faster than the original (a speedup of 1.25).
TABLE 5
Comparison of assembly snippets

                                                        # of instructions    # of instructions     Speedup as compared
                                                        executed in 1        executed in 32        to original
                                                        iteration of loop    iterations of loop    assembly snippet
Table 2: Original assembly snippet                      5                    160                   1
Table 3: Assembly snippet that exploits loop unrolling  2                    64                    2.5
Table 4: Assembly snippet that exploits loop buffer     4                    128                   1.25
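The speedup figures in Table 5 follow from dividing the original snippet's total instruction count by that of each alternative. The short sketch below reproduces that arithmetic. The function name speedup is an assumption, and, like the table, it ignores the few setup instructions outside the loop and treats every instruction as equally costly.

```c
/* Hypothetical helper (assumed name): ratio of dynamic instruction counts.
 * variant_per_iter and iterations must be nonzero. */
double speedup(unsigned original_per_iter, unsigned variant_per_iter,
               unsigned iterations)
{
    unsigned original_total = original_per_iter * iterations;  /* e.g. 5 * 32 = 160 */
    unsigned variant_total = variant_per_iter * iterations;    /* e.g. 2 * 32 = 64 */
    return (double)original_total / (double)variant_total;
}
```

With the per-iteration counts from Table 5, speedup(5, 2, 32) gives 160 / 64 = 2.5 for loop unrolling and speedup(5, 4, 32) gives 160 / 128 = 1.25 for the loop buffer.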
As is apparent from Table 5, loop unrolling may achieve a greater speedup than using a loop buffer. However, the loop unrolling assembly snippet does not take advantage of the reduced energy and overhead of fetching instructions provided by the assembly snippet that exploits the loop buffer. Although loop unrolling may be used in conjunction with the loop buffer, the code expansion inherent in loop unrolling may overflow the loop buffer, reducing or negating the benefit of a loop buffer.
The approaches described in this section are approaches that could be pursued, but not necessarily approaches that have been previously conceived or pursued. Therefore, unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section.