1. Field of the Invention
The present invention relates in general to computer processing and more specifically to a system and method for software optimization of loop regions in software code.
2. Description of the Prior Art
Modern compilers perform many optimizations related to loops, which are repeatedly executed regions in a source program. Commonly known optimizations such as loop unrolling and software pipelining may be included in a loop optimization system. For example, for Explicitly Parallel Instruction Computing (EPIC) architecture processors, software pipelining is a particularly important loop optimization.
Software pipelining is a well known optimization technique typically applied to loops. Software pipelining extracts potential parallelism from adjacent iterations of the loop. Unlike loop unrolling, software pipelining does not make a copy of several adjacent iterations of an original loop to achieve more parallel code. Rather, an iteration is broken into several pipeline stages, S, which are combined into a parallel kernel code. Thus, the kernel contains only one set of operations from the original loop iteration. By executing the kernel once, S adjacent iterations are concurrently advanced, each in a different stage.
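The staging described above can be illustrated with a minimal sketch, not taken from the patent: the stage numbering and the helper function below are hypothetical, chosen only to show how one kernel execution concurrently advances S adjacent iterations.

```python
def stages_active(k, S, N):
    """Iterations in flight during kernel execution k (0-based).

    Iteration i is in stage (k - i) if 0 <= k - i < S; with S stages,
    at most S adjacent iterations are active at once, each in a
    different stage. (Illustrative model, not the patent's notation.)
    """
    return {i: k - i for i in range(N) if 0 <= k - i < S}

# During kernel execution k=3 with S=4 stages and N=10 iterations,
# iterations 0..3 are concurrently active, each in a different stage:
print(stages_active(3, S=4, N=10))  # {0: 3, 1: 2, 2: 1, 3: 0}
```

Counting the kernel executions for which at least one iteration is active reproduces the (N+S−1) kernel executions needed to complete N iterations.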
An Initiation Interval (IN) of pipeline stages may be expressed in target architecture clock cycles. While executing a pipelined loop kernel, during every IN clock cycles, a process starts a new iteration (i), advances iterations (i−1) through (i−S+2), and finalizes iteration (i−S+1). In order to execute one iteration of the initial loop, S stages, or S*IN clock cycles, are needed. In order to execute two iterations, S+1 stages, or (S+1)*IN clock cycles, are needed, and so on. In general, the execution time of a pipelined loop is equal to (N+S−1)*IN clock cycles, where N is the repetition count of the original loop. When the repetition count is large, most of the time is consumed by the N*IN term, but if the repetition count is small and the loop is frequently visited, then the (S−1)*IN term becomes significant.
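The execution-time formula above can be checked with a short sketch (the function name is illustrative, not from the patent):

```python
def pipelined_time(N, S, IN):
    """Execution time in clock cycles of a pipelined loop with
    repetition count N, S pipeline stages, and initiation interval IN,
    per the formula (N + S - 1) * IN."""
    return (N + S - 1) * IN

# One original iteration takes S stages, i.e. S*IN clock cycles:
assert pipelined_time(1, S=4, IN=2) == 4 * 2
# Two iterations take S+1 stages, i.e. (S+1)*IN clock cycles:
assert pipelined_time(2, S=4, IN=2) == 5 * 2
```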
FIG. 1 illustrates an example loop schedule 100 for a source code 102. As shown, source code 102 includes three loop operations: a memory read (Ld), an addition (Add), and a memory write (St) operation. A processor, such as an EPIC architecture processor, with one memory channel (MEM) and one arithmetic logic unit (ALU) channel may perform the loop. During each clock cycle, it is assumed the processor is able to perform two parallel operations: one memory access operation and one arithmetical operation. As shown in table 104, latencies for the operations are as follows: Ld, five clock cycles; Add, two clock cycles; and St, one clock cycle. Accordingly, without pipelining, each iteration of the loop requires eight clock cycles: five clock cycles for the load operation, two clock cycles for the add operation, and one clock cycle for the store operation. Schedule 100 and diagram 108 illustrate the operations and latencies of the loop. Thus, the full execution time is T1=8*N, where N is the loop repetition count.
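The non-pipelined execution time can be sketched as follows, using the latencies from table 104 (the dictionary and function names are illustrative, not from the patent):

```python
# Operation latencies from table 104, in clock cycles.
LATENCIES = {"Ld": 5, "Add": 2, "St": 1}

def sequential_time(N):
    """Without pipelining, each iteration serially executes
    Ld, Add, and St, taking 5 + 2 + 1 = 8 clock cycles,
    so T1 = 8 * N for N iterations."""
    return N * sum(LATENCIES.values())

assert sequential_time(1) == 8
```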
FIG. 2 illustrates a typical loop software pipelining optimization of source code 102. For discussion purposes, it is assumed the same resources used in FIG. 1 are used in FIG. 2. Using the loop software pipelining optimization method, schedule 200 and diagram 202 are produced. As shown in table 204, the pipeline includes S1=4 stages and an initiation interval of 2 clock cycles. Accordingly, the execution time is T2=(N+S1−1)*IN=(N+3)*2, as described hereinafter.
For the loop, the initiation interval (IN) is 2 clock cycles because the loop includes two memory access operations and there is only one memory channel in the processor. Accordingly, diagram 202 illustrates a load operation at clock cycles 0, 2, 4, and 6. Additionally, an Add operation at clock cycle 5, a store operation at clock cycle 7, etc. are shown. The loop kernel occupies clock cycles 6 and 7, where store, add, and load operations are performed. The loop kernel includes 4 pipeline stages, so S1 equals 4. Specifically, the kernel includes the store operation performed in clock cycle 7 of iteration 1, the add operation performed in clock cycle 7 of iteration 2, and the load operation performed in clock cycle 6 of iteration 4. As discussed above, to perform N iterations of the original loop, the kernel is executed (N+S−1) times. In terms of clock cycles, the execution time, T2, is equal to (N+S1−1)*IN. With S1 equal to 4, T2=(N+4−1)*2=(N+3)*2. Therefore, where N=1, the execution time T2 is equal to the execution time T1 (eight clock cycles). However, for all N>1, the execution time T2 is less than the execution time T1.
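The comparison of T1 and T2 can be verified numerically with a minimal sketch (function names are illustrative, not from the patent):

```python
def T1(N):
    """Non-pipelined execution time (FIG. 1): 8 cycles per iteration."""
    return 8 * N

def T2(N):
    """Pipelined execution time (FIG. 2): (N + S1 - 1) * IN
    with S1 = 4 stages and IN = 2 clock cycles."""
    return (N + 3) * 2

assert T1(1) == T2(1) == 8                          # equal at N = 1
assert all(T2(N) < T1(N) for N in range(2, 1000))   # faster for N > 1
```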