The present disclosure relates generally to data processing and, more particularly, to reduction of delays in data processing.
Mainframe computer systems, such as IBM's zSeries computing systems, have evolved into extremely useful systems, in large part because of their adaptability to changing needs of enterprises. These systems are typically “pipelined”. That is, multiple instructions are being executed at different stages at the same time. Thus, once a first instruction is fetched and decoded, a second instruction is fetched and becomes part of the “pipeline”. When the first decoded instruction proceeds to an address generation stage at which point operands are fetched, the second instruction is decoded, and a third instruction is fetched. Thus, multiple instructions may be active at various stages of the pipeline at any time.
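The overlap of instructions described above can be illustrated with a toy model. The following Python sketch assumes an idealized pipeline containing only the three stages named in this section, with each stage taking exactly one cycle. The function name `pipeline_timeline` is hypothetical, and an actual zSeries pipeline has more stages than are modeled here.

```python
# Toy model only: three stages, one cycle per stage, no stalls (assumptions).
STAGES = ["fetch", "decode", "address generation"]

def pipeline_timeline(num_instructions, stages=STAGES):
    """Return one dict per cycle mapping stage name -> instruction number (1-based)."""
    cycles = []
    for cycle in range(num_instructions + len(stages) - 1):
        active = {}
        for position, stage in enumerate(stages):
            instr = cycle - position  # instruction currently occupying this stage
            if 0 <= instr < num_instructions:
                active[stage] = instr + 1
        cycles.append(active)
    return cycles

timeline = pipeline_timeline(3)
for cycle, active in enumerate(timeline):
    print(cycle, active)
```

In the third cycle of this model, the first instruction is in address generation while the second is being decoded and the third is being fetched, which matches the sequence described above.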
Index registers are used for modifying operand addresses during the run of a program. An index register is a register used primarily for indexing into an array.
The flow of instructions into a pipeline may stall for many reasons. One such stall is referred to as “address generation interlock” (AGI). This occurs when one instruction updates a register being used by another instruction.
For example, if a first instruction modifies a register that a second instruction needs to calculate the address of its operands, the second instruction may proceed to the address generation stage but end up being held until the first instruction updates the register that the second instruction needs. Only then may the second instruction complete its address generation and continue to progress in the pipeline.
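The dependency that gives rise to an AGI stall can be sketched as a simple check over an instruction sequence. The Python sketch below is illustrative only: the `Instr` record, the `find_agi_candidates` function, and the `window` parameter (how many preceding instructions are considered close enough to cause a stall) are all hypothetical, and the sketch does not model actual stall lengths on any particular processor.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str             # human-readable description of the instruction
    writes: frozenset     # registers the instruction updates
    addr_regs: frozenset  # registers read to generate an operand address

def find_agi_candidates(program, window=1):
    """Return instructions whose address registers are written by one of the
    `window` immediately preceding instructions (window is an assumption)."""
    flagged = []
    for i, instr in enumerate(program):
        recent = program[max(0, i - window):i]
        if any(instr.addr_regs & prev.writes for prev in recent):
            flagged.append(instr.text)
    return flagged

program = [
    Instr("first: add 8 to R1", frozenset({"R1"}), frozenset()),
    Instr("second: load R2 from address in R1", frozenset({"R2"}), frozenset({"R1"})),
    Instr("third: load R3 from address in R4", frozenset({"R3"}), frozenset({"R4"})),
]
agi = find_agi_candidates(program)
print(agi)
```

Here only the second instruction is flagged, because it needs R1 for address generation immediately after the first instruction updates R1.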
AGI has become a problem for sophisticated computing systems, such as IBM's zSeries system. Attempts have been made at solving the problem of AGI, such as induction variable analysis, loop striding, unrolling and instruction scheduling. While all these attempts have been helpful, significant AGI delays are still present in today's highly optimized code.
On the current zSeries architecture, the C/C++ SPECint and Java SPECjvm98/SPECjbb benchmarks spend a significant amount of time in address generation interlock (AGI) delays.
FIG. 1 illustrates an estimate of time spent in AGI delays for SPECint benchmarks running on IBM's z990 system. As shown in the chart in FIG. 1, the AGI delays on highly optimized CPU-intensive benchmarks range from 20% to over 40% of the processing time. Time spent on instruction performance and caching is also shown for purposes of comparison. These results were obtained through measurements of the IBM J9 Java Virtual Machine.
To understand the problem of AGI, it is useful to understand the structure of an array and how data is loaded from an array into a register. An array is formed of one or more memory units referred to as bytes. The “stride” of an array refers to the number of bytes between successive array elements.
To access an array, an index register needs to be shifted left by some number of bits (typically 1, 2, or 3) to account for the offset in the array due to the stride of the array, i.e., the number of bytes in an array element. For example, for an array with a stride of 4 bytes, an index register needs to be shifted left by 2 bits, which multiplies the index by 4.
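The relationship between stride and shift amount can be checked with a short calculation. The following Python sketch assumes the stride is a power of two, as in the 1-, 2-, and 3-bit cases mentioned above; the function names `shift_for_stride` and `byte_offset` are hypothetical.

```python
def shift_for_stride(stride: int) -> int:
    """Return the left-shift amount that scales an element index by `stride` bytes.
    Assumes a power-of-two stride; other strides cannot be handled by a shift alone."""
    if stride <= 0 or stride & (stride - 1):
        raise ValueError("stride must be a positive power of two")
    return stride.bit_length() - 1

def byte_offset(index: int, stride: int) -> int:
    """Byte offset of element `index`, computed with a shift instead of a multiply."""
    return index << shift_for_stride(stride)

print(shift_for_stride(4))   # shift amount for a 4-byte stride
print(byte_offset(5, 4))     # offset of element 5 in an array of 4-byte elements
```

For a stride of 4 bytes the shift amount is 2, so element 5 lies at byte offset 20, the same value as 5 * 4.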
As an example of code used for shifting an index register to access an array, consider the following zSeries code:
SLL Ri,2
L Ry,0(Ri,Rz)
In this example, the index register Ri is shifted left by 2 bits before an array element is loaded into register Ry. For details of zSeries code, the reader is directed to the zSeries Architecture Principles of Operation website, http://publibz.boulder.ibm.com/cgi-bin/bookmgr_OS390/BOOKS/DZ9ZR003/CCONTENTS?SHELF=EZ2ZO10E&DN=SA22-7832-03&DT=20040504121320/.
This shifting introduces a sizeable delay in cycles on a modern processor, such as IBM's z990 processor, because the load cannot generate its address until the shift has updated Ri. This problem is typically addressed in an optimizing compiler by “striding” the index register Ri, that is, working with some multiple of Ri throughout the code instead of Ri itself, which eliminates the need to shift. In the example above, Ri*4 would be used throughout the code instead of Ri, eliminating the shift. This cannot always be done, however, either because the underlying hardware does not have enough registers to prevent spilling of data into memory or because the striding optimization cannot be performed on a particular register.
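The effect of striding can be illustrated at the source level. The Python sketch below is only an analogy, since Python does not expose registers or pipeline behavior: it contrasts a loop that scales the index on every access with a loop that carries the pre-scaled value (the counterpart of Ri*4) directly. The function names, the 4-byte big-endian element layout, and the sample data are assumptions made for illustration.

```python
import struct

STRIDE = 4  # bytes per array element (assumed 4-byte integers)

def sum_with_shift(buf: bytes, count: int) -> int:
    """Scale the index on every iteration, as the shift-then-load pair does."""
    total = 0
    for i in range(count):
        offset = i << 2  # per-access shift that the address depends on
        total += struct.unpack_from(">i", buf, offset)[0]
    return total

def sum_with_strided_index(buf: bytes, count: int) -> int:
    """Carry the byte offset itself and advance it by the stride; no shift needed."""
    total = 0
    offset = 0
    for _ in range(count):
        total += struct.unpack_from(">i", buf, offset)[0]
        offset += STRIDE  # the index is kept pre-multiplied by the stride
    return total

data = struct.pack(">5i", 10, 20, 30, 40, 50)
print(sum_with_shift(data, 5), sum_with_strided_index(data, 5))
```

Both loops read the same elements and produce the same sum. The second form removes the per-access shift, but it requires the scaled value to be kept available throughout the loop, which is the register cost noted above.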
Thus, there is a need for an improved technique for reducing delays in compiler optimization.