1. Field of the Invention
The present invention relates to out-of-order processors and more particularly to computing overhead for out-of-order processors.
2. Description of the Related Art
It is relatively straightforward to determine the execution time that an instruction spends in an in-order processor. A younger instruction is issued only after all older instructions have been issued and retired (i.e., completed). Sampling a Program Counter (PC) at a given interval provides a statistical estimate of the time spent on each instruction by comparing when the instruction completes execution (i.e., retires) against when the instruction started execution using the PC. For example, FIG. 1, labeled Prior Art, shows a sequence of three instructions. The first instruction takes 10 cycles to execute. The second instruction starts executing when the first instruction retires and takes 5 cycles to execute. The third instruction starts executing when the second instruction retires, takes 15 cycles to execute, and retires 30 cycles after the first instruction began. Thus, the first instruction uses 10/30 (33.3%) of the total execution time, the second instruction uses 5/30 (16.7%) of the total execution time and the third instruction uses 15/30 (50%) of the total execution time.
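The in-order arithmetic of the FIG. 1 example can be sketched as follows. This is a minimal illustration only; the function name `in_order_shares` and the list-of-durations input are assumptions made for the sketch and are not part of the described prior art.

```python
from fractions import Fraction


def in_order_shares(durations):
    """Return each instruction's share of total execution time.

    Assumes strictly in-order execution: each instruction starts only
    when the previous one retires, so total time is the sum of durations.
    """
    total = sum(durations)
    return [Fraction(d, total) for d in durations]


# Cycle counts from the FIG. 1 example: 10, 5, and 15 cycles.
shares = in_order_shares([10, 5, 15])
for index, share in enumerate(shares, start=1):
    print(f"instruction {index}: {share} of total execution time")
```

Because the instructions never overlap, the shares partition the total: they sum to exactly 1.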
However, determining execution time for an instruction when the processor is an out-of-order (OOO) processor is more difficult. When instructions are issued out-of-order, there is no guarantee that a younger instruction is issued only after all older instructions have been issued and retired. Also, multiple outstanding transactions to memory and parallel replays and rewinds make it difficult to compute the overhead in a program. For example, determining that a program has 12% of total clock cycles attributable to Level 2 (L2) cache misses does not provide much insight into what percentage of the total elapsed time is attributable to the L2 cache misses. Of the 12% of total clock cycles, it is possible that more than 6% of the total clock cycles are attributable to a single L2 cache miss.
FIG. 2, labeled Prior Art, shows an example of this issue. In the FIG. 2 example, the first instruction starts executing at clock cycle t and retires at clock cycle t+10. The second instruction starts executing at clock cycle t+2 and retires at clock cycle t+25. The third instruction starts executing at clock cycle t+2 and retires at clock cycle t+30. The three instructions therefore account for 10+23+28 = 61 execution cycles within an elapsed time of 30 cycles. Thus, the first instruction uses 10/30 (33.3%) of the elapsed time, but 10/61 (16.4%) of the total execution cycles. The second instruction uses 23/30 (76.7%) of the elapsed time, but 23/61 (37.7%) of the total execution cycles. The third instruction uses 28/30 (93.3%) of the elapsed time, but 28/61 (45.9%) of the total execution cycles. The percentage of total elapsed time is the overhead computation that is desirable to determine. However, this is the computation that is difficult to determine with OOO processors.
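The two ratios contrasted in the FIG. 2 example can be sketched as follows. This is an illustration under stated assumptions: the function name `ooo_shares` and the (start, retire) tuple representation are hypothetical, and cycle t is taken as 0.

```python
def ooo_shares(intervals):
    """Compute two ratios for each (start, retire) interval.

    Returns a list of (share_of_elapsed, share_of_cycles) pairs, where
    elapsed time runs from the earliest start to the latest retire and
    total execution cycles is the sum of all per-instruction durations.
    """
    elapsed = max(end for _, end in intervals) - min(start for start, _ in intervals)
    cycles = [end - start for start, end in intervals]
    total_cycles = sum(cycles)
    return [(c / elapsed, c / total_cycles) for c in cycles]


# (start, retire) cycles from the FIG. 2 example, with t = 0.
fig2 = [(0, 10), (2, 25), (2, 30)]
results = ooo_shares(fig2)
for index, (of_elapsed, of_cycles) in enumerate(results, start=1):
    print(f"instruction {index}: {of_elapsed:.1%} of elapsed, {of_cycles:.1%} of cycles")
```

Because the instructions overlap, the shares of elapsed time sum to 61/30, which is more than 200%, so they do not partition the elapsed time the way the in-order shares do.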