Trace-driven, cycle-based simulation is a popular methodology used for architectural performance evaluation of microprocessors and microprocessor-based systems. Such a methodology can estimate benchmark program performance years before actual hardware is available. It also enables designers to understand, analyze and selectively remove the performance bottlenecks in the target microarchitecture.
A problem with such simulation-based methodologies is that the input workload "trace" can be millions or even billions of instructions long, and the speed of simulating the full processor-memory sub-system is often low enough that it could take weeks (or even months) to evaluate a single design point for a single workload. Because a processor development environment needs analysis results across an entire suite of workloads on at least a weekly basis, the state of the art in trace-driven architectural simulation is quickly becoming inadequate for the demands of increasingly complex microprocessor designs.
In cycle-based simulation, each pipelined resource (e.g., functional units, cache/memory accesses, bus transactions, queues/buffers) is serviced during every simulated cycle of the target processor. A common term for such a software cycle-by-cycle simulation program is "timer". (Henceforth, in this document, the terms "timer" and "cycle-based processor simulation program" shall be used interchangeably.) Since the detailed instruction flow through all concurrent, pipelined resources is modeled in a timer, the simulation speed is inversely related to the complexity and detail of the model; this complexity is approximately proportional to the maximum number of outstanding instructions (and hence distinct pipeline and buffer stages) supported by the microprocessor system (as modeled in the timer) on a given machine cycle. Also, the total simulation cost (time) increases with the length of the trace. In other words, the total simulation cost C is the product of the average simulation rate S, measured in seconds/instruction, and the trace length, measured in instructions. If the trace is encoded in binary form to save disk space on the simulation host machine, the rate S may instead be measured in seconds/byte, with the trace length then expressed in bytes. The complexity of the model also depends on the amount of work (which can be measured in the number of host machine instructions executed) needed to process an instruction passing through a given pipeline or buffer stage. Not all stages require an equal amount of simulation work, so a simple count of the total number of stages is not enough. A weighted sum of the number of stages, where each weight gives a measure of the simulation work needed for that stage, would give a better measure of the worst-case model complexity.
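The two quantities described above, the total cost C as the product of S and the trace length, and the weighted sum of stages as a complexity measure, can be illustrated with a minimal Python sketch. The names `Stage`, `weighted_complexity`, and `simulation_cost`, as well as all numeric values, are hypothetical and chosen only for illustration; they are not taken from any particular timer.

```python
from dataclasses import dataclass


@dataclass
class Stage:
    """One kind of pipeline or buffer stage in a hypothetical timer model."""
    name: str
    count: int      # number of stages of this kind in the model
    weight: float   # assumed simulation work per stage (host instructions)


def weighted_complexity(stages):
    """Weighted sum of stage counts: sum over stage kinds of count * weight."""
    return sum(s.count * s.weight for s in stages)


def simulation_cost(rate, trace_length):
    """Total cost C = S * L.

    rate is S in seconds/instruction (or seconds/byte); trace_length is L
    in the matching unit (instructions or bytes). The result is in seconds.
    """
    return rate * trace_length


if __name__ == "__main__":
    # Illustrative, made-up model: stage kinds, counts, and weights are assumptions.
    model = [
        Stage("fetch_buffer", count=2, weight=10.0),
        Stage("execute", count=4, weight=25.0),
        Stage("cache_access", count=1, weight=50.0),
    ]
    print("plain stage count:", sum(s.count for s in model))
    print("weighted complexity:", weighted_complexity(model))
    # Assumed rate of 0.5 ms per instruction over a one-million-instruction trace.
    print("cost in seconds:", simulation_cost(0.0005, 1_000_000))
```

In this sketch the single cache-access stage contributes as much to the weighted complexity as five fetch-buffer stages would, which is the distinction a plain stage count misses.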
As such, there is a need in the art for reducing the simulation cost (time) for a given workload trace.