1. Field of the Invention
The present invention relates generally to the data processing field and, more particularly, to a two-level representative workload phase detection method, apparatus, and computer usable program code.
2. Description of the Related Art
Modern software workloads, such as those in the SPEC2006 benchmark suite, can have dynamic instruction pathlengths of many trillions of instructions for a single dataset. A “dataset” is the input data consumed by the program. For example, fully executing h264ref, from the SPEC2006 benchmark suite, with its third input dataset takes 3.2 trillion dynamic instructions. Indeed, most programs execute more than one trillion dynamic instructions.
Trace-driven simulators are used to simulate the behavior of a particular processor unit design. A processor unit includes one or more processors along with one or more caches, such as L1 and L2 caches. In order to assess design changes and project workload performance for processor units that are being designed, the simulator would ideally be used to execute the entire workload. However, this is not feasible.
These simulators execute on the order of 10,000 instructions per second on modern machines. Therefore, for a program with 1 trillion dynamic instructions, simulation would take on the order of 3.1 years to complete. Because warmed-up caches and processor state affect performance, the simulation would have to be done serially on a single processor unit to correctly represent the performance of the processor unit. If some accuracy is sacrificed, the instruction sequence can be split across multiple processors, but millions of instructions must still be executed to warm up each processor before performance results are collected on its subset of instructions.
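The estimate above can be checked with a short calculation. The following is a minimal sketch; the function name and the 365.25-day year are assumptions made for illustration, while the simulation rate and instruction counts are the figures quoted above.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60  # assumed average year length


def simulation_years(dynamic_instructions, instructions_per_second=10_000):
    """Return the serial simulation time, in years, for a workload."""
    return dynamic_instructions / instructions_per_second / SECONDS_PER_YEAR


# One trillion dynamic instructions at 10,000 instructions per second.
print(f"{simulation_years(1e12):.2f} years")    # about 3.17 years

# h264ref with its third input dataset (3.2 trillion dynamic instructions).
print(f"{simulation_years(3.2e12):.2f} years")  # about 10.14 years
```

At the same rate, the 3.2 trillion instruction h264ref run would take roughly ten years, which illustrates why full-workload simulation is not feasible.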
To reduce the number of instructions executed, the workload, i.e., the dynamic instruction sequence, can be sampled at periodic intervals, and the sampled instructions concatenated into a trace. The trace, rather than the entire dynamic instruction sequence, is then fed into the trace-driven simulator in order to assess a particular processor unit design. Generating a trace automatically incorporates the machine effects of a particular input dataset.
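Periodic sampling of this kind can be sketched as follows. This is an illustrative example only: the function name `sample_trace` and its parameters `period` and `sample_length` are hypothetical, and the workload is modeled as a simple list rather than a real dynamic instruction stream.

```python
def sample_trace(instructions, period, sample_length):
    """Concatenate the first `sample_length` instructions of each
    `period`-long window of the workload into a single trace."""
    trace = []
    for start in range(0, len(instructions), period):
        trace.extend(instructions[start:start + sample_length])
    return trace


# Toy workload of 20 "instructions", sampled 3 at a time every 10.
workload = list(range(20))
trace = sample_trace(workload, period=10, sample_length=3)
print(trace)  # [0, 1, 2, 10, 11, 12]
```

In this toy case the trace holds 6 of the 20 instructions, so the simulator would process less than a third of the original sequence.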
Dynamic instructions in a workload often exhibit phases of execution, i.e., repetitive sequences of instructions, that correlate strongly with the basic blocks being executed by a program.
A promising, recently proposed, automated clustering-based method for trace sample selection is known as “SimPoint.” SimPoint is an example of clustering software that takes a workload and some user-defined parameters as input and generates a “clustering” that includes a plurality of clusters. The resulting clustering is considered to be representative of the entire workload within an error tolerance. The trace of instructions that best represents the clusters can then be executed rapidly by a trace-driven simulator in order to assess a particular processor design.
The clustering software works by clustering, or grouping, intervals of the workload based on the code profile of each interval, which is represented by a basic block vector (BBV), to produce a plurality of clusters. The BBV consists of the frequencies with which basic blocks appear in the interval, weighted by the number of instructions in each basic block. By clustering intervals by BBV, the clustering software aims to sort intervals by their code profiles into “phases,” where each “cluster” represents a phase of program execution. A phase is an ideal, i.e., it is the perfect cluster that exactly represents a true phase of the program execution. A cluster is often an imperfect representation of a phase, because a phase can be split between two or more clusters if the clustering software is not allowed to run long enough, or work hard enough, to determine that the intervals all belong to one cluster.
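The BBV construction and interval clustering described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual implementation of any clustering software: a plain k-means seeded from the first k intervals stands in for the clustering step, the BBVs are normalized to sum to one, and all function names, block sizes, and intervals are hypothetical.

```python
from collections import Counter


def basic_block_vector(interval, block_sizes):
    """Build a normalized BBV: execution count of each basic block in the
    interval, weighted by the number of instructions in that block."""
    counts = Counter(interval)
    weighted = [counts.get(b, 0) * size for b, size in enumerate(block_sizes)]
    total = sum(weighted)
    return [w / total for w in weighted] if total else weighted


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def kmeans(vectors, k, iterations=20):
    """Plain k-means; seeding from the first k vectors is an assumption."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign each interval to its nearest centroid.
        labels = [min(range(k), key=lambda c: squared_distance(v, centroids[c]))
                  for v in vectors]
        # Move each centroid to the mean of its member intervals.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def representatives(vectors, labels, centroids):
    """For each cluster, pick the interval whose BBV is closest to the centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: squared_distance(vectors[i], centroid))
    return reps


# Hypothetical program with three basic blocks of 4, 2, and 6 instructions.
block_sizes = [4, 2, 6]
# Each interval lists the basic block ids executed, in order.
intervals = [[0, 0, 1], [2, 2, 2], [0, 0, 0, 1], [2, 2, 1]]

bbvs = [basic_block_vector(iv, block_sizes) for iv in intervals]
labels, centroids = kmeans(bbvs, k=2)
print(labels)  # [0, 1, 0, 1]
print(representatives(bbvs, labels, centroids))
```

Intervals 0 and 2 are dominated by block 0 and fall into one cluster, while intervals 1 and 3 are dominated by block 2 and fall into the other. Note that the vectors depend only on which code executed, not on the data values or memory access patterns of the input dataset.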
Simply using the clusters of basic blocks does not, however, take into account the effects on the performance of a particular processor unit design when a program and input dataset are executed on that design. The BBVs do not take into account the input dataset values and data footprint. The data footprint of a program includes the actual instruction and data access patterns in cache memory and main memory.
The prior art does not incorporate the machine-specific characteristics of how a particular processor unit design executes particular input datasets into the process of generating a clustering. The resulting clustering therefore may not best represent the performance of a particular processor unit and system design when that design is executing a particular workload. As a result, the clustering often does not capture the dataset characteristics of the workload on either the processors or the caches found in a particular processor unit design. This reduces the accuracy of simulation studies used to assess design changes and to project performance.