1. Field of the Invention
The present invention relates generally to the data processing field and, more particularly, to a computer implemented method, system and computer usable program code for simulating processor operation in a data processing system.
2. Description of the Related Art
Trace sampling is motivated by the need for an instruction trace that is short enough to complete on a cycle-accurate processor model in a reasonable period of time, which is often not possible if a full trace is run. For example, many programs used for benchmarking and performance projections have pathlengths of hundreds of billions to trillions of instructions or more. Given that a cycle-accurate processor model may run at about 10K instructions per second, a 1T instruction trace would require over three years to complete. Accordingly, it is necessary to use trace samples (or, for execution-driven simulators, checkpoints) for cycle-accurate simulation.
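The time estimate above follows from simple arithmetic; a minimal check, using the illustrative figures from the text (not measurements of any particular simulator), is:

```python
# Back-of-envelope check of the figures above (illustrative values).
SIM_RATE_IPS = 10_000             # cycle-accurate model: ~10K instructions/second
TRACE_LEN = 1_000_000_000_000     # 1T-instruction trace

seconds = TRACE_LEN / SIM_RATE_IPS
years = seconds / (365 * 24 * 3600)
print(f"{years:.2f} years")       # roughly 3.17 years, i.e. over three years
```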
Using trace samples in lieu of a full trace for a simulation, however, presents the problem of representativeness: the trace samples may not have the same performance characteristics as the full trace. If a trace sample is not representative within a small margin of error (in practice, about five percent is the maximum tolerable error, and one percent is a more ideal limit), the trace samples will not be useful for making performance predictions and design decisions. Therefore, an effective mechanism is needed for determining which parts of a full trace should be used as trace samples.
Considerable effort has been directed to creating and improving mechanisms for measuring the representativeness of trace samples and for taking representative trace samples. One known metric for measuring the representativeness of trace samples is called the “R-Metric”, which has been used to measure the representativeness of trace samples taken at uniform intervals. A limitation of the R-Metric, however, is that although it measures the representativeness of a given trace sample, it provides no mechanism for finding the most representative trace sample out of the set of all possible trace samples; comparing the R-Metrics of all possible samples is infeasible because that set is intractably large. Accordingly, a trial-and-error approach is required: a sample is taken and its R-Metric is measured, and if the R-Metric exceeds a user-determined maximum, another sample is taken and the process is repeated until a sample with a below-maximum R-Metric is found, which may not happen in a reasonable amount of time, or ever.
Two newer sampling mechanisms, called “SMARTS” and “TurboSMARTS”, are based on statistical sampling: thousands of small periodic samples of a trace are taken and then either run serially, with the simulator switched to a faster functional mode for the non-sampled instructions in order to warm the machine state (SMARTS), or run in parallel with checkpoints that provide warmed machine states (TurboSMARTS).
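The periodic-sampling idea can be sketched as follows; the function name and parameter values here are illustrative, not the published SMARTS configuration:

```python
def periodic_samples(trace_len, num_samples, window):
    """Return (start, end) instruction-index pairs for evenly spaced
    sample windows over a trace of trace_len instructions."""
    period = trace_len // num_samples
    return [(i * period, i * period + window) for i in range(num_samples)]

# e.g. 10,000 windows of 1,000 instructions over a 1B-instruction trace
windows = periodic_samples(trace_len=1_000_000_000,
                           num_samples=10_000, window=1_000)
```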
The SMARTS and TurboSMARTS mechanisms result in very representative samples (0.64 percent average CPI error on a known system), but have a disadvantage in speed.
In particular, SMARTS requires running all of the non-sampled instructions through a functional simulator to warm machine state, which could require weeks of simulation for runs of longer benchmarks. TurboSMARTS allows a much faster simulation by breaking each sample into thousands of small pieces and running them in parallel; however, the checkpoint files must still be created once for each performance binary, which may require weeks per benchmark. These approaches may be satisfactory for research purposes. In a production/development environment, however, where very tight schedules must be adhered to and where compiler tuning occurs in parallel with hardware development, producing new benchmark binaries every week, a trace sampling mechanism that allows a new trace sample to be created in a matter of days is preferred.
A promising automated clustering-based method for trace sample selection that has recently been proposed is known as “SimPoint.” SimPoint works by clustering, or grouping, intervals of a trace based on the code profile of each interval, which is represented by a basic block vector (BBV) for each interval. By clustering intervals by BBV, SimPoint aims to sort intervals by their code profiles into phases, where each cluster represents one phase of execution. The underlying assumption is that there is a strong correlation between the code executed during an interval and the performance characteristics of that interval. The trace sample generated by SimPoint comprises one interval from each cluster, the goal being that these intervals will represent all the different phases of execution and thus, if each interval's CPI is weighted by the size of its cluster, constitute a representative trace sample for a simulation.
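A minimal sketch of this clustering-and-weighting idea follows. It uses a toy k-means over BBVs with deterministic farthest-point initialization; it is not the actual SimPoint implementation, and all names are illustrative:

```python
def dist2(a, b):
    """Squared Euclidean distance between two basic block vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def simpoint_sketch(bbvs, cpis, k, iters=20):
    """Cluster intervals by BBV, pick the interval nearest each centroid
    as that cluster's representative, and estimate whole-trace CPI as
    the cluster-size-weighted mean of the representatives' CPIs."""
    # deterministic farthest-point initialization of k centroids
    centroids = [list(bbvs[0])]
    while len(centroids) < k:
        far = max(bbvs, key=lambda v: min(dist2(v, c) for c in centroids))
        centroids.append(list(far))
    # standard k-means iterations: assign, then recompute centroids
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: dist2(v, centroids[c]))
                  for v in bbvs]
        for c in range(k):
            members = [bbvs[i] for i, a in enumerate(assign) if a == c]
            if members:
                centroids[c] = mean(members)
    # one representative interval per cluster, weighted by cluster size
    reps, weights = [], []
    for c in range(k):
        members = [i for i, a in enumerate(assign) if a == c]
        if not members:
            continue
        reps.append(min(members, key=lambda i: dist2(bbvs[i], centroids[c])))
        weights.append(len(members))
    estimated_cpi = sum(cpis[r] * w for r, w in zip(reps, weights)) / sum(weights)
    return reps, estimated_cpi
```

For example, given four intervals of one phase (CPI 1.0) and six of another (CPI 2.0), the weighted estimate recovers the full-trace CPI of 1.6.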
SimPoint-style methods for trace sample selection typically break a dynamic instruction trace into intervals of uniform length and select the fewest intervals that together exhibit performance similar to that of the full trace. SimPoint has been shown to produce trace samples with a reasonably small error in CPI compared to full traces, for example, about three percent.
A three percent error, however, is not insignificant, and there is room for improvement in SimPoint-style trace sample selection methods. A central disadvantage of SimPoint-style methods is that a trace is divided into intervals along arbitrary boundaries, which may or may not correspond to boundaries between actual phases of execution. A phase of execution is a segment of a dynamic instruction trace that exhibits unique and stable performance characteristics (principally CPI, but also cache miss rates, branch misprediction rates, etc.). A change in dynamic performance characteristics marks the end of one phase and the beginning of another, i.e., a “phase boundary.” If SimPoint interval boundaries are not aligned with phase boundaries, phases will be divided among multiple intervals and mixed with instructions from other phases, eroding or distorting the differences between intervals as seen by a clustering algorithm. Thus, with fixed-length intervals, a clustering algorithm may not produce clusters that correspond to phases, which can result in trace samples that are less than optimally representative.
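The misalignment problem can be made concrete. In the hypothetical two-phase trace below, a phase boundary at instruction 150 falls in the middle of a fixed 100-instruction interval, so the middle interval's basic block vector blends the two phases:

```python
# 'A' and 'B' stand for basic blocks of two hypothetical phases.
trace = ["A"] * 150 + ["B"] * 150    # phase boundary at instruction 150
INTERVAL = 100                       # fixed interval length

bbvs = []
for start in range(0, len(trace), INTERVAL):
    window = trace[start:start + INTERVAL]
    bbvs.append({blk: window.count(blk) for blk in ("A", "B")})

# bbvs[0] is pure phase A and bbvs[2] pure phase B, but bbvs[1]
# counts 50 of each: its vector lies between the two phases and
# blurs the distinction seen by a clustering algorithm.
```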
To address the problem of unaligned interval boundaries, a variable-length interval version of SimPoint has been proposed that automatically creates intervals of variable length, using a method derived from language processing algorithms to determine interval lengths based on patterns of loop, call, and return events. However, because this method determines variable-length interval boundaries solely from events in the instruction stream, it, like SimPoint with fixed-length intervals, assumes a strong correlation between code profile and phase behavior.
Also, with this variable-length interval version of SimPoint, interval boundaries can occur only on calls, on returns, and at the beginning or end of a loop; phase boundaries, however, may occur between these events. Further, the method may achieve an average CPI error of about two percent, but only if the total sample is over 4B instructions in length. For a modeling environment in which each workload must be represented as a single 100M-instruction serial trace, which is the case in some processor modeling environments, a 4B-instruction trace is intractable because it takes too long to simulate.
There is, accordingly, a need for a mechanism for selecting highly representative trace samples in a clustering-based trace sample selection mechanism used for selecting trace samples for simulating processor operation in a data processing system. 