Early microprocessors worked on just a single instruction at once, so an instruction would be fetched from memory, decoded, executed with an Arithmetic Logic Unit, and the results stored back to registers or memory, before the whole process was then repeated for another instruction.
To allow increases in processor clock speed, modern microprocessors use a feature called a pipeline, in which many instructions proceed through the various stages of fetch, decode, execute and write at the same time. This allows each stage to involve simpler (and therefore faster) logic, and allows stages that involve longer operations to contain multiple sets of logic so that multiple instructions can be at that stage at the same time. It is not uncommon for modern processors to have pipelines that are ten or more stages deep.
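The overlap described above can be illustrated with a minimal sketch. It assumes an idealised four-stage pipeline in which every stage takes exactly one clock cycle and no stalls occur; the stage names and the functions `pipeline_schedule` and `sequential_cycles` are hypothetical and chosen only for illustration.

```python
# Idealised model: one instruction enters the pipeline per cycle,
# every stage takes one cycle, and there are no stalls or branches.
STAGES = ["fetch", "decode", "execute", "write"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return a list with one entry per clock cycle.

    Each entry maps a stage name to the index of the instruction
    occupying that stage during that cycle.
    """
    if num_instructions <= 0:
        return []
    depth = len(stages)
    total_cycles = depth + num_instructions - 1
    schedule = []
    for cycle in range(total_cycles):
        occupancy = {}
        for stage_index, stage_name in enumerate(stages):
            # Instruction i enters stage s at cycle i + s.
            instruction = cycle - stage_index
            if 0 <= instruction < num_instructions:
                occupancy[stage_name] = instruction
        schedule.append(occupancy)
    return schedule


def sequential_cycles(num_instructions, stages=STAGES):
    """Cycles needed when each instruction finishes before the next starts."""
    return max(num_instructions, 0) * len(stages)
```

Under these assumptions, five instructions occupy the pipeline for eight cycles rather than the twenty needed when each instruction is processed to completion before the next begins. Real pipelines also gain from a shorter clock period per stage, which this cycle count does not model.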
Another feature of modern microprocessors is a trace unit. A trace unit reports the microprocessor's progress through the stream of instructions and writes the data to memory and/or exports it to external equipment via a suitable peripheral interface on the processor. In some cases the information reported by the trace unit is exported immediately; in other cases it is exported after a delay due to buffering within the microprocessor or in the microprocessor's external memory.
Deep multi-stage pipelines yield performance benefits where there is a stream of instructions, one after another, with no branches or loops within the logic of the program. However, where there is a branch, it becomes very difficult to ensure that the pipeline remains usefully occupied, and there are cases where a large amount of work-in-progress within the pipeline has to be discarded.
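The cost of discarded work can be sketched with a simplified model. It assumes one instruction is issued per cycle, that a branch outcome becomes known at a fixed stage, and that every instruction fetched behind a wrongly followed branch is discarded; the function `total_cycles` and its parameters are hypothetical.

```python
def total_cycles(num_instructions, depth, flushes=0, resolve_stage=None):
    """Estimate cycles to run a program on an idealised pipeline.

    depth         -- number of pipeline stages
    flushes       -- number of branches that force a pipeline flush
    resolve_stage -- 1-based stage at which a branch outcome is known
                     (assumed to be the last stage if not given)
    """
    if num_instructions <= 0:
        return 0
    if resolve_stage is None:
        resolve_stage = depth
    if not 1 <= resolve_stage <= depth:
        raise ValueError("resolve_stage must lie between 1 and depth")
    # When the branch reaches resolve_stage, one younger instruction
    # sits in each earlier stage; all of them are discarded.
    discarded_per_flush = resolve_stage - 1
    return depth + num_instructions - 1 + flushes * discarded_per_flush
```

In this model, a ten-stage pipeline running 100 instructions needs 109 cycles with no flushes, but 179 cycles if ten branches each resolve at the eighth stage and force a flush. The deeper the pipeline and the later the branch is resolved, the more work-in-progress is lost per flush.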
Many solutions to this problem have been applied in the past, including complex branch prediction logic, speculative processing of instruction streams along both possible paths after the branch, and many others, often with multiple approaches being used within the same microprocessor.
For example, one simple solution for the case where a short string of instructions (including the case of just a single instruction) is either executed or skipped depending on some pre-existing state within the microprocessor is the use of conditional instructions. These instructions have an additional field to indicate the conditions under which they should, or should not, be executed. However, this condition, or these conditions, may not have been set to the final condition test result at the time at which the instruction enters the microprocessor's pipeline. Therefore, conditional instructions are allowed to proceed down the pipeline as normal, including the execution stage, but just before the results of the instruction (including any new internal microprocessor state) are stored, a check is done for the condition or conditions under which the instruction should be executed, and only if these conditions are met are the results written to the final destination. This final storage of the results is often called “completion” and an instruction that has proceeded through this stage is said to be “complete” or “completed”. For cases where the string of instructions that do not complete is shorter than the depth of the pipeline, the overall throughput of the processor will increase.
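The completion check described above can be sketched as follows. This is an illustrative model only: it processes instructions one at a time rather than overlapping them, and the `Instr` fields, the flag name "Z" and the function `run` are hypothetical. What it does preserve is the ordering described above, in which the execute step always runs and the condition is tested only immediately before the result is stored.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Instr:
    name: str
    execute: Callable[[Dict[str, int]], int]  # computes a result from registers
    dest: Optional[str] = None        # register written at completion, if any
    cond: Optional[str] = None        # flag that must be set for completion
    sets_flag: Optional[str] = None   # flag set to (result == 0) at completion


def run(program: List[Instr], regs: Dict[str, int],
        flags: Dict[str, bool]) -> List[str]:
    """Run the program and return the names of instructions that completed."""
    completed = []
    for instr in program:
        # Execution stage: always performed, conditional or not.
        result = instr.execute(regs)
        # Completion stage: test the condition just before storing results.
        if instr.cond is not None and not flags.get(instr.cond, False):
            continue  # result is discarded; no state is changed
        if instr.dest is not None:
            regs[instr.dest] = result
        if instr.sets_flag is not None:
            flags[instr.sets_flag] = (result == 0)
        completed.append(instr.name)
    return completed


# Hypothetical three-instruction program: compare, then two
# instructions that complete only if the compared values were equal.
PROGRAM = [
    Instr("cmp", lambda r: r["r0"] - r["r1"], sets_flag="Z"),
    Instr("addeq", lambda r: r["r0"] + r["r1"], dest="r2", cond="Z"),
    Instr("moveq", lambda r: 1, dest="r3", cond="Z"),
]
```

With `r0` and `r1` equal, all three instructions complete; with unequal values, only the compare completes and `r2` and `r3` are left unchanged, even though the results of the two conditional instructions were computed.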
When conditional instructions are used at the same time as deep pipelines and trace units, there are a couple of standard, alternative approaches to reporting the instructions, neither of which is entirely satisfactory. The conditional instruction can be always reported, which will reduce the accuracy of the data and potentially confuse the user. Alternatively, reporting of the instruction can be suppressed entirely if it does not complete. Which approach is used often depends on where within the pipeline the trace unit is connected. If it is connected early within the pipeline, all instructions will be reported, whether completed or not. If it is connected at the final completion stage, it is easy to report only completed instructions, but sometimes a lot of information regarding the instruction has been discarded at this stage, which makes the information reported by the trace unit less useful.
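The two reporting approaches can be contrasted with a short sketch. It assumes, purely for illustration, that a trace unit connected early in the pipeline sees the address, mnemonic and operands of every instruction, while one connected at the completion stage sees only the address of each completed instruction; the record type `InFlight` and both functions are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class InFlight:
    address: int
    mnemonic: str
    operands: str
    completed: bool  # known only once the instruction reaches completion


def trace_early_tap(stream: List[InFlight]) -> List[Tuple[int, str, str]]:
    """Report every instruction in full, whether or not it later completes."""
    return [(i.address, i.mnemonic, i.operands) for i in stream]


def trace_completion_tap(stream: List[InFlight]) -> List[int]:
    """Report only completed instructions, with reduced detail.

    The reduction to a bare address is an assumption standing in for
    information that has been discarded by the completion stage.
    """
    return [i.address for i in stream if i.completed]


# Hypothetical stream in which one conditional instruction does not complete.
STREAM = [
    InFlight(0x100, "cmp", "r0, r1", True),
    InFlight(0x104, "addeq", "r2, r0, r1", False),
    InFlight(0x108, "mov", "r3, #1", True),
]
```

For this stream, the early tap reports all three instructions, including the one at 0x104 that never completed, while the completion tap reports only 0x100 and 0x108 and carries no mnemonic or operand detail.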
Similarly, difficulties arise in analysing trace data when programs are interrupted or when branch instructions are speculatively fetched but not executed. There is therefore a need to provide trace data that is straightforward to analyse, with the flexibility to include or discard parts of the trace data as circumstances require.