1. Field of the Invention
The present invention relates to data processing and in particular to mechanisms for generating trace data that captures operation of a data processing apparatus.
2. Description of the Prior Art
It is known to perform a diagnostic analysis of operation of a data processing device using trace data generated during execution of a sequence of program instructions. The complexity of modern data processing apparatuses such as microprocessors means that tracing and debugging operation of these data processing apparatuses is a complicated and time-consuming task. Many contemporary data processing apparatuses are configured as small-scale devices such as Systems-on-Chip (SoC). Fabrication of such small-scale devices is constrained because space on the integrated circuit itself limits the opportunities for adding monitoring components, and the pins on the periphery of a SoC are also at a premium. This constrains the amount of diagnostic data that can be exported from the SoC for external analysis.
Furthermore, the volume of trace data generated by a full instruction stream and data stream trace becomes prohibitive as the operating frequency of processor cores increases and as the use of multiple cores on a single device becomes more commonplace. For example, for existing ARM processor cores and ETM protocols, a bit rate of around 1.5 bits per instruction is output with instruction-only trace. Thus the volume of trace data generated can be very large, with a 1 gigahertz (GHz) processor generating around 1.5 gigabits per second of instruction trace data alone.
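The arithmetic behind the figure quoted above can be sketched as follows. This is an illustrative calculation only: the function name and the parameters instructions_per_cycle and num_cores are assumptions introduced here, and the quoted figure is taken to imply roughly one instruction per clock cycle on a single core.

```python
def instruction_trace_bit_rate(clock_hz, bits_per_instruction=1.5,
                               instructions_per_cycle=1.0, num_cores=1):
    """Approximate instruction-only trace bandwidth in bits per second.

    instructions_per_cycle and num_cores are illustrative assumptions;
    the text quotes only the 1.5 bits-per-instruction figure.
    """
    instructions_per_second = clock_hz * instructions_per_cycle * num_cores
    return instructions_per_second * bits_per_instruction


# A single 1 GHz core at one instruction per cycle.
single_core = instruction_trace_bit_rate(1e9)
print(f"{single_core / 1e9:.1f} Gbit/s")  # prints "1.5 Gbit/s"
```

Under the same assumptions the rate scales linearly with the number of cores, which is why multi-core devices aggravate the problem.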
It is known to provide an Embedded Trace Macrocell in order to trace and debug a data processing apparatus in real-time with the core operating at full speed, the trace incurring little performance overhead. Such Embedded Trace Macrocells can provide a cycle-accurate trace, generating a sequence of trace data items indicative of processing activities. In order to reduce the bandwidth of trace data that is transferred to a diagnostic apparatus, it is known to provide an Embedded Trace Macrocell that performs compression (i.e. reduction in volume) of trace data and outputs that data in highly compressed form to the diagnostic apparatus. The trace compression is performed by omitting any information that is redundant or can be deduced by the diagnostic apparatus. A decompressor provided in the diagnostic apparatus then reconstructs the full trace stream.
Thus some existing Embedded Trace Macrocells are configured to remove from the full state of the processor various pieces of data that can be inferred by a decompressor of a diagnostic apparatus. For example, the program counter is not transmitted upon execution of every instruction, since it can be assumed that instructions are generally processed sequentially. Furthermore, the program counter is not transmitted on direct branch instructions because the target of a direct branch instruction can generally be inferred by examining the program code in the decompressor of a diagnostic apparatus. However, other types of branch instructions, such as indirect branch instructions, do not specify the address of the next instruction to execute (as a direct branch instruction does); instead, an argument of the indirect branch instruction specifies where the next address is located. Thus, for indirect branches, the address of an instruction to branch to could be stored in a register specified by an opcode of the instruction or could be specified by the value of a memory location. Consequently, for indirect branches the address to be jumped to is not known until the instruction is actually executed. Such indirect branch instructions are typically associated with a higher than average volume of trace data. Thus although trace compression can be used to reduce the average number of bits used to trace an individual instruction, the nature of the instruction could mean that compression is not as easily achieved.
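The inference described above can be illustrated with a minimal sketch. It is not the ETM protocol: the program image PROGRAM, the fixed instruction size, the item names "start", "exec" and "addr", and the simplification that every direct branch is unconditional and taken are all assumptions made for illustration. The sketch shows only that an address needs to be traced after an indirect branch, while sequential flow and direct branch targets are recovered from the program image.

```python
INSTR_SIZE = 4  # assumed fixed instruction width

# Hypothetical program image: address -> (kind, direct target or None).
PROGRAM = {
    0x100: ("seq", None),
    0x104: ("direct", 0x200),   # target is visible in the program code
    0x200: ("seq", None),
    0x204: ("indirect", None),  # target is known only at execution time
    0x300: ("seq", None),
}

EXECUTED = [0x100, 0x104, 0x200, 0x204, 0x300]


def compress(executed_pcs):
    """Emit one item per executed instruction; trace an address only
    for the instruction that follows an indirect branch."""
    trace = [("start", executed_pcs[0])]
    for pc, next_pc in zip(executed_pcs, executed_pcs[1:]):
        kind, _ = PROGRAM[pc]
        if kind == "indirect":
            trace.append(("addr", next_pc))
        else:
            trace.append(("exec", None))
    return trace


def decompress(trace):
    """Rebuild the full program counter sequence from the trace items
    and the program image."""
    pcs = [trace[0][1]]
    for item, value in trace[1:]:
        pc = pcs[-1]
        kind, target = PROGRAM[pc]
        if item == "addr":
            pcs.append(value)            # address supplied by the trace
        elif kind == "direct":
            pcs.append(target)           # inferred from the program code
        else:
            pcs.append(pc + INSTR_SIZE)  # inferred sequential execution
    return pcs
```

In this sketch the five executed instructions produce a single "addr" item, for the target of the indirect branch at 0x204; every other program counter value is reconstructed by the decompressor.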
U.S. patent application Ser. No. 11/442,594 assigned to ARM Limited and issued on 6 Jul. 2010 describes a known system of making predictions in a trace data stream to reduce the trace protocol bandwidth. The system described therein employs a data store (the “return stack”) in the Embedded Trace Macrocell for predicting the return addresses of branch to subroutine instructions (known as branch-with-link instructions in the ARM architecture). The data store operates by pushing the return address of a branch to subroutine instruction onto a stack memory, which is basically a last-in first-out memory. In the event of an indirect branch such as a branch back from a subroutine, the top entry of the data store is compared with the actual branch target determined upon execution. If there is a match then the trace circuitry does not output a branch address to the diagnostic apparatus because a corresponding data store in the decompression circuitry of the diagnostic apparatus should be able to correctly predict the same return address. The diagnostic apparatus makes this prediction from an image of the program code executed by the data processing apparatus and from the diagnostic apparatus data store entries. Trace logic within the Embedded Trace Macrocell is arranged to monitor operation of the Central Processing Unit (CPU) to determine if the prediction made with regard to the branch target address is correct or not and to output either (i) a prediction correct indicator if the prediction is correct; or (ii) an indication of where the program is actually branched to if the prediction is incorrect.
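The return stack scheme described above can be sketched as follows. This is a simplified model under stated assumptions, not the implementation of the cited application: the function names, the event and item names, and the policy of leaving the stack unchanged on a misprediction are hypothetical. The argument call_return_addresses stands in for the return addresses that the diagnostic apparatus would derive from its image of the program code.

```python
def trace_side(events):
    """Model of the trace circuitry: push on branch-with-link, compare
    the top of the stack on an indirect branch."""
    stack, trace = [], []
    for kind, address in events:
        if kind == "call":                  # address is the return address
            stack.append(address)
            trace.append(("exec", None))
        elif kind == "indirect":            # address is the actual target
            if stack and stack[-1] == address:
                stack.pop()
                trace.append(("prediction_correct", None))
            else:
                # Assumed policy: the stack is left unchanged on a miss.
                trace.append(("address", address))
    return trace


def decoder_side(trace, call_return_addresses):
    """Model of the decompressor: mirror the stack and recover every
    indirect branch target."""
    stack, targets = [], []
    returns = iter(call_return_addresses)
    for item, value in trace:
        if item == "exec":
            stack.append(next(returns))
        elif item == "prediction_correct":
            targets.append(stack.pop())     # same prediction as the tracer
        else:
            targets.append(value)           # address carried in the trace
    return targets


events = [("call", 0x104), ("call", 0x208),
          ("indirect", 0x208), ("indirect", 0x104), ("indirect", 0x900)]
trace = trace_side(events)
recovered = decoder_side(trace, [0x104, 0x208])
```

Here the two subroutine returns are traced as "prediction_correct" items with no address, and only the final indirect branch, for which the stack is empty, carries its target address in the trace.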
In the case of return from subroutine instructions, the prediction of where the branch is likely to go will often be correct. Thus, provided the diagnostic apparatus that receives the trace data and decompresses the trace data makes an identical prediction, in many cases it should not be necessary for the ETM to output the branch destination information to the diagnostic apparatus, but simply to output an indication that the prediction is correct. This reduces the volume of trace data in relation to return from subroutine instructions. Only in the event that the prediction turns out to be incorrect should a higher volume of trace data be output. The use of prediction of return addresses for branch instructions can be particularly useful for indirect branch instructions where the branch return address cannot be determined from the program code alone.
However, although use of the data store and the prediction circuitry can be effective in reducing the volume of trace data output in relation to branch instructions, a problem can arise because this prediction system heavily relies on maintaining synchronism between the tracing hardware and the decompression circuitry of the diagnostic apparatus to ensure that the predicted addresses are synchronised at the two ends of the system. This poses a particular problem where the data processing apparatus is set up to perform speculative fetching and/or speculative execution of program instructions. Implementation of speculative execution is commonplace in modern data processors because of the opportunities the technique provides for faster operation, for example, by avoiding pipeline stages remaining idle for extended periods of time.
However, speculative instruction execution presents trace circuitry with a difficulty, because until speculation is resolved (i.e. until it is known whether or not a given instruction was actually committed by a CPU), the trace circuitry is unable to provide a stream of trace data that definitively indicates the actual operation of the data processing apparatus. One possibility is for the trace unit to buffer all of the trace data it generates until speculation is resolved, but this requires a prohibitively large buffer memory, particularly if the speculation depth of the processor is significant. An alternative technique is to generate the trace data speculatively along with the speculatively executed instructions and to subsequently cancel some items of trace data if it is found that the instructions to which that trace data corresponds were in fact mis-speculated. For example, the Nexus protocol (“The Nexus 5001 Forum Standard for a Global Embedded Processor Debug Interface”, IEEE-ISTO 5001-2003, 23 Dec. 2003) supports the cancelling of a specified number of trace data items.
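The cancelling technique can be illustrated with a generic sketch. It does not reproduce the Nexus message encoding: the item kinds "item" and "cancel" and the function name resolve_speculation are hypothetical, and a cancel is assumed to apply to the most recently generated items that have not already been cancelled.

```python
def resolve_speculation(stream):
    """Apply cancel items to a speculatively generated trace stream.

    stream is a list of ("item", payload) and ("cancel", count) tuples,
    where a cancel discards the `count` most recent surviving items.
    """
    resolved = []
    for kind, value in stream:
        if kind == "cancel":
            # Clamp so that an oversized count empties the list rather
            # than wrapping around to a negative index.
            del resolved[max(0, len(resolved) - value):]
        else:
            resolved.append(value)
    return resolved


stream = [("item", "A"), ("item", "B"), ("item", "C"),
          ("cancel", 2),            # B and C were mis-speculated
          ("item", "D")]
committed = resolve_speculation(stream)
```

With this stream the surviving items are "A" and "D". The trace unit needs no large buffer, because the burden of discarding mis-speculated items falls on the receiver of the trace stream.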
However, even if the data processing apparatus specifically indicates to the trace unit which instructions or groups of instructions should be cancelled, identifying the items of trace data that correspond to those cancelled instructions can be problematic. The situation can be exacerbated in systems comprising a data store to reduce trace data associated with branch instructions (as described in U.S. patent application Ser. No. 11/442,594) because the tracing circuitry will typically resolve speculation prior to analysing the data store. As a consequence, some speculatively executed instructions may result in data being added to or removed from the data store and when that speculated instruction is subsequently cancelled, the data store of the decompression circuitry and the data store of the tracing circuitry can easily become out of step.
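One way in which the two data stores can become out of step is sketched below. The scenario is hypothetical and chosen only for illustration: it assumes a tracer whose data store is updated when a speculative branch-with-link executes and is not rolled back on cancellation, and a decoder that applies cancellations before updating its own data store. The function and event names are likewise assumptions.

```python
def naive_tracer_stack(events):
    """Tracer-side data store that is updated speculatively and is not
    rolled back when instructions are cancelled (assumed behaviour)."""
    stack = []
    for kind, value in events:
        if kind == "call":
            stack.append(value)   # pushed even if later cancelled
    return stack


def decoder_stack(events):
    """Decoder-side data store built only from events that survive
    cancellation (assumed behaviour)."""
    committed = []
    for kind, value in events:
        if kind == "cancel":
            del committed[max(0, len(committed) - value):]
        else:
            committed.append((kind, value))
    return [value for kind, value in committed if kind == "call"]


events = [("call", 0x104),   # committed branch-with-link
          ("call", 0x208),   # speculative branch-with-link
          ("cancel", 1)]     # the speculative instruction is cancelled

tracer = naive_tracer_stack(events)    # [0x104, 0x208]
decoder = decoder_stack(events)        # [0x104]
out_of_step = tracer != decoder
```

On a subsequent return to 0x104 the tracer in this scenario would compare the target against 0x208 while the decoder would predict 0x104, so the two ends no longer make the same prediction.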
It is of course possible to avoid using a data store when tracing a processor capable of speculative execution, so that the ETM and the decompression circuitry cannot become out of step in this way. However, the data store can reduce the required trace bandwidth, so it is desirable to retain use of the data store even in systems that perform speculative fetching and/or speculative execution of instructions. At the same time, it is a requirement that the trace data output by ETMs incorporating a data store should reliably and accurately reflect the actual operation of the data processing system being traced.