The present invention relates in general to data processing systems, and in particular, to program execution tracing within a superscalar processor.
The present invention addresses the need to acquire a real-time trace of program execution from a high performance superscalar microprocessor. Typically, users wish to obtain a "trace," or listing, of exactly which instructions execute during each clock cycle for a limited period of time during the execution of a program in order to debug or analyze the performance of the program. A "real-time" trace is one that can be acquired while the program runs at normal speed, in the actual system environment, and can be triggered by some system event recognized by the trace acquisition system. Note that since any buffer used to acquire a trace will have a finite number of entries that will likely be much smaller than the number of clocks consumed in the execution of the program, the trace acquisition system must be able to selectively retain only the information for the clock cycles of interest, i.e., those just before and just after the "trigger" event ("TE"). Further, the system must provide a means for synchronizing the TE with the contents of the trace buffer so that the user can tell exactly what instructions were executing during the clock cycle in which the TE occurred. A "non-invasive" trace is one that can be acquired without disturbing the timing behavior of the program from its behavior while not being traced.
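The retention of only the clock cycles surrounding the trigger event can be illustrated with a minimal software sketch of a fixed-size circular buffer. The sketch is offered for illustration only and is not the trace hardware described herein; the class name `TriggeredTraceBuffer` and the parameters `capacity` and `post_trigger` are assumptions introduced for this example.

```python
from collections import deque


class TriggeredTraceBuffer:
    """Illustrative circular trace buffer (hypothetical, not the disclosed hardware).

    Keeps the most recent `capacity` per-cycle entries. Once a trigger event
    is recorded, it accepts `post_trigger` further entries and then freezes,
    so the buffer holds the cycles just before and just after the trigger.
    """

    def __init__(self, capacity, post_trigger):
        if not 0 <= post_trigger < capacity:
            raise ValueError("post_trigger must be in [0, capacity)")
        self.entries = deque(maxlen=capacity)  # oldest entries fall off the front
        self.post_trigger = post_trigger
        self.remaining = None       # None until the trigger event is seen
        self.trigger_cycle = None   # cycle number used to synchronize the TE

    def record(self, cycle, info, trigger=False):
        """Record one clock cycle; return False once the buffer is frozen."""
        if self.remaining == 0:
            return False
        self.entries.append((cycle, info))
        if self.remaining is not None:
            self.remaining -= 1     # counting down the post-trigger window
        elif trigger:
            self.trigger_cycle = cycle
            self.remaining = self.post_trigger
        return True
```

Because the trigger cycle number is stored alongside the retained entries, a user of this sketch can locate the entry recorded in the cycle in which the TE occurred.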
A difficulty in acquiring a trace from a highly integrated processor stems from the invisibility of most of the signals required to derive the trace. A typical approach to deriving an instruction trace requires one to determine the location of an instruction being executed on a particular clock cycle (i.e., at the start of the trace), and then to determine for subsequent clock cycles how many instructions are executed, whether they are taken or not if they are branches, and the target addresses for the taken branches.
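The typical derivation described above can be sketched in software as follows. This is a simplified illustration under stated assumptions, not the reconstruction method of any particular trace tool: instructions are assumed to have a fixed length of four bytes, the per-cycle record format `(count, outcomes)` is hypothetical, and `is_branch` stands in for a lookup into the program image.

```python
INSTR_BYTES = 4  # assumption: fixed-length instructions


def reconstruct_trace(start_pc, is_branch, cycles):
    """Rebuild the per-cycle list of executed instruction addresses.

    start_pc  -- address of the instruction executing at the start of the trace
    is_branch -- callable(address) -> bool, derived from the program image
    cycles    -- iterable of (count, outcomes); `count` is the number of
                 instructions executed that cycle and `outcomes` holds one
                 (taken, target) pair per executed branch, in program order
    """
    pc = start_pc
    trace = []
    for count, outcomes in cycles:
        outcomes = iter(outcomes)
        executed = []
        for _ in range(count):
            executed.append(pc)
            if is_branch(pc):
                taken, target = next(outcomes)
                pc = target if taken else pc + INSTR_BYTES
            else:
                pc += INSTR_BYTES
        trace.append(executed)
    return trace


# Hypothetical program image: branches at 0x108 and 0x204.
branches = {0x108, 0x204}
example = reconstruct_trace(
    0x100,
    lambda addr: addr in branches,
    [(2, []), (2, [(True, 0x200)]), (1, [(False, None)])],
)
```

In this example the first cycle executes 0x100 and 0x104, the second executes the taken branch at 0x108 and its target 0x200, and the third executes the not-taken branch at 0x204.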
Because the processor has an integrated instruction cache, the instruction address bus is not accessible externally and hence, each instruction fetch cannot normally be seen. Also, the signals that indicate the number of instructions executed each cycle and the direction taken by conditional branches are not usually available externally to the integrated circuit ("IC"). Therefore, some information must normally be exported from the microprocessor in order to acquire the trace. This information should appear on the external pins of the IC, either on pins that are already used for other purposes such as external data and address buses, or on pins dedicated to the tracing function.
Multiplexing trace data onto existing pins has two potential problems. If the trace runs all the time, it will contend for system resources (e.g., bus bandwidth), degrading performance to support a feature that is only used during software debug operations. If the trace data is switched on only when acquiring a trace, it may affect the timing of the program by delaying the processor's normal access to the shared pins, and thus will be intrusive. Dedicated pins can alleviate this problem; however, to maintain low cost of the IC, the pin count must be kept as low as possible.
U.S. patent application Ser. No. 08/760,553, which is hereby incorporated by reference herein, disclosed a set of hardware additions made to a microprocessor to provide a non-intrusive, real-time trace capability with low additional costs to the processor. However, that trace solution was operable for low- to mid-performance, single-issue microprocessors running at frequencies below 100 MHz, such that the external pin requirements were minimal. In contrast, high-performance, superscalar microprocessors present new challenges for design and innovation. These processors run at aggressive frequencies (over 400 MHz) and have the ability to complete multiple instructions in a given cycle. This results in several related problems. External trace probes (or logic analyzers) have difficulty collecting data at the higher frequencies, so trace information must be broadcast at a fraction of the processor frequency. In order to maintain data bandwidth at this reduced frequency, the number of trace pins must be increased. In addition, the completion of multiple instructions in a given CPU (central processing unit) cycle increases the data bandwidth requirements, further increasing the number of pins required to maintain that bandwidth. Pins come at a high cost, as many ASICs (application specific integrated circuits) that incorporate cores will be I/O (input/output) constrained. That is, there will not be enough pins on the periphery of the chip to support internal logic. Although customers want real-time trace capabilities, there is significant pressure to reduce the I/O requirements for the trace function, since it is primarily used for debugging code during development and is not used by the end application. This need to acquire a real-time trace of program execution from a high-performance, superscalar microprocessor presents special problems due to increases in operating frequency and data volume.
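The relationship between trace bandwidth, broadcast frequency, and pin count can be made concrete with a short calculation. The sketch below is illustrative only: it assumes one bit per pin per trace clock and no compression of the trace data, and the function name and the example figures (bits per cycle, port frequency) are hypothetical rather than taken from any particular processor.

```python
def trace_pins_required(bits_per_cpu_cycle, cpu_mhz, trace_mhz):
    """Pins needed to sustain the trace bandwidth at a reduced port frequency.

    Assumes each pin carries one bit per trace clock and the trace data is
    uncompressed. Uses integer ceiling division to avoid rounding error.
    """
    total = bits_per_cpu_cycle * cpu_mhz   # Mbit/s produced by the core
    return -(-total // trace_mhz)          # ceil(total / trace_mhz)


# Hypothetical single-issue core: 2 bits per cycle, port at the core frequency.
single_issue = trace_pins_required(2, 100, 100)

# Hypothetical superscalar core: 4 bits per cycle at 400 MHz, with the
# trace port running at one quarter of the core frequency.
superscalar = trace_pins_required(4, 400, 100)
```

Under these assumed figures the single-issue case needs 2 pins while the superscalar case needs 16, which illustrates how the higher completion rate and the reduced broadcast frequency together multiply the pin requirement.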
The present invention addresses the foregoing need by providing a novel combination of features, which allow a high-performance superscalar microprocessor to provide real-time trace-forward and trace-back capability with a minimum number of pins running at a minimal frequency relative to the processor frequency. The present invention provides for the gathering into trace buffers of information on indirect branch targets, interrupt vectors, periodic synchronizing event information, fence and trigger event codes, and instruction (including branches) and interrupt completion information. The present invention then encodes and broadcasts the aforementioned information using a minimum number of pins and at a minimal frequency to enable reconstruction of the real-time execution path by external trace software. The present invention further limits or prevents the occurrence of certain instruction processing combinations over a given range of CPU cycles, such occurrences including the number of completing branches, the number of interrupts, and the occurrence of an interrupt with a certain number of completing instructions.
The foregoing has outlined rather broadly the features and technical advantages of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.