The field of the invention is electronic circuits and processes for tracing the operations of high-speed computers and microprocessors, and devices and systems using them. Proper operation of both hardware and software is vital for the functionality, reliability and desired performance of systems on which users depend. Moreover, tracing or profiling system performance and events in real time plays a key role in hardware testing and software development, upon which successful product operation and timely introduction of new products depend for business, government, and the consuming public.
High-performance computers have processors that execute instructions like operations on a factory assembly line, wherein stalling the assembly line interrupts and substantially reduces production. In fact, some of these pipelined CPUs (central processing units) can process multiple streams (threads) of instructions, even from different software programs, and this type of operation is called multi-threading. In a complex processor such as a digital signal processor (DSP) or other pipelined processor, knowing the duration of CPU pipeline stalls, associating them with the instructions that caused them, and reporting the reason for each stall can give software code developers very useful information for reducing stalls and increasing performance. Similarly, tracing events with respect to the instructions that were executing when these events occurred also provides important data for code developers.
High-performance computers often also have a multi-level cache-based memory system. The CPU operates so fast that it often cannot wait for instructions or data representing a current point of a software program to be obtained from a slower central memory or storage. Instead, high-performance circuits access such instructions and data far enough ahead of time to put them in a cache memory that is physically very close by and operates very fast for access by the CPU. The cache memory circuit may have hierarchical cache or multiple levels of cache to mediate the access process for higher performance. If the cache memory circuit lacks an instruction or data item that is needed quickly, e.g. by the CPU, a “cache miss” has occurred, which will likely have an impact on device performance.
For instance, in case of a cache miss, some of the cache circuitry generates a cache miss signal that stalls the applicable pipeline of the processor and activates a Stall field of a control register until the cache miss is cleared. In case of an L2 (Level 2) Cache miss in one example of a multi-threaded processor, an L2 Cache Miss line from the cache circuitry goes active and either hardware-activates the stall in a hardware state machine that controls the threading or activates a hardware interrupt to the operating system (OS). A stall duration circuit including one or more stall duration counters is suitably provided and is responsive to an active state of the Stall field to count up and deliver stall duration data representing the time duration of each stall in each pipeline.
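The counting behavior of such a stall duration counter can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the circuit itself: the function name stall_durations and the representation of the Stall field as one sampled value per clock cycle are hypothetical choices made for the example.

```python
def stall_durations(stall_field_samples):
    """Model a stall duration counter for one pipeline.

    stall_field_samples: iterable with one truthy/falsy value per clock
    cycle (assumed representation of the Stall field being active/inactive).
    Returns the duration, in clock cycles, of each completed stall.
    """
    durations = []
    count = 0
    for active in stall_field_samples:
        if active:
            count += 1               # count up while the Stall field is active
        elif count:
            durations.append(count)  # stall cleared: deliver the duration
            count = 0                # reset for the next stall
    if count:
        durations.append(count)      # stall still active at end of the samples
    return durations


# Example: three stalls lasting 3, 1 and 2 cycles.
samples = [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
print(stall_durations(samples))
```

In this model one counter serves one pipeline; a device with several pipelines would correspond to one such sample sequence per pipeline.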
In addition to such Stalls, various control and status signals called Events in a processor system are desirably traced to understand their connection with any problems in device operation. (Among such Event signals are address and data controls of one or more buses, memory read and write controls, DMA (direct memory access) activity, interrupts, signals from peripherals, or logical combinations thereof, any one or more of which are designated as Events for tracing purposes. An Event of interest could be composite in nature, such as a write with a specific data value generated by a specific instruction.)
Such Stalls and Events data can be vital for software code optimization, debug, and system profiling. In current complex cache-based systems, it is very difficult to model the system accurately and determine its performance precisely without collecting real-time data. It is possible in principle to collect performance information on such stalls and events data precisely and robustly through a trace-based system, and even to provide Reason data associated with a stall or an event. However, the trace output stream can impose very high bandwidth requirements, and such streams of trace data can overwhelm an attempt to capture them.
A processor has a clock circuit that generates pulses to continually actuate the processor, and also a program counter that can be advanced by the clock circuit to point to a new software instruction for access from cache and execution by the processor pipeline. If the pipeline is not ready for another instruction, the program counter may be temporarily disabled or inactive. A timing trace stream, i.e. the trace stream that indicates activity or non-activity of the program counter (PC) in each clock cycle, can account for a large percentage of the bandwidth demanded by the transmitted trace data. Trace bandwidth demand is related to the rate of trace data generation. If the activity pattern of the PC is quite complicated, massive amounts of trace bandwidth are demanded for the timing trace. For some background on trace export and synchronization markers, see U.S. Pat. No. 7,315,808 “Correlating on-chip data processor trace information for export” (TI-30481), which is incorporated herein by reference. See FIGS. 3-8 in that '808 patent and FIGS. 33-37 hereinbelow. For some background on trace encoding, see U.S. Pat. No. 7,721,263 “Debug Event Instruction” (TI-60665), which is also incorporated herein by reference. See also U.S. Patent Application Publication 20030033552 “Apparatus and method for wait state analysis in a digital signal processing system” (TI-33188), which is also incorporated herein by reference.
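As a rough illustration of why a complicated PC activity pattern is costly, the following sketch models a timing trace as one activity bit per clock cycle and then groups the bits into runs. The function names, the rule that the PC counts as active when its sampled value changes, and the run grouping are assumptions made for this example only; they do not describe the encoding of any referenced patent.

```python
from itertools import groupby


def timing_bits(pc_samples):
    """One bit per clock cycle: 1 if the PC advanced, 0 if it was inactive.

    Assumptions: the PC is sampled once per cycle, the first sample counts
    as active, and an unchanged PC value means the PC was inactive (so a
    one-instruction loop would look inactive in this simplified model).
    """
    samples = list(pc_samples)
    if not samples:
        return []
    bits = [1]
    for prev, cur in zip(samples, samples[1:]):
        bits.append(1 if cur != prev else 0)
    return bits


def activity_runs(bits):
    """Group the bit stream into (bit, run_length) pairs."""
    return [(bit, sum(1 for _ in group)) for bit, group in groupby(bits)]


# Example: the PC holds at 0x104 for two extra cycles (a two-cycle stall).
pcs = [0x100, 0x104, 0x104, 0x104, 0x108, 0x10C]
bits = timing_bits(pcs)
print(bits)
print(activity_runs(bits))
```

A stream that alternates frequently between activity and non-activity produces many short runs, so in this model there is more information to convey per clock cycle than for a stream with a few long runs.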
A VLIW DSP (very long instruction word digital signal processor), such as a TMS64xx™ processor from Texas Instruments Incorporated with eight data paths running at 600 MHz, can execute 4.8 BOPS (billion operations per second), i.e. the product of 8 instructions/clock-cycle×600 MHz. Capturing four-byte (32-bit) PC (program counter) values from even a single processor CPU running at 600 MHz would generate 2.4 GByte/sec of PC data (4 bytes/cycle×600 MHz). Serial output of the data would involve a clock rate of 19.2 GHz (8 bits/byte×2.4 GByte/sec), which would be impractical or at least uneconomical for most current systems. Even if on-chip compression were used to reduce this enormous bandwidth requirement by, e.g., a factor of 10 depending upon the program activity, the resulting average trace bandwidth would still be a massive 240 MByte/sec.
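The arithmetic in the preceding paragraph can be reproduced with a short calculation. The function name trace_bandwidth and its parameter names are hypothetical; the input values are the ones stated above.

```python
def trace_bandwidth(clock_hz, pc_bytes, compression_factor):
    """Return (raw bytes/sec, serial bits/sec, compressed bytes/sec)
    for exporting one PC value per clock cycle."""
    raw_bytes_per_sec = clock_hz * pc_bytes           # 4 bytes/cycle x 600 MHz
    serial_bits_per_sec = raw_bytes_per_sec * 8       # 8 bits/byte
    compressed_bytes_per_sec = raw_bytes_per_sec // compression_factor
    return raw_bytes_per_sec, serial_bits_per_sec, compressed_bytes_per_sec


clock_hz = 600_000_000                 # 600 MHz
ops_per_sec = 8 * clock_hz             # eight data paths: 4.8 billion ops/sec
raw, serial, compressed = trace_bandwidth(clock_hz, pc_bytes=4, compression_factor=10)

print(ops_per_sec)   # 4800000000  -> 4.8 BOPS
print(raw)           # 2400000000  -> 2.4 GByte/sec
print(serial)        # 19200000000 -> 19.2 GHz serial clock
print(compressed)    # 240000000   -> 240 MByte/sec
```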
The code sequences run for system profiling are huge. Obtaining accurate Stalls and Events data per instruction executed, along with Reason data on the Stalls and Events, would imply that large quantities of data must be collected and exported by the trace hardware. The trace hardware could therefore have to occupy uneconomical amounts of circuitry and integrated circuit chip area or real estate.
Accordingly, substantial technological departures and alternatives in trace circuitry, traceable processor devices, and processes for efficiently and economically structuring, operating and signaling in such circuitry are highly important and needed in this field.