Large Systems-on-a-Chip (SOC) usually include several components that contain data processing units, together with a local controller, and that perform some defined task or sub-task. To perform the system task, these local controllers interact with each other. This interaction, in addition to the control of the component's data processing units, is often based on interrupt handling and makes the software for these controllers complicated. The system may further exhibit hard-to-debug behavior, e.g. deadlocks, because the processor interaction creates interdependencies. On the other hand, visibility into the system, the processor states, and the software interaction is limited, as observability is constrained by factors such as pin count, intra-chip bandwidth, and power minimization. As a compromise between observability and these limiting factors, some form of software tracing is often applied. This is basically a kind of "printf debugging": the software running on the different processors explicitly sends status information to a dedicated trace interface, which in turn provides means to collect the data either in internal memory or off-chip. There are several, partly contradicting, goals for this:
- maximum visibility into software states
- flexible data payload (from a simple "this procedure was started now" to copies of processed data)
- exact information about event time
- minimal impact on the traced software
- no alteration of timing, both for local and for remote processors
- minimal code overhead
- minimal extra hardware resources
- minimal extra power consumption
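The "printf-debugging" style of software tracing described above can be sketched as follows. This is a minimal illustrative model, not taken from the source: the record layout (processor ID, event ID, payload length, timestamp) and the `trace_buffer` stand-in for the trace interface are assumptions chosen only to show the who/what/when-plus-payload idea.

```python
import struct
import time

# Stand-in for the dedicated trace interface; on real hardware this would
# be a write to a memory-mapped trace register or FIFO (an assumption for
# illustration, not a description of any particular product).
trace_buffer = []

def trace_event(processor_id, event_id, payload=b""):
    """Emit one trace record: who (processor), what (event), when, payload."""
    timestamp = time.monotonic_ns()  # exact information about event time
    # Hypothetical packed header: 1-byte processor ID, 1-byte event ID,
    # 2-byte payload length, 8-byte timestamp (little-endian).
    header = struct.pack("<BBHQ", processor_id, event_id, len(payload), timestamp)
    trace_buffer.append(header + payload)

# Simplest use: "this procedure was started now"
trace_event(processor_id=3, event_id=0x10)
# Flexible payload: a copy of processed data
trace_event(processor_id=3, event_id=0x11, payload=b"\xde\xad\xbe\xef")
```

The insertion of such trace calls is itself the "inevitable timing impact" on the traced program that the drawbacks below refer to.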
FIG. 1 shows a state-of-the-art architecture, where trace data is sent over the same backbone bus system as functional data. The trace architecture is essentially a hierarchical star topology, as it follows the data bus backbone. Trace data is generated by programs running on the components' local processors 12. It is then sent via the component's local crossbar 13, a second-level crossbar 16, and a top-level crossbar 19 to a trace interface block 17, which is basically an interface to either local memory or pins for off-chip data acquisition 18. The bus system is AHB (advanced high-performance bus) and the crossbars are AHB multilayers, although other bus standards would work equally well.
Most functional data flow is contained within a component 111, 112 to 11x and thus travels only via the level 1 crossbar 113. Besides the local processor 12, each component includes a number of bus slaves 141, 142 to 14n, for example data processing blocks, and may include one or more DMA engines 151, 152 to 15m. Only the processor 12 and the DMA engines 151, 152 to 15m are bus masters and can initiate data transfers. The data streams originating from the processor 12 and the DMA engines 151, 152 to 15m do not interfere with each other as long as they target different bus slaves 141, 142 to 14n, such as processing blocks or the port to the next-level crossbar.
This current tracing approach has the advantage of not needing extra hardware resources to transport the trace data. However, there are also several drawbacks:
- Besides the inevitable timing impact on the traced program, i.e. simply by the insertion of trace commands, there is an extra level of impact caused by arbitration stalls, because several bus masters compete for the path to the trace interface.
- Not only the local program timing is influenced, but also programs running on other processors, because trace data from one component can block the data path for functional data to/from another component.
- Addresses for write operations to the trace master are forwarded along the whole path from a processor to the trace master, causing power consumption on the address bus.
When trying to level out the timing impact tracing has on a program, one method is to always send trace data in shadow mode, not only in a debug scenario, so that there is no timing difference between runs with normal and with instrumented code. In a non-debug scenario, the trace data is then discarded at the trace interface. This, however, has a negative impact on power consumption, as unnecessary data transfers occur on all levels of the bus hierarchy.
Trace data is packetized and usually consists of more data than can be transferred in a single bus transaction. Due to the star topology, and because of the arbitration of the data paths from the different processors to the trace interface, there is no guarantee that a trace data packet from a specific processor arrives at the trace interface as a continuous, uninterrupted stream. In order to ensure packet integrity, the trace interface master needs to contain trace context FIFOs and the appropriate control logic to first collect all packet data in one such FIFO before forwarding the trace packet to data acquisition. This is illustrated in FIG. 2, which shows a state-of-the-art trace interface master for a star topology. Trace data arriving from a crossbar is allocated to one of the trace context FIFOs 231, 232 to 23n via multiplexer 24 and control 22, according to the processor the trace data originates from. The trace data of one processor of interest is then forwarded to data acquisition via multiplexer 25 and control 22.
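The reassembly behavior of such a trace interface master can be sketched as follows. The beat format `(processor_id, data, is_last_beat)` and the function name `trace_master` are assumptions for illustration; the structure mirrors FIG. 2, with one context FIFO per originating processor and a selection step for the processor of interest.

```python
from collections import defaultdict

def trace_master(beats, processor_of_interest):
    """Collect interleaved bus beats into per-processor context FIFOs and
    forward only complete packets of the selected processor.

    beats: iterable of (processor_id, data, is_last_beat_of_packet).
    """
    context_fifos = defaultdict(list)  # one context FIFO per processor
    acquisition = []
    for proc, data, last in beats:
        # Demultiplex by originating processor (multiplexer 24 / control 22).
        context_fifos[proc].append(data)
        if last:
            # Packet is complete: forward it as one uninterrupted unit.
            packet = context_fifos[proc]
            context_fifos[proc] = []
            # Select the processor of interest (multiplexer 25 / control 22).
            if proc == processor_of_interest:
                acquisition.append(packet)
    return acquisition

# Beats from processors 1 and 2 arrive interleaved due to arbitration:
beats = [(1, "a0", False), (2, "b0", False), (1, "a1", True), (2, "b1", True)]
```

Even though the beats of the two packets interleave on the bus, each packet reaches data acquisition whole, which is the integrity guarantee the context FIFOs exist to provide.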
Further, the number of trace context FIFOs is determined by the number of processors in the components and hence the trace master needs to be adjusted each time this number changes.