The invention relates to a circuit arrangement with a plurality of functional units each of which comprises a plurality of data processing modules and a local controller, said plurality of data processing modules running a common system clock and being connected by a streaming data bus running a handshake-type streaming data transfer protocol. The invention also relates to a method for profiling a data flow of streaming data for use in such a circuit arrangement.
When building large systems-on-a-chip (SoCs), such as those used in mobile communication applications, designers combine several IP (intellectual property) blocks, also known as IP cores, possibly even from different vendors, via well-defined bus interfaces.
Complex SoCs, with multiple embedded controllers communicating concurrently both with each other and with other hardware units (e.g. data processing modules), pose a challenge when it comes to optimizing system performance, finding bottlenecks and, even more so, debugging real-time problems.
Handshake-type bus protocols are known as a simple and straightforward means to stream data between data processing modules within one component of an SoC and also between data processing modules of different components. However, a system based on a handshake-type bus protocol interconnection might still exhibit complex and unexpected behavior. Even if data is processed nominally, the system performance can still be inferior because modules stall each other, depending on their processing speeds and their interdependencies. The system could even run into a deadlock situation although all modules work in accordance with their specifications. These deadlocks and less fatal bottlenecks are especially hard to debug, since they are usually not caught by simulation due to resource limitations (time and test cases).
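The mutual stalling described above can be illustrated with a minimal cycle-based sketch of a single handshake link. The names `simulate`, `producer_period` and `consumer_period`, the valid/ready terminology and the absence of any FIFO between the two modules are assumptions made for illustration only; they are not taken from any specific bus protocol.

```python
def simulate(cycles, producer_period, consumer_period):
    """Model one handshake link with no FIFO between producer and consumer.

    A word is transferred in a clock cycle only if the producer offers one
    (valid) and the consumer can accept one (ready). After a transfer, each
    side needs (period - 1) further cycles before it can take part again.
    Returns (transfers, producer_stalls, consumer_stalls).
    """
    transfers = producer_stalls = consumer_stalls = 0
    prod_wait = 0  # cycles until the producer has the next word
    cons_wait = 0  # cycles until the consumer can accept the next word
    for _ in range(cycles):
        valid = prod_wait == 0
        ready = cons_wait == 0
        if valid and ready:
            transfers += 1
            prod_wait = producer_period - 1
            cons_wait = consumer_period - 1
        else:
            if valid:
                producer_stalls += 1  # data offered, consumer not ready
            if ready:
                consumer_stalls += 1  # consumer ready, no data offered
            prod_wait = max(0, prod_wait - 1)
            cons_wait = max(0, cons_wait - 1)
    return transfers, producer_stalls, consumer_stalls


# A fast producer feeding a slow consumer spends most cycles stalled.
print(simulate(12, producer_period=1, consumer_period=3))
```

In this sketch both modules behave exactly as specified, yet the link throughput is set by the slower side and the faster side accumulates stall cycles.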
Some examples of complex interdependencies are illustrated in FIGS. 1 and 2.
Several scenarios of how one data processing module of an SoC component can influence another, sometimes via several hops, will be exemplified with reference to FIG. 1. In FIG. 1, the streaming protocol fabric is depicted with bold arrows, control paths with thin arrows. The exemplary component of FIG. 1 comprises six data processing modules 11A-11F and a local controller 12. Data processing module A provides data in an aligned manner to both data processing modules B and C. Data processing module C processes data from data processing modules A, E, and F in an aligned manner. As will be understood from the figure, a stall, i.e. a delay, in module B can stall module A, because A cannot send data to B. A stall in module A can stall modules E and F, as module C processes data from A, E and F in an aligned manner and therefore cannot accept data from E and F while data from A is missing. A stall in module B can stall module C, as module A sends data synchronously to B and C. There is even a possibility of a deadlock situation in case of a ring dependency from module A via modules B and D back to A. Hence, if there is not enough FIFO capacity along the route, a stall in one module will bring the whole loop to a halt and, because of the aforementioned scenarios, all other modules of the component, too.
Whether or not situations as described above will occur depends firstly on the individual modules' inherent processing and communication patterns, and secondly on the programming and start sequence applied by the component controller.
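The stall scenarios above can be sketched as a small graph analysis. The link list below is reconstructed from the prose only and is an assumption about FIG. 1; the function names are hypothetical. The sketch takes the worst case of no FIFO slack on any link, so that a stall propagates upstream by backpressure and downstream by starvation.

```python
from collections import defaultdict

# Streaming links (producer, consumer), assumed from the description of FIG. 1.
LINKS = [("A", "B"), ("A", "C"), ("E", "C"), ("F", "C"), ("B", "D"), ("D", "A")]


def affected_by_stall(links, stalled):
    """Return the modules that may stall if `stalled` stalls (zero FIFO slack)."""
    neighbours = defaultdict(set)
    for src, dst in links:
        neighbours[src].add(dst)  # starvation: the consumer waits for data
        neighbours[dst].add(src)  # backpressure: the producer cannot send
    seen = {stalled}
    frontier = [stalled]
    while frontier:
        module = frontier.pop()
        for other in neighbours[module]:
            if other not in seen:
                seen.add(other)
                frontier.append(other)
    seen.discard(stalled)
    return seen


def find_ring(links):
    """Return one ring dependency as a list of modules, or None if there is none."""
    successors = defaultdict(list)
    for src, dst in links:
        successors[src].append(dst)
    state = {}  # 1 = on the current search path, 2 = fully explored
    path = []

    def visit(module):
        state[module] = 1
        path.append(module)
        for nxt in successors[module]:
            if state.get(nxt) == 1:
                return path[path.index(nxt):]  # closed a loop back to nxt
            if nxt not in state:
                ring = visit(nxt)
                if ring:
                    return ring
        state[module] = 2
        path.pop()
        return None

    for module in list(successors):
        if module not in state:
            ring = visit(module)
            if ring:
                return ring
    return None


print(sorted(affected_by_stall(LINKS, "B")))
print(find_ring(LINKS))
```

Under these assumptions, a stall in B reaches every other module of the component, and the ring through A, B and D is reported as a deadlock candidate. Real links with FIFO capacity would absorb short stalls, which this sketch deliberately ignores.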
FIG. 2 exemplifies inter-component dependencies, illustrating that stall scenarios similar to those mentioned above are also possible across component boundaries. FIG. 2 shows a first SoC component 20 controlled by a first local controller 22 and comprising two data processing modules, 21A and 21B, and a second SoC component 30 controlled by a second local controller 32 and comprising two data processing modules, 31C and 31D. In FIG. 2, intra-component streaming data paths are shown as dashed bold arrows, streaming data paths across component boundaries as solid bold arrows, and control paths as thin arrows. As will be understood from FIG. 2, a stall in D, for example, can stall C, as A is sending aligned data to C and D. However, such bottlenecks are even more complicated to detect and avoid, because two independent component controllers are involved.
Various methods are known to tackle the problem of real-time debugging and profiling in general. These include, for example, debug buses, test code run by the embedded controller(s), and means to observe internal states via debug ports, optionally connected to an external logic analyzer.
However, especially when it comes to profiling for system improvement, these known methods have considerable drawbacks. When using debug ports and/or external logic analyzers, the problem is that on a pin-limited SoC, and also on Field Programmable Gate Array (FPGA) prototypes, there usually are not enough pins to accommodate this task. With internal trace memory, the issue is that on-chip memory is a scarce and expensive resource, especially on an ASIC, and using it just for profiling usually cannot be justified. Re-assigning functional memory to profiling is a potential solution; however, there might not be enough internal memory available, or this approach might interfere with normal operation.
Having the embedded controller(s) run diagnostic code is usually possible with only small extra cost in code and data memory. However, the result might be misleading, because running the diagnostic code changes the actual system timing and behavior, so the profile obtained will be of less value or even wrong.
What is needed in the art, therefore, is a simple and low-cost means for assessing intra-component and inter-component link performance and communication patterns on large SoCs.