In the design of digital circuits, particularly so-called systems on a chip (SOC), a number of components must be connected to one another. These components can be divided into master and slave units. Masters are generators of transactions which cause a data transfer, and slaves are consumers of transactions. The data flow itself can take place in two directions: from a master to a slave (write transaction) and from a slave to a master (read transaction). A component to be connected can also be both a master and a slave at the same time if it provides more than one interface type.
A communication system arranged between the masters and the slaves should enable a number of masters to communicate simultaneously in a non-blocking way with multiple slaves. The different interface types within the same component and/or within different components can be operated at the same clock rate, at clock rates dependent on one another, or at completely independent asynchronous clock rates. The last case is the most difficult one to handle when a total of M master interfaces communicate with a number S of slave interfaces.
The simplest solution to the communication problem mentioned above consists of providing an independent communication path from each master to each slave with which the master is to communicate, and arranging memories with Gray-coded read and write pointers on each communication path. The memories (buffers) synchronize the transaction codes, the write data and the read data between the clock domains of a master and of a slave. The solution meets the following criteria: a short latency, namely that of an individual buffer between a master and a slave; high throughput, because independent communication channels provide for non-blocking parallel communication; and independent clock relations between each master and each slave, so that a change in the clock rate of an individual module has no effect on the interconnection. One disadvantage of this concept is that the silicon area increases greatly for larger systems because M×S buffers are needed. If, for example, the buffers need to accommodate 32-bit wide data words and burst sizes of 8 words, this concept leads to a significant consumption of area.
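The role of the Gray-coded pointers can be illustrated with a short sketch. It is not taken from any particular implementation; the function names and the 4-bit pointer width are assumptions chosen for illustration. The sketch shows the property that makes such pointers suitable for crossing between clock domains: successive pointer values, including the wrap-around, differ in exactly one bit, so a pointer sampled in the other clock domain is either the old or the new value.

```python
def bin_to_gray(n: int) -> int:
    """Convert a binary counter value to its Gray-code representation."""
    return n ^ (n >> 1)


def gray_to_bin(g: int) -> int:
    """Convert a Gray-coded value back to the binary counter value."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


POINTER_BITS = 4  # assumed pointer width for illustration (16 counter values)
codes = [bin_to_gray(i) for i in range(1 << POINTER_BITS)]

# Each increment of the pointer, including the wrap from the last value
# back to zero, changes exactly one bit of the Gray-coded pointer.
one_bit_steps = all(
    hamming(codes[i], codes[(i + 1) % len(codes)]) == 1
    for i in range(len(codes))
)

# The conversion is lossless, so the receiving domain can recover the count.
round_trip_ok = all(gray_to_bin(c) == i for i, c in enumerate(codes))
```

A plain binary pointer does not have this property: the step from 7 to 8, for example, changes four bits at once, and a sample taken during the change could yield an unrelated value.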
Another solution to the above-mentioned problem improves the utilization of silicon area by using a two-stage clock domain crossing: firstly, each master uses one or more buffers for the transition to a crossbar network. At the end of the network, another storage element is used for changing into the clock domain of the slave. This requires M+S instead of M×S buffers, which is a significant reduction for larger numbers of masters and slaves. However, the latency is higher since two buffers must be passed through on each path from a master to a slave. The throughput is the same because the crossbar provides for non-blocking communication from each master to each slave. Furthermore, flexibility of the clock rates of masters and slaves is provided by the private buffers for each master and slave, which enable each clock domain to be bridged without adversely affecting other components.
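The difference between M×S and M+S buffers can be made concrete with a small calculation. The sketch below is a rough estimate only: it assumes exactly one buffer per path or per interface, assumes that each buffer stores one burst of 8 words of 32 bits (the figures mentioned above), and ignores pointers, transaction codes and separate read and write paths. The function names and the example values M=4 and S=6 are hypothetical.

```python
WORD_BITS = 32    # data word width from the example in the text
BURST_WORDS = 8   # burst size from the example in the text


def buffers_per_path(m: int, s: int) -> int:
    """One buffer on every master-to-slave path: M x S buffers."""
    return m * s


def buffers_two_stage(m: int, s: int) -> int:
    """One buffer per master plus one per slave around a crossbar: M + S."""
    return m + s


def data_bits(buffers: int) -> int:
    """Data storage only, assuming each buffer holds exactly one burst."""
    return buffers * WORD_BITS * BURST_WORDS


M, S = 4, 6  # hypothetical system size
per_path = buffers_per_path(M, S)      # 24 buffers
two_stage = buffers_two_stage(M, S)    # 10 buffers
per_path_bits = data_bits(per_path)    # 6144 bits
two_stage_bits = data_bits(two_stage)  # 2560 bits
```

Under these assumptions the saving appears only once M×S exceeds M+S; for M=2 and S=2 both schemes need four buffers, and the two-stage scheme only adds latency.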
For a routing application, where the data flow from the masters to the slaves is unidirectional, even smaller interconnect schemes are known. The transaction code is extracted from a header of a data packet which is sent in frames. The transaction code comprises at least the destination address and the number of data units to be transmitted. The data units follow as the payload of the packet. For each master, one buffer is provided for accepting the input packet. The destination address is used for determining the output port to a slave. Since a plurality of masters may wish to communicate with the same slave simultaneously, an arbitration mechanism provided for each slave decides which of the masters is actually connected at a given time. The connections themselves can be provided by a crossbar structure. The buffers can enable the different clock domains between a master and a slave to be bridged. The interconnect structure reduces the number of buffers to M. The latency is also reduced to the latency of a single buffer between a master and a slave. The throughput is the same as in the other known concepts due to the non-blocking crossbar.
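The routing scheme can be sketched as follows. The text does not specify the arbitration policy, so a round-robin arbiter per slave is assumed here purely for illustration; the class names, field names and the example packets are likewise hypothetical. The sketch reads the destination address from the header of the packet at the head of each master's input buffer and lets each slave's arbiter grant one requesting master.

```python
from dataclasses import dataclass


@dataclass
class Packet:
    dest: int       # destination slave index, taken from the header
    length: int     # number of data units announced in the header
    payload: list   # data units that follow the header


class RoundRobinArbiter:
    """Per-slave arbiter; round-robin is an assumed policy, not from the text."""

    def __init__(self, num_masters: int):
        self.num_masters = num_masters
        self.last = num_masters - 1  # so that master 0 is considered first

    def grant(self, requesters: set):
        # Search the masters in cyclic order, starting after the last winner.
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last + offset) % self.num_masters
            if candidate in requesters:
                self.last = candidate
                return candidate
        return None


def schedule(head_packets: dict, arbiters: list) -> dict:
    """Map each requested slave to the master granted in this round."""
    requests = {}
    for master, pkt in head_packets.items():
        requests.setdefault(pkt.dest, set()).add(master)
    return {slave: arbiters[slave].grant(reqs) for slave, reqs in requests.items()}


# Three masters, two slaves; masters 0 and 1 both address slave 1.
arbiters = [RoundRobinArbiter(3), RoundRobinArbiter(3)]
heads = {
    0: Packet(dest=1, length=2, payload=[0xA, 0xB]),
    1: Packet(dest=1, length=1, payload=[0xC]),
    2: Packet(dest=0, length=1, payload=[0xD]),
}
first_round = schedule(heads, arbiters)
second_round = schedule(heads, arbiters)  # same requests, next round
```

In the first round master 0 is connected to slave 1 and master 2 to slave 0 in parallel, which reflects the non-blocking crossbar. In the second round, with unchanged requests, slave 1 is granted to master 1 instead.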
A further problem which occurs when designing digital circuits with systems on a chip as mentioned above is that interfaces or internal accelerators need to transfer data between one another and, for example, to and from a memory at high data rates. It is not practical to transmit data at high data rates by using a CPU which copies the data, because a CPU must first write the data into an internal register before it can forward the data to another destination. For this reason, components with high data rates are usually connected to a DMA (Direct Memory Access) unit. At the least, this enables components with high data rates to access a memory for a relatively long time independently of a CPU, with the source and destination addresses having to be reconfigured from time to time by the CPU. This feature is called peripheral-to-memory copy and memory-to-peripheral copy because the memory access can occur in both directions. A DMA also enables direct copying between two components without intermediate storage in a memory (peripheral-to-peripheral copy). It also makes it possible to implement a memory-to-memory copy.
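The four copy modes named above follow directly from the kind of source and destination that the CPU configures. The sketch below only illustrates this naming; the `Descriptor` structure, its fields and the example addresses are hypothetical and do not describe any particular DMA controller.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    MEMORY = "memory"
    PERIPHERAL = "peripheral"


@dataclass
class Descriptor:
    """Hypothetical transfer setup written by the CPU from time to time."""
    src_kind: Kind
    src_addr: int
    dst_kind: Kind
    dst_addr: int
    count: int  # number of data words to move without CPU involvement


def copy_mode(desc: Descriptor) -> str:
    """Name the copy mode implied by the source and destination kinds."""
    return f"{desc.src_kind.value}-to-{desc.dst_kind.value} copy"


# Example: a peripheral delivers 64 words into memory (addresses are made up).
rx = Descriptor(Kind.PERIPHERAL, 0x4000_0000, Kind.MEMORY, 0x2000_0000, 64)
rx_mode = copy_mode(rx)

# All four combinations of source and destination kind.
all_modes = sorted(
    copy_mode(Descriptor(s, 0, d, 0, 1)) for s in Kind for d in Kind
)
```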
There are two possibilities for a DMA implementation: decentralized DMA and centralized DMA. The decentralized DMA requires that each component has an inbuilt bus-master capability so that it (or other components) can independently access a memory. The centralized DMA can form an interface to components with simple slave interfaces by working as a master for these slave interfaces when carrying out data transmissions to these components, and by operating as a master for the memory interfaces. A DMA controller of a central DMA can be coupled to all components by using a single interface or by using a number of interfaces. The more interfaces that are used, the higher the data rates that can be achieved and the more independence can theoretically be provided in the clock domains, since a single interface requires all components to be operated in the same clock domain. The design of a central DMA architecture for different SOC designs is complicated for the following three reasons:
1. The components which are to be connected to a central DMA unit usually have interfaces of different origin: some are older in-house modules, some are procured from different IP suppliers at different times. In the worst case, each of the components to be connected can have a different type of slave interface. A central DMA unit should therefore be capable of handling master tasks for a plurality of interfaces.
2. Each component connected to the central DMA unit should be capable of operating at its own clock speed, which is independent of, and asynchronous to, the clock speeds of the other connected components, the DMA itself and the memory.
3. The number of components to be connected to a DMA unit, the number of interfaces supported, the clock speeds and the data rates are highly system-specific. For this reason, each SOC should ideally have a tailor-made DMA, but this causes high implementation and verification expenditure.