Technical Field
Methods and example embodiments described herein are generally directed to interconnect architecture, and more specifically, to network-on-chip system interconnect architecture.
Related Art
The number of components on a chip is rapidly growing due to increasing levels of integration, system complexity and shrinking transistor geometry. Complex System-on-Chips (SoCs) may involve a variety of components, e.g., processor cores, DSPs, hardware accelerators, memory and I/O, while Chip Multi-Processors (CMPs) may involve a large number of homogeneous processor cores, memory and I/O subsystems. In both systems, the on-chip interconnect plays a key role in providing high-performance communication between the various components.
Due to scalability limitations of traditional buses and crossbar based interconnects, Network-on-Chip (NoC) has emerged as a paradigm to interconnect a large number of components on the chip. A NoC is a global shared communication infrastructure made up of several routing nodes interconnected with each other using point-to-point physical links. Messages are injected by the source and are routed from the source node to the destination over multiple intermediate nodes and physical links. The destination node then ejects the message and provides it to the destination. For the remainder of the document, the terms ‘components’, ‘blocks’, ‘hosts’ or ‘cores’ will be used interchangeably to refer to the various system components which are interconnected using a NoC. The terms ‘routers’ and ‘nodes’ will also be used interchangeably. Without loss of generality, the system with multiple interconnected components will itself be referred to as a ‘multi-core system’.
There are several possible topologies in which the routers can connect to one another to create the system network. Bi-directional rings (as illustrated in FIG. 1(a)) and 2-D mesh (as illustrated in FIG. 1(b)) are examples of topologies in the related art.
As illustrated in FIG. 2, a full 2D mesh comprises a grid structure, with a router at each cross point of the grid. The grid has a specific number of routers on the X and Y axes. This defines the size of the network, 5×5 being the size in this example. Each router is identified on the grid using its XY co-ordinate. In the figure, the origin is at the upper left corner of the grid and each router depicts its ID or XY co-ordinate. Each router on the grid has four direction ports, and on each of these ports the router can transmit and receive messages over the interconnect wires which form a point-to-point link between the router and the next router along the port. Each router also has one or more host ports through which it connects to host blocks using point-to-point links. The host blocks receive and/or transmit messages from and/or to the network through the host ports.
Packets are message transport units for intercommunication between various components. Routing involves identifying a path, which is a set of routers and physical links of the network over which packets are sent from a source to a destination. Components are connected to one or multiple ports of one or multiple routers, with each such port having a unique identification (ID). Packets can carry the destination's router and port ID for use by the intermediate routers to route the packet to the destination component.
Examples of routing techniques include deterministic routing, which involves choosing the same path from A to B for every packet. This form of routing is oblivious of the state of the network and does not load balance across path diversities which might exist in the underlying network. However, deterministic routing is simple to implement in hardware, maintains packet ordering, and is easy to make free of network-level deadlocks. Shortest path routing minimizes the latency as it reduces the number of hops from the source to the destination. For this reason, the shortest path is also the lowest power path for communication between the two components. Dimension order routing is a form of deterministic shortest path routing in two-dimensional (2D) mesh networks. Adaptive routing can dynamically change the path taken between two points on the network based on the state of the network. This form of routing may be complex to analyze for deadlocks and may have complexities associated with maintaining packet ordering. Because of these implementation challenges, adaptive routing is rarely used in practice.
FIG. 2 illustrates an example of dimension order routing in a two dimensional mesh. More specifically, FIG. 2 illustrates XY routing from node ‘34’ to node ‘00’. In the example of FIG. 2, each component is connected to only one port of one router. A packet is first routed in the X dimension (−X or West direction in this case) until it reaches node ‘04’, where the X co-ordinate is the same as the destination's X co-ordinate. The packet is next routed in the Y dimension (−Y or North direction in this case, since the origin is at the upper left corner of the grid) until it reaches the destination node.
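The XY routing of FIG. 2 can be sketched in a few lines of Python. This is a minimal illustration only: the function names `xy_route` and `node_id` are hypothetical, and routers are modeled simply as (X, Y) tuples rather than as hardware.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited, X dimension first, then Y."""
    x, y = src
    dst_x, dst_y = dst
    path = [(x, y)]
    # Route along X until the X co-ordinate matches the destination's.
    while x != dst_x:
        x += 1 if dst_x > x else -1
        path.append((x, y))
    # Then route along Y until the destination router is reached.
    while y != dst_y:
        y += 1 if dst_y > y else -1
        path.append((x, y))
    return path


def node_id(coord):
    """Format a router co-ordinate as the two-digit XY ID used in FIG. 2."""
    return f"{coord[0]}{coord[1]}"


path = xy_route((3, 4), (0, 0))
print(" -> ".join(node_id(c) for c in path))
# prints: 34 -> 24 -> 14 -> 04 -> 03 -> 02 -> 01 -> 00
```

Because the path depends only on the source and destination co-ordinates, every packet between the same pair of nodes takes the same route, which is the deterministic property described above.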
Deterministic algorithms like dimension order routing can be implemented using combinatorial logic at each router. Routing algorithms can also be implemented using look-up tables at the source node or at each router along the path on the network. Source routing involves the source node embedding routing information for each packet into the packet header. In its simplest form, this routing information is an ordered list of output links to take on each router along the path. The routing information is updated at each node to shift out the information corresponding to the current hop. A distributed approach to table based routing uses lookup tables at each hop in the network. These tables store the outgoing link information for each destination reachable through the router. Table based implementations of routing algorithms offer additional flexibility and are more suited to dynamic routing.
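The two table based approaches can be contrasted with a short Python sketch. All names here (`PORT_STEP`, `xy_port`, `build_tables`, `route_with_tables`, `forward_source_routed`) are hypothetical, the port labels N/S/E/W/HOST are illustrative, and the tables are filled using dimension order routing purely as an example; North is assumed to decrease Y because the origin is at the upper left corner of the grid.

```python
# Assumption: origin at upper left, so North decreases Y and South increases Y.
PORT_STEP = {"W": (-1, 0), "E": (1, 0), "N": (0, -1), "S": (0, 1)}


def xy_port(here, dst):
    """Output port chosen by dimension order routing at router `here`."""
    if dst[0] != here[0]:
        return "E" if dst[0] > here[0] else "W"
    if dst[1] != here[1]:
        return "S" if dst[1] > here[1] else "N"
    return "HOST"  # packet has arrived; eject to the host port


def build_tables(width, height):
    """Distributed approach: one table per router, one entry per destination."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    return {r: {d: xy_port(r, d) for d in nodes} for r in nodes}


def route_with_tables(tables, src, dst):
    """Each hop looks up the destination in its own table."""
    here, ports = src, []
    while True:
        port = tables[here][dst]
        ports.append(port)
        if port == "HOST":
            return ports
        dx, dy = PORT_STEP[port]
        here = (here[0] + dx, here[1] + dy)


def forward_source_routed(header_route):
    """Source routing: read the entry for this hop and shift it out of the header."""
    return header_route[0], header_route[1:]


tables = build_tables(5, 5)
route = route_with_tables(tables, (3, 4), (0, 0))
print(route)  # ['W', 'W', 'W', 'N', 'N', 'N', 'N', 'HOST']

# With source routing, the same list would be embedded in the packet header
# by the source node and consumed one entry per hop.
port, remaining = forward_source_routed(route)
print(port, remaining)  # W ['W', 'W', 'N', 'N', 'N', 'N', 'HOST']
```

Changing a routing decision in the distributed case only requires rewriting table entries, which is one way to read the added flexibility mentioned above.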
An interconnect may contain multiple physical networks. Over each physical network, there may exist multiple virtual networks, wherein different message types are transmitted over different virtual networks. Virtual channels provide logical links over the physical channels connecting two ports. Each virtual channel can have an independently allocated and flow controlled flit buffer in the network nodes. In any given clock cycle, only one virtual channel can transmit data on the physical channel.
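As a rough sketch of how several virtual channels might share one physical channel, the following Python model gives each virtual channel its own flit buffer and lets at most one of them transmit per cycle. The class name `PhysicalChannel` and the round-robin arbitration policy are assumptions made for illustration; the text above does not specify an arbitration scheme, and flow control is omitted.

```python
from collections import deque


class PhysicalChannel:
    """Hypothetical model: one physical link shared by several virtual channels."""

    def __init__(self, num_vcs):
        # Each virtual channel has its own independently allocated flit buffer.
        self.vc_buffers = [deque() for _ in range(num_vcs)]
        self._next = 0  # round-robin pointer (assumed arbitration policy)

    def enqueue(self, vc, flit):
        self.vc_buffers[vc].append(flit)

    def cycle(self):
        """Transmit at most one flit this cycle; return (vc, flit) or None."""
        n = len(self.vc_buffers)
        for offset in range(n):
            vc = (self._next + offset) % n
            if self.vc_buffers[vc]:
                self._next = (vc + 1) % n
                return vc, self.vc_buffers[vc].popleft()
        return None  # no virtual channel had a flit to send


link = PhysicalChannel(num_vcs=2)
link.enqueue(0, "a0")
link.enqueue(0, "a1")
link.enqueue(1, "b0")
sent = [link.cycle() for _ in range(4)]
print(sent)  # [(0, 'a0'), (1, 'b0'), (0, 'a1'), None]
```

The interleaving in the output shows the point of the scheme: a message on one virtual channel does not have to wait for the whole of a message on another to drain before it can use the link.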
NoC interconnects often employ wormhole routing, wherein a large message or packet is broken into small pieces called flits (also called flow control digits). The first flit is the header flit, which holds information about this packet's route and key message level information along with some payload data, and sets up the routing behavior for all subsequent flits associated with the message. Zero or more body flits follow the head flit, containing the remaining payload of data. The final flit is the tail flit, which in addition to containing the last payload also performs some bookkeeping to close the connection for the message. In wormhole flow control, virtual channels are often implemented.
The term “wormhole” refers to the way messages are transmitted over the channels: When the head of a packet arrives at an input, the destination can be determined before the full message arrives. This allows the router to quickly set up the route upon arrival of the head flit and then transparently forward the remaining body flits of the packet. Since a message is transmitted flit by flit, it may occupy several flit buffers along its path at different routers, creating a worm-like image.
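The head, body and tail flit structure described above can be illustrated with a small Python sketch. The names `Flit`, `packetize` and `forward`, the 4-byte flit payload size, and the choice to always emit a separate tail flit (even for a message that fits in the head flit) are all assumptions made for the example, not details taken from any particular NoC.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Flit:
    kind: str                          # "head", "body" or "tail"
    route: Optional[Tuple[int, int]]   # destination info, carried by the head flit only
    payload: bytes


def packetize(dest, message, flit_bytes=4):
    """Break a message into a head flit, zero or more body flits, and a tail flit."""
    chunks = [message[i:i + flit_bytes] for i in range(0, len(message), flit_bytes)]
    while len(chunks) < 2:             # assumption: always emit a distinct tail flit
        chunks.append(b"")
    last = len(chunks) - 1
    flits = []
    for i, chunk in enumerate(chunks):
        kind = "head" if i == 0 else "tail" if i == last else "body"
        flits.append(Flit(kind, dest if i == 0 else None, chunk))
    return flits


def forward(flits, route_fn):
    """Router behavior: set up the route on the head flit, reuse it until the tail."""
    out_port = None
    for flit in flits:
        if flit.kind == "head":
            out_port = route_fn(flit.route)   # route is computed once per packet
        yield out_port, flit
        if flit.kind == "tail":
            out_port = None                   # tail flit closes the connection


flits = packetize((0, 0), b"abcdefghij")
print([f.kind for f in flits])  # ['head', 'body', 'tail']
ports = [port for port, _ in forward(flits, lambda dest: "W")]
print(ports)  # ['W', 'W', 'W']
```

Only the head flit carries routing information; the body and tail flits simply follow the output port that the head flit established, which is what allows forwarding to begin before the full message has arrived.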
FIG. 3 illustrates a related art scheme for connecting a block 301 in a SoC to the NoC interconnect. The block attaches to the NoC through a bridge or network interface unit (NIU) 302 which translates messages from the block into packetized format for the NoC. The other side of the NIU attaches to one or more ports of one or more NoC routers 303. This example shows a router with five ports, such as the one in a 2D mesh NoC. Ports of the router are connected to adjacent routers through point-to-point links.
One facet of employing Network-on-Chip technology for interconnects in an SoC is the micro-architecture of components of the NoC and the physical design of the whole NoC infrastructure in conjunction with blocks of the SoC. The physical design further encompasses aspects of area, frequency, floor-planning, placement and routing, power and clock distribution, timing closure, etc. Many digital systems in the related art employed fully synchronous designs, where operations in the system are coordinated by a single global clock switching all the sequential elements of the system. For proper operation of such systems, there is a fundamental requirement that a given clock edge arrives at all sequential elements of the system simultaneously. However, this is hard to achieve in practice, and all digital systems exist with finite clock skews which have a bearing on the maximum frequency achievable by the synchronous system. For systems of reasonable size and relatively low frequency of operation, some clock skew is tolerable and has been managed using various physical design techniques. In these cases, a fully synchronous implementation is the preferred approach due to its simplicity and the abundance of mature tools and methodology for silicon implementation.
With rapid Complementary Metal Oxide Semiconductor (CMOS) process scaling and increasing system complexity, more and more functionality is being integrated on a single silicon die. Gate delays have seen significant reduction, but wire delays have not followed the same trend. Hence, even though clock frequencies have increased to keep up with increased performance requirements, the metal wiring used to distribute clocks and signals on chip has not seen comparable improvement. Routing delays and clock skew now constitute a significant percentage of the clock cycle time. Skew balanced distribution of a global clock to the massive number of sequential elements on a large silicon die has become largely impractical and prohibitively expensive in terms of area and power consumption. Further, the large number of heterogeneous components on a die also means that they require different operating frequencies and independent clock on/off control for better power management.
The trend has been toward globally asynchronous locally synchronous (GALS) systems. A basic schematic of such a system is illustrated in FIG. 4. Here, large blocks 401 form local islands of fully synchronous designs, with different blocks of the system operating asynchronously to each other. The interconnection network 402 of the system handles the synchronization of communications among the GALS blocks. This allows skew balanced clock distribution to be confined to the relatively smaller area of each block 401.