1. Field of the Invention
The present invention relates to routing techniques.
The invention has been developed with particular attention paid to its possible application to systems on chip (SoCs) and, more specifically, relates to the so-called networks on chip (NoCs), above all in relation to the known NoC topology referred to as “Fat-Tree topology”. NoCs replace traditional buses and, as compared to buses, afford characteristics of modularity and the possibility of reuse. In particular, NoCs comprise switches, and hence active nodes, which are designed for connecting the macroblocks (such as, for example, microprocessors, memories, or the like).
Reference to this preferred context of application of the invention must not, however, be interpreted as in any way limiting the scope of the present invention.
2. Description of the Related Art
The appearance of networks on chip results from the evolution of systems on chip. The latter, following Moore's law, have reached a complexity such as to require a substantial rethinking of the on-chip connection infrastructures, which up to now have been represented for the most part by bus architectures. A factor that has contributed to making the situation more critical, as well as to increasing the complexity of chips, is the marked miniaturization of transistors. The dimensions reached have made it increasingly less advantageous to drive signals over metal paths of considerable length, both as regards the integrity of the signals and as regards the excessive energy expenditure. For this reason, a direct connection between computational modules, as occurs in the case of the bus, appears in many cases unproductive and risky.
The reasons that have pushed in the direction of a solution of the network-on-chip type and that have favored development thereof can basically be reduced to two major categories of factors:
the performance required by new digital systems; and
the impact that the evolution of digital systems has had on the productivity of the firms operating in the sector.
More in general, the history of networks on chip has its roots in networks of processors. These have in fact been taken as main reference for the development of networks on chip thanks to the strong analogy existing between the two realities.
The main contributions coming from the study of networks of processors (see, for example, Russ Miller, Quentin F. Stout: Algorithmic Techniques for Networks of Processors, CRC Handbook of Algorithms and Theory of Computation, 1998, pp. 46:1-46:19) relate prevalently to algorithms for optimization of performance in routing of packets and to the study of topologies using graph theory.
In the first case, most of the taxonomy regarding protocols (Worm-Hole protocol and Store-and-Forward protocol) and the algorithms for routing packets has been inherited (see, for example, Christian Scheideler, Universal Routing Strategies for Interconnection Networks, LNCS1390, Springer 1998), whilst, in the second case, studies conducted on particular hierarchical topologies called Fat-Tree topologies have been of considerable help (see, for example, C. Leiserson, Fat-Tree: Universal Networks for Hardware-Efficient Supercomputing, IEEE Transactions on Computers, vol. C-34, No. 10, pp. 892-901, October 1985), these topologies being capable of guaranteeing high levels of throughput with a limited number of switches and connections.
Other significant results regard the study of topologies and algorithms which prevent the formation of deadlocks in the network (see, for example, William J. Dally, Charles L. Seitz, Deadlock-Free Routing in Multiprocessor Interconnection Networks, 1985); from this latter study, it emerges that acyclical topologies, such as the Fat-Tree type, are less subject to deadlock situations.
The topology of a Fat-Tree type appears to be amongst the most favored for an implementation on silicon, in so far as it enables excellent levels of performance to be achieved with a contained number of switches (see, for example, Fabrizio Petrini, Marco Vanneschi: “k-ary n-trees: High Performance Networks for Massively Parallel Architectures”, 11th International Parallel Processing Symposium, Vol. 1, Geneva, Switzerland 1997).
As may be noted in FIG. 1, a network of this type is formed by n levels of switches SW. The computational modules or processes P are connected to the lowest level of the network and constitute the leaves of the tree. The highest level is constituted by switches, each having k connections with k switches of the underlying level. The switches of the other levels have k connections with both of the adjacent levels. Each connection between the switches is constituted by two one-directional connections L1, L2 that transport the packets in opposite directions in such a way that one switch SW or one process P can send or receive packets along one and the same connection. Each connection is constituted by one part dedicated to data transportation, with parallelism p, and one part corresponding to the control signals. The maximum number of processes that can be connected to the network is N = k^n, and the processes are constituted by any module IP that is able to generate and/or acquire packets (DSPs, processors, memories, external network interface modules, DACs, etc.), whilst the number of switches is S = n·k^(n−1). Set between the processes and the switches are interfaces IF, which have the task of adapting the protocol used by the processes to that of the network in such a way that the packets sent by a source process can be correctly routed as far as the destination process P.
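By way of illustration only, the two sizing relations for a Fat-Tree with n levels and k connections per switch toward each adjacent level (N = k^n processes at most, S = n·k^(n−1) switches) can be evaluated with the following minimal sketch. The function name fat_tree_size is hypothetical and is introduced here purely for the example.

```python
def fat_tree_size(k: int, n: int) -> tuple[int, int]:
    """Return (N, S) for a Fat-Tree with n levels and arity k.

    N = k**n            maximum number of processes (leaves)
    S = n * k**(n - 1)  number of switches
    """
    if k < 1 or n < 1:
        raise ValueError("k and n must be positive integers")
    max_processes = k ** n
    switches = n * k ** (n - 1)
    return max_processes, switches


print(fat_tree_size(2, 3))  # (8, 12)
print(fat_tree_size(4, 2))  # (16, 8)
```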
In order for the routing procedure to be effective, it is preferable that both the switches SW and the interfaces IF (and hence the respective processes P) should have a unique identification number (ID) containing the information on the position in which they are located.
FIG. 2 represents an example of assignment method described in: Fabrizio Petrini, Marco Vanneschi, “k-ary n-trees: High Performance Networks for Massively Parallel Architectures”, cited previously.
Basically, each processor is defined by an n-tuple of numbers ranging from 0 to (k−1), whilst each switch is defined by an ordered pair <w, l>, where w is formed by n−1 numbers ranging from 0 to k−1, and l is a number ranging from 0 to n−1.
Two switches <w_0, w_1, . . . , w_{n−2}, l> and <w_0′, w_1′, . . . , w_{n−2}′, l′> are connected if and only if l′ = l+1 and w_i = w_i′ for every i ≠ l.
The switch <w_0, w_1, . . . , w_{n−2}, n−1> and the processor p_0, p_1, . . . , p_{n−1} are connected if and only if w_i = p_i for every i belonging to the set {0, 1, . . . , n−2}.
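The two connection rules just stated can be checked mechanically. The following is a minimal sketch, not taken from the cited paper: a switch is represented as a pair (w, l) with w a tuple of n−1 digits, a processor as a tuple of n digits, and the function names are hypothetical. The switch rule is applied regardless of the order in which the two switches are given.

```python
from itertools import product


def switches_connected(sw_a, sw_b):
    """True if the two switches are linked: levels differ by one and all
    digits of w agree except, possibly, the digit indexed by the upper level."""
    (w_up, l_up), (w_down, l_down) = sorted([sw_a, sw_b], key=lambda s: s[1])
    if l_down != l_up + 1:
        return False
    return all(a == b for i, (a, b) in enumerate(zip(w_up, w_down)) if i != l_up)


def switch_processor_connected(switch, processor):
    """True if the switch is at level n-1 and w equals the first n-1 digits of p."""
    w, level = switch
    n = len(processor)
    return level == n - 1 and tuple(w) == tuple(processor[: n - 1])


# Example with k = 2, n = 3 (assumed values for illustration).
k, n = 2, 3
top = ((0, 0), 0)
below = [w for w in product(range(k), repeat=n - 1)
         if switches_connected(top, (w, 1))]
print(below)                                                # [(0, 0), (1, 0)]
print(switch_processor_connected(((0, 1), 2), (0, 1, 1)))   # True
```

In this example the top-level switch has exactly k neighbours on the level below, in agreement with the description of FIG. 1.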
The modality of routing of the packets can vary according to the protocol and the algorithms used. There basically exist two types of routing protocols for networks on chip, from which there derive others, namely, the Store-and-Forward protocol and the Worm-Hole protocol (see, for example: Christian Scheideler, Universal Routing Strategies for Interconnection Networks, LNCS1390, Springer 1998; P. Guerrier, A. Greiner, A Generic Architecture for On Chip Packet-Switched Interconnections, DATE2000; and E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, E. Waterlander, Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, DATE2003).
In both cases, the packets must be stored entirely within the switches before being transmitted onto the next connection. In the Store-and-Forward case, each packet is routed irrespective of the others since the necessary information is contained in the header of each packet. In the Worm-Hole case, the packets are grouped into messages, and only the first packet of each message contains the header with the information for routing. In this case, a switch, after having transmitted the first packet of a message along a connection, reserves that connection for all the other packets of the same message in such a way that they can follow the same path as the first one, without interposition of any packet extraneous to the current message. After the last packet has been transmitted, the switch de-allocates the connection so as to render it available for another message.
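The reservation behavior described for the Worm-Hole case can be sketched as follows. This is an illustrative model only: the class OutputConnection and its method try_send are hypothetical names, and the sketch tracks nothing but which message currently holds the connection.

```python
class OutputConnection:
    """Models reservation of one output connection by a Worm-Hole message."""

    def __init__(self):
        self.owner = None  # identifier of the message holding the connection

    def try_send(self, message_id, is_last):
        """Attempt to send one packet of message_id on this connection.

        Returns False if the connection is reserved for a different message.
        The first packet reserves the connection; the last packet releases it.
        """
        if self.owner is not None and self.owner != message_id:
            return False
        self.owner = None if is_last else message_id
        return True


link = OutputConnection()
print(link.try_send("A", is_last=False))  # True: first packet of A reserves
print(link.try_send("B", is_last=False))  # False: B must wait
print(link.try_send("A", is_last=True))   # True: last packet of A releases
print(link.try_send("B", is_last=True))   # True: connection is free again
```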
When two or more packets present at the inputs of a switch contend for one and the same output connection, they are said to collide. The task of the routing algorithms, in particular of the routing scheme, is that of choosing, for each packet, the path that will reduce the possibility of collisions to the minimum, maximizing the routing speed.
The routing schemes can be divided into non-adaptive ones and adaptive ones (for a more complete treatment see, for example, Christian Scheideler, Universal Routing Strategies for Interconnection Networks, LNCS1390, Springer 1998, cited previously). In the first case, the path of a packet is decided only on the basis of the source and destination (A. Radulescu, K. G. W. Goossens, Communication Services for Networks on Chip, SAMOS, vol. II, pp. 275-299). In this way, all the packets coming from one and the same source arrive at destination in order, having followed the same path, thus rendering unnecessary for the interfaces the task of re-ordering the packets (see once again A. Radulescu, K. G. W. Goossens, Communication Services for Networks on Chip, SAMOS, vol. II, pp. 275-299). In the second case, the path will be adaptable to the different traffic conditions of the network (see once again: Fabrizio Petrini, Marco Vanneschi, k-ary n-trees: High Performance Networks for Massively Parallel Architectures), enabling optimization of the distribution of the packets.
Given that the number of connections between two levels of network hierarchy corresponds to the maximum number of processes N, the probability of collisions between packets is high even in conditions of moderate traffic. For this reason, it is appropriate for each packet to occupy, along its path, the smallest possible number of connections, i.e., to choose the minimum path between the source and the destination so as to minimize the likelihood of collision. Since the Fat-Tree is a hierarchical topology, the minimum path will be represented by an ascending stretch, which will bring the packet up the hierarchy as far as the “root switch” (common to the source and to the destination), and a descending stretch towards the destination. Given that there can exist more than one root switch common to a source and to a destination, there may exist a number of paths that will lead a packet from the source to one of these switches. Once it has arrived at the root switch, there, however, exists only one path which links the packet to the destination.
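As an illustration of the minimum path, the following sketch derives, from the identification scheme of FIG. 2, the level of the common root switches, the number of connections crossed, and the number of candidate root switches. It rests on assumptions that should be read as such: level 0 is taken as the highest level and level n−1 as the lowest, processors are n-tuples of digits, and the result follows from applying the switch connection rule level by level rather than from any formula given in the text. The function names are hypothetical.

```python
def common_prefix_length(src, dst):
    """Number of leading digits (among the first n-1) shared by src and dst."""
    n = len(src)
    m = 0
    while m < n - 1 and src[m] == dst[m]:
        m += 1
    return m


def minimal_path(src, dst, k):
    """Return (root_level, connections, root_candidates) for distinct src, dst.

    root_level      level at which the ascending stretch ends
    connections     connections crossed going up and then down
    root_candidates number of switches at root_level usable as root
    """
    n = len(src)
    root_level = common_prefix_length(src, dst)
    connections = 2 * (n - root_level)
    root_candidates = k ** (n - 1 - root_level)
    return root_level, connections, root_candidates


# Example with k = 2, n = 3 (assumed values for illustration).
print(minimal_path((0, 0, 0), (0, 0, 1), 2))  # (2, 2, 1)
print(minimal_path((0, 0, 0), (0, 1, 0), 2))  # (1, 4, 2)
print(minimal_path((0, 0, 0), (1, 1, 1), 2))  # (0, 6, 4)
```

The third case shows why several ascending paths may exist: all four top-level switches are common to the source and the destination, whereas the descending stretch from any one of them is unique.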
FIGS. 3 and 4 show the behavior of an adaptive routing scheme and of a non-adaptive routing scheme, respectively. In both figures, the references SW, IF and P designate, as in the preceding figures, the switches, the interfaces, and the processes, respectively (with the distinction, in the case of the latter between source S and destination D). The arrows facing upwards indicate ascending paths and the arrows facing downwards indicate descending paths.
The switch represents the active component of the network. As illustrated in FIG. 5, it is constituted by:
2*k input ports 10, which accept the packets and store them in dedicated buffers 10a;
2*k output ports 20, which function as temporary-memory locations (buffers 20a) for the packets in the transmission step;
control logics for the input ports and output ports, the input logic being designated by 30 and the output logic by 40; these handle acquisition, routing, and transmission of the packets;
a first crossbar 60 for the data, which connects each input port to all the output ports; and
a second crossbar 70 for the control lines, which connects the input control logic with the output one.
The control lines of each connection are constituted by a write line and a ready line (in the case of the Store-and-Forward protocol). The write line is driven by the port of the switch that transmits the packet (output port) and is read by the port of the switch that receives the packet (input port), whilst for the ready line the opposite applies.
Prior to acquisition of a packet, the output port that wants to transmit a packet checks whether the ready line is active; if it is, it activates the write line and, in the next clock cycles, transmits the packet on the data lines. When the input port starts to receive the packet, it disables the ready signal, which will remain disabled until the packet is transmitted to another switch. At the end of the transmission of the packet, the output port disables the write signal. When the packet is received by an input port, it is stored in the corresponding buffer. The buffer of each input port behaves as a queue (FIFO). If each packet is assumed as having a size of h bits (multiple of p), then each buffer will be characterized by a parallelism p equal to the parallelism of the data lines and a depth d=h/p, which will indicate also the time (in terms of clock cycles) used for storing a packet and the time for re-transmitting it.
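The relation d = h/p and the behavior of the ready line can be sketched as follows. This is a simplified model under stated assumptions: the class InputPort and its methods are hypothetical, the write line is not modelled, and one word of p bits is taken to be transferred per clock cycle.

```python
def buffer_depth(packet_bits, parallelism):
    """Depth d = h / p of an input buffer, for h a multiple of p."""
    if packet_bits % parallelism:
        raise ValueError("packet size must be a multiple of the parallelism")
    return packet_bits // parallelism


class InputPort:
    """Input port holding one packet in a FIFO of depth d."""

    def __init__(self, packet_bits, parallelism):
        self.depth = buffer_depth(packet_bits, parallelism)
        self.fifo = []
        self.ready = True  # ready line seen by the upstream output port

    def accept(self, words):
        """Store one packet; return the clock cycles spent, or None if refused."""
        if not self.ready or len(words) != self.depth:
            return None
        self.ready = False  # stays disabled until the packet is forwarded
        self.fifo = list(words)
        return self.depth

    def drain(self):
        """Forward the stored packet in FIFO order and re-enable ready."""
        words, self.fifo = self.fifo, []
        self.ready = True
        return words


port = InputPort(packet_bits=128, parallelism=32)
print(port.depth)                 # 4
print(port.accept([1, 2, 3, 4]))  # 4 cycles to store the packet
print(port.accept([5, 6, 7, 8]))  # None: ready line is disabled
print(port.drain())               # [1, 2, 3, 4]
print(port.ready)                 # True
```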
As represented schematically in FIG. 6, which regards the mechanism of internal scheduling, during the step of acquisition (FIG. 6a) the control logic 30 corresponding to the input port in question reads the header of the packet and, on the basis of the routing scheme implemented, signals to a given output port the intention to transmit the packet via a request signal.
The request signal is transmitted through the crossbar 70 of the control signals. At this point, the control logic 40 corresponding to the output port decides, on the basis of a scheduling algorithm, whether or not to grant permission to transmit the packet.
If it does (FIG. 6b), the logic 40 issues a grant signal (once again through the crossbar of the control signals 70) to the control logic of the input port in question. At this point, if the switch or the interface downstream is available for reception, the packet is transmitted through the output port selected, passing through the data crossbar. Further details on the mechanism may be inferred from: E. Rijpkema, K. G. W. Goossens, A. Radulescu, J. Dielissen, J. van Meerbergen, P. Wielage, E. Waterlander, Trade Offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, DATE2003 and E. Rijpkema, K. G. W. Goossens, P. Wielage, A Router Architecture for Networks on Silicon, PROCEEDINGS OF PROGRESS 2001.
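The request and grant exchange between the input logic 30 and the output logic 40 can be illustrated with the sketch below. The scheduling algorithm is not specified in the foregoing description; a round-robin policy is assumed here purely as an example, and the class OutputArbiter is a hypothetical name.

```python
class OutputArbiter:
    """Output-port control logic granting one of several requesting inputs.

    Round-robin priority is an assumption made for this sketch only.
    """

    def __init__(self, num_inputs):
        self.num_inputs = num_inputs
        self.next_priority = 0  # input index examined first at the next grant

    def grant(self, requests):
        """Return the input index that receives the grant, or None.

        requests is the set of input indices currently asserting a request.
        """
        for offset in range(self.num_inputs):
            candidate = (self.next_priority + offset) % self.num_inputs
            if candidate in requests:
                self.next_priority = (candidate + 1) % self.num_inputs
                return candidate
        return None


arbiter = OutputArbiter(num_inputs=4)
print(arbiter.grant({1, 3}))  # 1
print(arbiter.grant({1, 3}))  # 3
print(arbiter.grant({1, 3}))  # 1
print(arbiter.grant(set()))   # None
```

With this policy, two input ports that persistently request the same output port are served alternately, so neither is starved; other policies would fit the same request and grant interface.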
Summing up what has been said previously, unlike bus architectures, the network on chip (NoC) is based upon the transmission of packets, as in the case of networks of processors. It is constituted by switches, which have the task of routing the packets, and by physical connections between switches, which represent the medium through which the packets are routed. Given that only some connections are involved in routing of a packet, in the case of the network on chip there is a considerable saving in terms of energy per bit transmitted if compared to the case of the bus. In fact, in this latter case, all the transmissions occur via broadcast even though the source and destination of the data flow are close to each other.
The use of brief connections in the network on chip makes it moreover possible to preserve better the integrity of the signals, since they are less subject to disturbance.
A further advantage lies in the fact that it is possible to think of the resource “network on chip” as being formed by a variable number of resources (the physical connections between switches), which can be allocated in an independent way. This enables parallel handling of different routing requests, unlike the bus solution, in which the bus itself constitutes an allocable resource only as a whole and hence is available only for transmissions of data in series. Given that the number of connections available in a network on chip, and hence its capacity for parallelizing transmissions, can vary according to the design requirements, the network on chip is defined as “scalable”, unlike the bus network, which is “non-scalable”.
The characteristic of high level of parallelism enables a considerable increase in the throughput and a reduction in the time that a source must wait before being able to transmit the data to the destination.
An aspect of increasing importance in the design and construction of systems on chip (SoCs) is the widening gap between the possibilities, in terms of performance and complexity, provided by the new technologies and their manageability. The increasing number of gates that can be housed on a single chip poses in fact the problem of handling, with times compatible with those of the market, designs of exponentially increasing complexity. Given that the problem cannot be addressed by increasing exponentially the size of the project teams, it is preferable to break up the complexity of the problem into a hierarchy of sub-problems. To do this, it is preferable that each sub-problem can be handled independently of the others. The result is that of generating for each sub-problem a module (IP) in turn constituted by other modules.
The reuse of the existing modules hence becomes an important characteristic of this approach. The advantages linked to this choice are:
contained times for design, since the design of the SoC is articulated on different levels, and at each level the designer must at the most assemble the modules present at a lower hierarchical level;
higher reliability, in so far as each module is tested separately and hence in a more exhaustive manner and with less effort;
greater predictability, on account of the fact that the modular structure of the SoC reduces the degrees of freedom of the system, which hence results in being a combination of a limited number of well-known functions; and
greater reusability, since the modules created can be reused in other SoC contexts, so reducing drastically the design times and costs.
The disadvantage of a set-up of the above sort is the loss in terms of optimality of the result, given that intervention at each level of the design cannot extend to intervention on the actual modules employed. The network on chip fits perfectly within such a context. In fact, it tends to replace the bus, which is the fruit of a design methodology that meets requirements different from current ones, creating in its stead an IP block. This is reasonable if it is considered that currently the problems generated by interconnections are in many cases comparable to, if indeed they do not prevail over, the ones linked to computational modules. For this reason, it can be of considerable help to create libraries of IP modules that concern also the infrastructures of communication of the SoC. These NoC libraries may be set alongside the traditional ones, so contributing to guaranteeing high levels of performance and shorter production times with more contained costs.