The present disclosure considers the context of inter-cluster communication in a multi-core System-on-Chip (SoC) where the clusters are the processing cores (including their local L1 memories) and the shared higher-layer memories on the SoC.
Current SoCs contain many different processing cores that communicate with each other and with the many distributed memories in the layered background memory organization through an intra- and inter-tile communication network. Tiles are formed by a group of tightly connected cores (processors), i.e. cores between which the communication activity exceeds a certain threshold level. One important design feature of SoCs relates to the length of the interconnections between the clusters. State-of-the-art solutions have relatively long connections that need to be powered up and down nearly continuously, reaching from the ports of the data producers/consumers (inside the tiles or between different tiles) up to the ports of the communication switches. Present-day SoC inter-tile communication networks are based on different types of buses (shared or not) and networks-on-chip (NoC).
An application field is that of neuromorphic systems. Neuromorphic systems, also referred to as artificial neural networks, are computational systems configured so that the electronic systems in which they are provided can function in a manner that imitates, to some extent, the behavior of networks of biological neurons. Neuromorphic computation does not generally utilize the traditional digital model of manipulating zeros and ones. In order to allow communication between potentially strongly connected neurons in such neural networks, connections are created between processing elements that are roughly functionally equivalent to the neurons of a biological brain. Neuromorphic computation may comprise various electronic circuits that are modelled on biological neurons and synapses. Typically, multi-layer neural networks are used, with one or more hidden layers (or, more generally, intermediate layers if non-standard neural network topologies are used). Some well-known examples include perceptrons, convolutional neural networks (CNNs), asynchronous conceptors, restricted Boltzmann machines (RBMs) and deep-learning neural networks (DNNs). In all of these, synaptic plasticity/adaptation is crucial. They can use synchronous or asynchronous signaling protocols. Because of the strong resemblance to how a human brain works, the asynchronous spike-timing-dependent plasticity (STDP) spiking protocol is very popular in the neuromorphic community. Neuron layers should in principle be “fully” connected with one another to allow full connection flexibility, which leads to a densely connected neural array, for example with N1×N1 synapses for N1 neurons in the input layer and N1 neurons in the output layer of the stage. However, typically at least one so-called hidden neuron layer with K1 neurons is also present. In general the topology can then be N1×K1×M1, as shown in FIG. 1, when M1 neurons are present in the output layer.
Across stages the neuron layers also need to communicate, but not all connections then need to be present, so no full crossbar is needed any more. That is partly the case already between the input and output layer when a hidden layer is present (see FIG. 1, where not all possible connections are realizable between the N1 input neurons and M1 output neurons, except when K1=N1×M1, which would lead to too much cost overhead). That is especially so when several clusters of densely connected neural arrays (in the literature also referred to as stages) communicate with each other (see FIG. 2, where two clusters are shown with their mutual L1 connections, which form a subset of all possible connections). However, at fabrication time it is not known which specific neuron connections (which of the L1 connections in FIG. 2) will be needed later. Moreover, building a neuromorphic system comprising only one single dense neural cluster clearly does not scale in terms of connections. Hence, there clearly is a big challenge in finding the best global synapse connection approach across the clusters, supported by an appropriately matched communication network architecture.
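The synapse-count arithmetic above can be made concrete with a brief illustrative sketch (not part of the disclosure; the function names and the example layer sizes are assumptions chosen for illustration). It compares a direct N1×M1 crossbar with the factorized N1×K1×M1 topology, and shows why K1=N1×M1, the size at which the hidden layer could realize every input-output connection, is prohibitively costly:

```python
# Illustrative sketch: synapse counts for a dense crossbar versus a
# factorized topology with one hidden layer (N1 -> K1 -> M1).

def crossbar_synapses(n1, m1):
    # Dense N1 x M1 crossbar: every input neuron connects to every output neuron.
    return n1 * m1

def hidden_layer_synapses(n1, k1, m1):
    # N1 x K1 x M1 topology: two smaller crossbars in series.
    return n1 * k1 + k1 * m1

# Example sizes (assumed, for illustration only).
N1, M1 = 256, 256
full = crossbar_synapses(N1, M1)               # 65536 synapses
K1 = 32
factored = hidden_layer_synapses(N1, K1, M1)   # 16384 synapses, but not all
                                               # input-output pairs realizable
# Full flexibility through the hidden layer would need K1 = N1 * M1,
# which costs far more than the direct crossbar it replaces:
overhead = hidden_layer_synapses(N1, N1 * M1, M1)  # 33554432 synapses
```

The sketch only counts synapses; it does not model routing or timing, but it makes visible the cost cliff between a modest hidden layer and a fully flexible one.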
Many research projects have initiated and expanded the neuromorphic computing domain. Many of those initiatives focus on one aspect of the system and do not cover the overall scheme. They mostly address the dense local synapse array using traditional SRAMs or emerging non-volatile memories like phase-change memory (PCM), resistive RAM (ReRAM) or STT-MRAM. IBM's TrueNorth initiative and the Human Brain Project address the overall scheme, but they use more conventional technology, namely CMOS logic and SRAM/DRAM memories.
Looking in more detail at the global synapse communication problem as formulated above, there is also a need for scalable solutions that provide broad applicability.
A similar observation can be made for inter-core communication networks in SoCs.
Some alternative approaches to solving the global inter-cluster communication bottleneck with low energy, while still covering a (very) wide application range, are now discussed in more technical detail. Existing solutions can roughly be divided into a number of categories.
A first set of solutions is characterized by restricted connectivity. Rather regular, locally connected architectures are usually used in this approach, similar to systolic arrays. Two main options are available for time-multiplexing: Local Sequential Global Parallel (LSGP) or the opposite, Local Parallel Global Sequential (LPGS). Initially these are formulated for a single stage, but this can be generalized to multiple stages. A main trade-off in these solutions lies in local storage versus bandwidth requirement. Assume N nodes, with √N parallel nodes that are time-multiplexed with a time-multiplexing factor of √N. Then LSGP has N data stored and 4√N transfers, whereas LPGS has √N data stored and 4N transfers. LSGP can provide a better match to the back-end-of-line (BEOL) capacitance and architecture bandwidth bottlenecks. However, this is still not so attractive because the targeted classes of applications/algorithms then have to be (too) heavily restricted. The SpiNNaker project of the University of Manchester, for example, is mostly based on this approach, with heavy time multiplexing restricting global data connections.
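The LSGP/LPGS trade-off quoted above can be tabulated with a small sketch (an illustration under the stated assumptions, not part of the disclosure): N nodes, √N parallel nodes, time-multiplexing factor √N, giving N stored data with 4√N transfers for LSGP versus √N stored data with 4N transfers for LPGS:

```python
import math

# Sketch of the local-storage vs bandwidth trade-off for N nodes with
# sqrt(N) parallel nodes time-multiplexed by a factor sqrt(N).

def lsgp_cost(n):
    # Local Sequential Global Parallel: N data stored, 4*sqrt(N) transfers.
    return {"stored": n, "transfers": 4 * math.isqrt(n)}

def lpgs_cost(n):
    # Local Parallel Global Sequential: sqrt(N) data stored, 4*N transfers.
    return {"stored": math.isqrt(n), "transfers": 4 * n}

# As N grows, LSGP trades storage for a much lower transfer count,
# which is why it better matches bandwidth-limited (BEOL) interconnect.
for n in (64, 4096, 10**6):
    print(n, lsgp_cost(n), lpgs_cost(n))
```

Perfect-square N values are used so that √N is an integer, matching the assumption of √N parallel nodes.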
In a second category of alternative solutions full connectivity is maintained. Both LSGP and LPGS then require N(N−1) ≈ N² data transfers, which is not scalable to brain-like dimensions with at least 10¹⁰ neurons. A human brain realizes a reduction from N² = 10²⁰ potential connections to 10¹⁵ synapses, and even these are still mostly inactive for a large part of the instantiated processing. Some projects nevertheless try to scale up in this way, including strong time-multiplexing. To implement hidden layers more effectively, it is then best to use LPGS, where the highly dynamic global connectivity can be exploited in a flexible, time-multiplexed, software-enabled way. The intra neural cluster connection is more “static”, so it is most suitable to link that to the spatially parallel hardware domain. One then still has to take care that interconnections are not too long, e.g. by limiting the intra-cluster size. This creates a first new subbranch. An alternative new subbranch is obtained if one opts for a more dynamic architectural solution. These two new subbranches are further discussed below. Note, however, that all this is also generalizable to a multi-core SoC requiring a high number of cluster connections with a large data bandwidth.
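The scaling argument above is back-of-the-envelope arithmetic, which a short sketch (illustrative only, not part of the disclosure) makes explicit: at brain scale, full connectivity would demand five orders of magnitude more connections than the brain actually realizes:

```python
# Why full connectivity does not scale to brain-like dimensions:
# with N = 1e10 neurons, N*(N-1) ~ N**2 = 1e20 pairwise connections,
# while a human brain realizes only about 1e15 synapses.
N = 10**10
full_transfers = N * (N - 1)          # ~1e20 transfers per full exchange
brain_synapses = 10**15

# Fraction of all possible neuron pairs that are actually connected.
sparsity = brain_synapses / N**2      # 1e-5, i.e. 0.001% of pairs
reduction = N**2 // brain_synapses    # factor 100000 fewer connections
```

This factor-of-10⁵ sparsity is the headroom that the dynamic, run-time-routed solutions discussed below set out to exploit.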
The first subbranch comprises solutions with static full connectivity. Multi-stage networks have some form of crossbar implementation. These still require a huge area and energy overhead for large N, involving N² transfers. A partial solution exists in power-gating all connections not required during the actual running of an application instance, in this way restricting the overall energy. The same area is still required, however, and consequently a strong energy overhead remains in scaled technology nodes due to the needlessly long lines in the oversized layout. The TrueNorth project uses this approach. However, this solution is still not attractive due to the lack of full scalability and of sufficient parallelism. It requires a huge energy budget, so it is not suited for embedded portable usage, only for “shared servers in the cloud”. Even then it is only suited for server farms with a large power plant, which excludes distributed warehouse servers that have to be plugged into the local power supply.
Solutions in the second subbranch have dynamic full connectivity. They exploit the fact that longer inter-cluster connections are needed more rarely. It is not known upfront where these connections are situated, though, so a run-time layer is needed to accommodate the actual transfers at instantiation time. One way to achieve dynamic full connectivity is to exploit hardware-based control protocols using some type of statically allocated network-on-chip (NoC) or multi-stage network approach. This approach is adopted e.g. in the paper “A Memory-Efficient Routing Method for Large-Scale Spiking Neural Networks” (S. Moradi et al., Eur. Conf. on Circuit Theory and Design (ECCTD) 2013, September 2013, pp. 1-4). A Local Parallel Global Sequential (LPGS) scheme is used there to obtain a parallel implementation of a quite strongly connected “static” intra-cluster organization and a largely sequential (time-multiplexed) implementation of the more sparsely connected, time-varying inter-cluster communication.
Application US2015/058268 (IBM) presents a hierarchical, scalable neuromorphic synaptronic system for synaptic and structural plasticity. However, the obtained scalability is limited: local connections are performed with “sparse crossbar tables”, which does not allow realizing global connections in a fully flexible way. The system is still dimensioned at design time. The proposed solution does not achieve scalability and low power simultaneously.
Hence, there is a need for alleviating the intermediate length interconnection problems encountered in global data communication networks connecting a plurality of computation clusters.