FIG. 1 is a block diagram of a conventional CPU architecture 100 that includes a plurality of processor chips 101-102, a chip-to-chip interconnect 105 and DRAM devices 111-112. Each of the processor chips 101 and 102 includes a plurality of processor cores C01-C0N and C11-C1N, respectively. Each of the processor cores includes a register file and arithmetic logic unit (ALU), a first level cache memory L1, and a second level cache memory L2. Each of the processor chips 101 and 102 also includes a plurality of third level (L3) cache memories 121 and 122, respectively, and cache coherence interconnect logic 131 and 132, respectively.
In general, the first level cache memory L1 allows for fast data access (1-2 cycles), but is relatively small. The second level cache memory L2 exhibits slower data access (5-6 cycles), but is larger than the first level cache memory. Each of the processor cores C01-C0N and C11-C1N has its own dedicated first level cache memory L1 and second level cache memory L2. Each of the processor cores C01-C0N on chip 101 accesses the plurality of level three (L3) cache memories 121 through cache coherence interconnect logic 131. Similarly, each of the processor cores C11-C1N on chip 102 accesses the plurality of level three (L3) cache memories 122 through cache coherence interconnect logic 132. Thus, the plurality of processor cores on each chip share the plurality of level three (L3) cache memories on the same chip.
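For illustration only, the latency consequences of this hierarchy can be sketched with a standard average-memory-access-time (AMAT) calculation. The L1 (1-2 cycle) and L2 (5-6 cycle) latencies come from the description above; the L3 and DRAM latencies and all hit rates are assumed values, not characteristics of any particular architecture.

```python
# Sketch of average memory access time (AMAT) for the cache hierarchy
# described above. L1/L2 latencies are the upper bounds from the text;
# L3/DRAM latencies and the hit rates are illustrative assumptions.

L1_LATENCY = 2      # cycles (from the text)
L2_LATENCY = 6      # cycles (from the text)
L3_LATENCY = 40     # cycles (assumed)
DRAM_LATENCY = 200  # cycles (assumed)

def amat(l1_hit, l2_hit, l3_hit):
    """Average cycles per access for the given per-level hit rates."""
    miss1 = 1.0 - l1_hit
    miss2 = 1.0 - l2_hit
    miss3 = 1.0 - l3_hit
    return (L1_LATENCY
            + miss1 * (L2_LATENCY
                       + miss2 * (L3_LATENCY
                                  + miss3 * DRAM_LATENCY)))

# A workload with good locality is served mostly by L1/L2 ...
print(f"good locality: {amat(0.95, 0.80, 0.80):.1f} cycles")  # 3.1
# ... while a workload with poor locality pays DRAM latency often.
print(f"poor locality: {amat(0.50, 0.30, 0.30):.1f} cycles")  # 68.0
```

The sketch shows why the hierarchy is effective only when most accesses hit in the small, fast levels.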
Each of the processor cores C01-C0N on chip 101 accesses the DRAM 111 through cache coherence interconnect logic 131. Similarly, each of the processor cores C11-C1N on chip 102 accesses the DRAM 112 through cache coherence interconnect logic 132.
Cache coherence interconnect logic 131 ensures that all of the processor cores C01-C0N see the same data at the same entry of the level three (L3) cache 121. Cache coherence interconnect logic 131 resolves any 'multiple writer' problems, wherein more than one of the processor cores C01-C0N attempts to update the data stored by the same entry of the level three (L3) cache 121. Any of the processor cores C01-C0N that wants to change data in the level three (L3) cache 121 must first obtain permission from the cache coherence interconnect logic 131. Obtaining this permission undesirably takes a long time and involves a complicated message exchange. Cache coherence interconnect logic 131 also ensures coherence of the data read from/written to DRAM 111.
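As a software analogy only, the 'multiple writer' problem described above resembles two threads updating the same shared entry: each writer must first obtain exclusive permission before updating. In the sketch below, a lock stands in for the coherence logic granting write ownership; all names are illustrative.

```python
# Software analogy for the 'multiple writer' problem: two workers
# updating the same shared entry must first obtain exclusive permission.
# The lock plays the role of the coherence logic arbitrating writes.
import threading

cache_entry = {"value": 0}          # stands in for one shared cache entry
permission = threading.Lock()       # stands in for coherence arbitration

def writer(increments):
    for _ in range(increments):
        with permission:            # obtain permission before writing
            cache_entry["value"] += 1

threads = [threading.Thread(target=writer, args=(10_000,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(cache_entry["value"])  # 20000 -- every update is serialized
```

Just as acquiring the lock serializes the writers, acquiring write permission from the coherence logic serializes cores, which is the source of the latency noted above.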
Cache coherence interconnect logic 132 similarly ensures coherence of the data stored by the L3 cache 122 and data read from/written to DRAM 112.
Chip-to-chip interconnect logic 105 enables communication between the processor chips 101-102, wherein this logic 105 handles the necessary protocol changes across the chip boundaries.
As illustrated by FIG. 1, conventional CPU architecture 100 implements a plurality of cache levels (L1, L2 and L3) that form a cache hierarchy. Higher level cache memories have a relatively small capacity and a relatively fast access speed (e.g., SRAM), while lower level cache memories have a relatively large capacity and a relatively slow access speed (e.g., DRAM). A cache coherence protocol is required to maintain data coherence across the various cache levels. The cache hierarchy makes it difficult to share data among multiple different processor cores C01-C0N and C11-C1N due to the use of dedicated primary (L1 and L2) caches, multiple accesses controlled by cache coherence policies, and the required traversal of data across different physical networks (e.g., between processor chips 101 and 102).
The cache hierarchy is based on the principle of temporal and spatial locality, whereby lower level caches hold the cache lines displaced from higher level caches, in order to avoid long latency accesses in the case where the data will be accessed again in the future. However, if there is minimal spatial and temporal locality in the data set (as is the case for many neural network data sets), then latency is increased, the number of useful memory locations is reduced, and the number of unnecessary memory accesses is increased.
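The locality argument above can be made concrete with a minimal direct-mapped cache model: a sequential scan reuses each fetched line, while a large-stride scan misses on nearly every access. The cache geometry and strides below are illustrative assumptions only.

```python
# Minimal direct-mapped cache sketch illustrating the locality argument:
# sequential access amortizes each line fetch over the whole line, while
# strided access conflicts on the same cache index and misses every time.

LINE_WORDS = 8     # words per cache line (assumed)
NUM_LINES = 64     # lines in the cache (assumed)

def miss_rate(addresses):
    tags = [None] * NUM_LINES
    misses = 0
    for addr in addresses:
        line = addr // LINE_WORDS
        idx, tag = line % NUM_LINES, line // NUM_LINES
        if tags[idx] != tag:   # miss: fetch the line, evict the old tag
            tags[idx] = tag
            misses += 1
    return misses / len(addresses)

sequential = list(range(4096))                    # good spatial locality
strided = [i * 512 % 4096 for i in range(4096)]   # poor locality

print(f"sequential miss rate: {miss_rate(sequential):.3f}")  # 0.125
print(f"strided miss rate:    {miss_rate(strided):.3f}")     # 1.000
```

With good spatial locality, only one access per line misses (1/8 here); with the strided pattern every access misses, matching the increased latency and unnecessary memory accesses described above.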
The hardware of conventional CPU architectures (such as architecture 100) is optimized for the Shared Memory Programming Model. In this model, multiple compute engines communicate via memory sharing using a cache coherence protocol. However, conventional CPU architectures are not the most efficient way to support a Producer-Consumer execution model, such as that implemented by the forward propagation of a neural network (which exhibits redundant memory read and write operations, as well as long latencies). In a Producer-Consumer execution model, the passing of direct messages from producers to consumers is more efficient. However, there is no hardware support for direct communication among the processor cores C01-C0N and C11-C1N in the Shared Memory Programming Model. The Shared Memory Programming Model instead relies on software to build the message passing programming model.
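As a software illustration of the last point, a Producer-Consumer channel can be built on top of shared memory using a thread-safe queue. The sketch below is a minimal software-constructed message-passing model; the names and the squared-value payload are illustrative only.

```python
# Producer-Consumer message passing built in software on shared memory,
# as described in the text. The queue is the software-constructed
# "direct channel" from the producer to the consumer.
import queue
import threading

channel = queue.Queue()
DONE = object()  # sentinel marking the end of the message stream

def producer(n):
    for i in range(n):
        channel.put(i * i)   # e.g., results forwarded to the next stage
    channel.put(DONE)

def consumer(results):
    while (item := channel.get()) is not DONE:
        results.append(item)

results = []
p = threading.Thread(target=producer, args=(5,))
c = threading.Thread(target=consumer, args=(results,))
p.start(); c.start(); p.join(); c.join()
print(results)  # [0, 1, 4, 9, 16]
```

Note that every `put`/`get` here still traverses the coherent shared memory underneath, which is precisely the overhead that hardware support for direct messages would avoid.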
The communication channels at each level of a conventional CPU architecture 100 optimized for the Shared Memory Programming Model are highly specialized and optimized for the subsystems being served. For example, there are specialized interconnect systems: (1) between the data caches and the ALU/register file, (2) between different levels of caches, (3) to the DRAM channels, and (4) in the chip-to-chip interconnect 105. Each of these interconnect systems operates with its own protocol and at its own speed. Consequently, significant overhead is required to communicate across these channels. This incurs significant inefficiency when trying to speed up tasks that require access to a large amount of data (e.g., a large matrix multiplication that uses a plurality of computing engines to perform the task).
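A back-of-the-envelope sketch of this cross-channel overhead: moving one operand tile for a large matrix multiplication must traverse several specialized channels, each adding its own fixed protocol overhead plus a bandwidth-limited transfer time. All numbers below are illustrative assumptions, not measurements of any particular system.

```python
# Cumulative cost of crossing several specialized interconnects, each
# with its own protocol overhead and speed, as described in the text.
# (channel name, fixed protocol overhead in ns, bandwidth in GB/s) --
# all values are assumed for illustration.
channels = [
    ("DRAM channel",           60.0,  25.0),
    ("cache interconnect",     20.0, 100.0),
    ("register/ALU datapath",   2.0, 500.0),
]

def transfer_ns(bytes_moved):
    """Total time for one tile to traverse every channel in the path."""
    # 1 GB/s == 1 byte/ns, so bytes / (GB/s) yields nanoseconds.
    return sum(overhead + bytes_moved / bandwidth
               for _, overhead, bandwidth in channels)

tile = 64 * 1024  # 64 KiB operand tile (assumed)
print(f"per-tile latency: {transfer_ns(tile):.0f} ns")
```

Because each channel adds both a fixed protocol cost and its own transfer time, the per-tile cost compounds with every boundary crossed, which is the inefficiency the text identifies for multi-engine tasks.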
Crossbar switches and simple ring networks are commonly used to implement the above-described specialized interconnect systems. However, the speed, power efficiency and scalability of these interconnect structures are limited.
As described above, conventional CPU architectures have several inherent limitations in the implementation of neural networks and machine learning applications. It would therefore be desirable to have an improved computing system architecture that is able to more efficiently process data in neural network/machine learning applications. It would further be desirable to have an improved network topology capable of spanning multiple chips, without requiring a cache coherence protocol between the multiple chips. It would further be desirable for such a multi-chip communication system to be easily scalable, capable of providing communication between many different chips.