In recent years, heterogeneous computing has become prominent in an increasing number of application areas. Of note is the use of graphics processing units (GPUs) and other specialized coprocessors in mainstream computational equipment in areas such as video display and gaming, digital signal processing (DSP), image processing, machine learning, big data, high performance computing, network packet processing, and data encryption, among others. These coprocessors are often used to support a homogeneous cluster of central processing units (CPUs) or microprocessing units (MPUs) that functions as a system's main processor.
Many of these heterogeneous coprocessors are implemented using compute arrays, which are parallel computing architectures comprising rows and columns of homogeneous data processing units (DPUs). The benefit is that repeated calculations on partial results can be passed from DPU to DPU and performed entirely within the array, without any need to access external resources such as caches, main memory, buses, and the like. This avoids many of the bottlenecks present in more conventional complex instruction set computing (CISC) or reduced instruction set computing (RISC) architectures.
FIG. 1 illustrates an exemplary and simplified DPU 100 of a type known in the art. DPU 100 comprises a number of value inputs 102, an input multiplexer 104, a value memory 106, a coefficient memory 108, a multiply and accumulate circuit 110, and a value output 112.
DPU 100 is part of an array (not shown) of many DPUs 100 arranged in rows and columns. The value inputs 102 are coupled to a plurality of value outputs 112 in a plurality of the other DPUs 100 in the array. Similarly, value output 112 is coupled to one or more value inputs 102 in other DPUs 100 in the array.
Multiplexer 104 selects among the various value inputs 102 and directs the selected values to value memory 106, where they are stored until needed by multiply and accumulate circuit 110.
Coefficient memory 108 stores a plurality of coefficients to be processed along with the values stored in value memory 106. In exemplary DPU 100, the multiply and accumulate circuit 110 accesses a value from value memory 106 and a coefficient from coefficient memory 108, multiplies them together, and adds the result to the sum of previous multiplications of value-coefficient pairs. Value memory 106 and coefficient memory 108 may, for example, be either random access memories (RAMs) or first in/first out (FIFO) memories. In embodiments employing FIFOs, the loopback connection around coefficient memory 108 may be used for cycling the same coefficients repeatedly through the coefficient memory 108 while new sets of values are continuously passed through the value memory 106 once per data set. The result from multiply and accumulate circuit 110 is then presented to other DPUs 100 in the array through value output 112.
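The behavior described above can be illustrated with a minimal software model. This is only an explanatory sketch, not an actual hardware implementation: the class and method names are assumptions, the memories are modeled as software FIFOs, and the loopback around the coefficient memory is modeled by re-enqueueing each coefficient after use.

```python
from collections import deque

class DPU:
    """Illustrative software model of the DPU 100 of FIG. 1.
    All names here are hypothetical; real DPUs are hardware circuits."""

    def __init__(self, coefficients):
        # Coefficient memory 108 modeled as a FIFO.
        self.coeff_fifo = deque(coefficients)
        # Value memory 106 modeled as a FIFO.
        self.value_fifo = deque()
        # Running sum held by the multiply and accumulate circuit 110.
        self.accumulator = 0

    def load_value(self, value):
        # Input multiplexer 104 directing a selected value into value memory 106.
        self.value_fifo.append(value)

    def step(self):
        # One multiply-and-accumulate cycle: take a value and a coefficient,
        # multiply them, and add the product to the running sum.
        value = self.value_fifo.popleft()
        coeff = self.coeff_fifo.popleft()
        # Loopback connection: cycle the coefficient back through the FIFO
        # so the same coefficient set can be reused for the next data set.
        self.coeff_fifo.append(coeff)
        self.accumulator += value * coeff
        return self.accumulator

# Example: accumulate the dot product of values [1, 2, 3]
# with coefficients [4, 5, 6].
dpu = DPU([4, 5, 6])
for v in [1, 2, 3]:
    dpu.load_value(v)
result = None
for _ in range(3):
    result = dpu.step()
print(result)  # 1*4 + 2*5 + 3*6 = 32
```

Note that after the three steps the coefficient FIFO again holds [4, 5, 6] in order, reflecting how the loopback lets a fixed coefficient set be applied to each new set of values.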
The purpose of the array is to perform a large number of multiply and accumulate operations both in series and in parallel. Each DPU 100 is a relatively small circuit. The number of bits of the values and coefficients, as well as the depths of value memory 106 and coefficient memory 108, are determined by the application and are a matter of design choice. Persons skilled in the art will appreciate that DPU 100 is a very generic compute unit and that many possible compute units performing similar or other operations, both known in the art and yet to be invented, may be combined in similar compute arrays.
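To illustrate the series/parallel operation of such an array, the following sketch models a single row of chained compute units, where each stage multiplies its incoming value by a fixed coefficient, accumulates the product, and forwards its running result to the next stage's value input. This is a simplified, hypothetical model under assumed timing (all stages update once per sample); real arrays would pipeline these operations across many rows and columns.

```python
def run_array(stage_coefficients, samples):
    """Model a row of chained multiply-accumulate stages.

    stage_coefficients: one fixed coefficient per stage (hypothetical setup).
    samples: the stream of input values fed to the first stage.
    Returns the final accumulator held by each stage.
    """
    accumulators = [0] * len(stage_coefficients)
    for x in samples:
        value = x  # value input of the first stage in the row
        for i, coeff in enumerate(stage_coefficients):
            # Multiply and accumulate within this stage.
            accumulators[i] += value * coeff
            # Value output of this stage feeds the next stage's value input,
            # so partial results never leave the array.
            value = accumulators[i]
    return accumulators

# Example: three stages with coefficients 2, 3, 4 processing two samples.
accs = run_array([2, 3, 4], [1, 1])
print(accs)  # [4, 18, 96]
```

The key point the sketch captures is that each partial result is consumed directly by the neighboring stage, with no access to caches, main memory, or buses in the inner loop.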
The ubiquity of data processing devices such as cell phones, tablets, sensors, security and other cameras, Internet of Things (IoT) devices, and other battery-operated equipment makes it highly desirable to have compute arrays that are small, inexpensive, and low in power consumption. In particular, it is desirable to pair compute array DPUs with appropriately sized, low-power, inexpensive memories. Unfortunately, monolithic solutions such as embedded static random access memory (eSRAM) or embedded dynamic random access memory (eDRAM) carry substantial area overhead costs. Using external memory chips is even more expensive, and the external interfaces consume unacceptable power levels for independent, mobile, and other battery-powered devices.