The present invention relates to the field of massively parallel processing systems, and more particularly to the interconnection among processing elements and between processing elements and memory in a single chip massively parallel processor.
The fundamental architecture used by all personal computers (PCs) and workstations is generally known as the von Neumann architecture, illustrated in block diagram form in FIG. 1. In the von Neumann architecture, a main central processing unit (CPU) 10 is coupled via a system bus 11 to a memory 12. The memory 12, referred to herein as "main memory", also contains the data on which the CPU 10 operates. In modern computer systems, a hierarchy of cache memories is usually built into the system to reduce the amount of traffic between the CPU 10 and the main memory 12.
The von Neumann approach is adequate for low to medium performance applications, particularly when some system functions can be accelerated by special purpose hardware (e.g., 3D graphics accelerator, digital signal processor (DSP), video encoder or decoder, audio or music processor, etc.). However, the approach of adding accelerator hardware is limited by the bandwidth of the link from the CPU/memory part of the system to the accelerator. The approach may be further limited if the bandwidth is shared by more than one accelerator. Thus, the processing demands of large data sets, such as those commonly associated with large images, are not served well by the von Neumann architecture. Similarly, as the processing becomes more complex and the data larger, the processing demands will not be met even with the conventional accelerator approach.
It should be noted, however, that the von Neumann architecture has some advantages. For example, the architecture contains a homogeneous memory structure allowing large memories to be built from many smaller standard units. In addition, because the processing is centralized, it does not matter where the data (or program) resides in the memory. Finally, the linear execution model is easy to control and exploit. Today's operating systems control the allocation of system memory and other resources using these properties. The problem is how to improve processing performance in a conventional operating system environment where multiple applications share and partition the system resources, and in particular, the main memory.
One solution is to utilize active memory devices, as illustrated in FIG. 2, in the computer system. Put simply, active memory is memory that can do more than store data; it can process it too. To the CPU 10 the active memory 15 looks normal except that it can be told to do something with the data contents without the data being transferred to the CPU or another part of the system (via the system bus 11). This is achieved by distributing an array 14 of processing elements (PEs) 200 throughout the memory structure, which can all operate on their own local pieces of memory in parallel. The array 14 of PEs 200 is coupled to the memory 12 via a high speed connection network 13. In addition, PEs 200 of the array 14 can communicate with each other. Thus, active memory encourages a somewhat different view of the computer architecture, i.e., "memory centered" or viewed from the data rather than the processor.
In a computer system having active memory, such as illustrated in FIG. 2, the work of the CPU 10 is reduced to the operating system tasks, such as scheduling processes and allocating system resources and time. Most of the data processing is performed within the memory 15. By having a very large number of connections between the main memory 12 and the processing resources, i.e., the array 14 of PEs 200, the bandwidth for moving data in and out of memory 12 is greatly increased. A large number of parallel processors can be connected to the memory 12 and can operate on their own area of memory independently. Together these two features can provide very high performance.
There are several different topologies for parallel processors. One example topology is commonly referred to as SIMD (single instruction, multiple data). The SIMD topology contains many processors, all executing the same stream of instructions simultaneously, but on their own (locally stored) data. The active memory approach is typified by SIMD massively parallel processor (MPP) architectures. In the SIMD MPP, a very large number (for example, one thousand) of relatively simple PEs 200 are closely connected to a memory and organized so that each PE 200 has access to its own piece of memory. All of the PEs 200 execute the same instruction together, but on different data.
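The SIMD execution model described above can be illustrated with a minimal software sketch. It is an illustration only, not the hardware of the invention: the names `simd_step` and `add_first_two`, the choice of four PEs, and the three-word local memories are hypothetical, and the loop merely simulates sequentially what the PEs would do in lockstep.

```python
from typing import Callable, List

# Hypothetical model: each PE owns one small list standing in for its
# local piece of memory. None of these names come from the specification.
LocalMemory = List[int]


def simd_step(local_memories: List[LocalMemory],
              instruction: Callable[[LocalMemory], None]) -> None:
    """Apply the same instruction to every PE's local memory.

    Real PEs would execute in lockstep; this loop only simulates that.
    """
    for memory in local_memories:
        instruction(memory)


def add_first_two(memory: LocalMemory) -> None:
    """One 'instruction': store the sum of words 0 and 1 in word 2."""
    memory[2] = memory[0] + memory[1]


# Four PEs, each holding different data in its own local memory.
local_memories = [[pe, 10 * pe, 0] for pe in range(4)]

# A single instruction is issued once and acts on all local data sets.
simd_step(local_memories, add_first_two)

results = [memory[2] for memory in local_memories]
print(results)
```

The point of the sketch is that the instruction stream is issued once while the data differs per PE, which is why the control overhead stays small as the number of PEs grows.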
The SIMD MPP has the advantage that the control overheads of the system are kept to a minimum, while maximizing the processing and memory access bandwidths. SIMD MPPs, therefore, have the potential to provide very high performance very efficiently. Moreover, the hardware consists of many fairly simple repeating elements. Since the PEs 200 are quite small in comparison to a reduced instruction set computer (RISC), they are easy to implement into a system design and their benefit with respect to optimization is multiplied by the number of processing elements. In addition, because the PEs 200 are simple, it is possible to clock them fast and without resorting to deep pipelines.
In a massively parallel processor array, the design of the interconnections among the processing elements and the interconnections between the PEs 200 and the memory 12 is an important feature. Traditional massively parallel processors utilize a plurality of semiconductor chips for the processor element array 14 and the memory 12. The chips are connected via a simple network of wires. However, as shown in FIG. 3, advances in semiconductor technology now permit a SIMD massively parallel processor with a memory to be integrated onto a single active memory chip 100. Since signals which are routed within a semiconductor chip can travel significantly faster than inter-chip signals, the single chip active memory 100 has the potential of operating significantly faster than a prior art SIMD MPP. However, achieving high speed operation requires more than merely integrating the elements of a traditional prior art SIMD MPP into one active memory chip 100. For example, careful consideration must be given to the way the PEs 200 of the PE array 14 are wired together, since this affects the length of the interconnections between the PEs 200 (thereby affecting device speed), the mapping of the memory as seen by the PEs 200, the power consumed to drive the interconnection network, and the cost of the active memory chip 100. Accordingly, there is a desire and need for an affordable high speed SIMD MPP active memory chip with an optimized interconnection arrangement between the PEs.
In one aspect, the present invention is directed to a single chip active memory with a SIMD MPP. The active memory chip contains a full word interface, a memory in the form of a plurality of memory stripes, and a PE array in the form of a plurality of PE sub-arrays. The memory stripes are arranged between and coupled to both the plurality of PE sub-arrays and the full word interface. Each PE sub-array is coupled to the full word interface and a corresponding memory stripe. In order to route the numerous couplings between a memory stripe and its corresponding PE sub-array, the PE sub-array is placed so that its data path is orthogonal to the orientation of the memory stripes. The data lines of the PE sub-arrays are formed on one metal layer and coupled to the memory stripe data lines which are formed on a different metal layer having an orthogonal orientation.
In another aspect of the present invention, the PEs each contain a small register file constructed as a small DRAM array. Small DRAM arrays are sufficiently fast to serve as a register file and utilize less power and semiconductor real estate than traditional SRAM register files.
In another aspect of the invention, the PE array of the active memory chip is formed by coupling the plurality of PE sub-arrays into a single logical array in accordance with a mapping technique. The mapping techniques of the invention include mapping each PE sub-array into the logical array as a row (optionally with row interleaving), a rectangular region, or a column. Each PE of the logical array is coupled to four other PEs along its (logical) north, south, east, and west axes. PEs which are located at the corners or along the edges of the logical array have couplings along their exterior edges which wrap around the array to opposite corner and edge PEs, respectively. Depending on the mapping, some PEs may be coupled to other PEs which are (physically) distant, and the present invention uses current mode differential logical couplings and drivers for its long distance PE-to-PE couplings.
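The four-neighbor coupling with wrap-around can be sketched in software as follows. This is a minimal illustration under stated assumptions, not the circuit of the invention: it assumes a simple wrap on each axis (a torus), assumes the row mapping without interleaving so that logical row r is supplied by PE sub-array r, and uses hypothetical names (`torus_neighbors`, `crosses_subarray`) and a hypothetical 4 by 8 array size.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int]  # (logical row, logical column)


def torus_neighbors(row: int, col: int, rows: int, cols: int) -> Dict[str, Coord]:
    """Return the logical north, south, east and west neighbors of a PE.

    Assumption: edge and corner PEs wrap to the opposite side of the
    array on each axis, which modular arithmetic expresses directly.
    """
    return {
        "north": ((row - 1) % rows, col),
        "south": ((row + 1) % rows, col),
        "east": (row, (col + 1) % cols),
        "west": (row, (col - 1) % cols),
    }


def crosses_subarray(pe: Coord, neighbor: Coord) -> bool:
    """Under the assumed row mapping (logical row r == sub-array r),
    a coupling leaves its sub-array whenever the row changes."""
    return pe[0] != neighbor[0]


# Hypothetical array: 4 sub-arrays of 8 PEs mapped as 4 logical rows.
ROWS, COLS = 4, 8
corner = (0, 0)
neighbors = torus_neighbors(*corner, ROWS, COLS)
crossing = {name: crosses_subarray(corner, pe) for name, pe in neighbors.items()}
print(neighbors)
print(crossing)
```

Under these assumptions the east and west couplings stay within one sub-array while the north and south couplings connect different sub-arrays, which suggests why the choice of mapping determines which couplings become physically long.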