1. Field of the Invention
The present invention relates to the field of computer memory devices and, more particularly, to the connection of a massively parallel processor array to a memory array in a bit serial manner to effect a byte wide data reorganization.
2. Description of the Related Art
The fundamental architecture used by all personal computers (PCs) and workstations is generally known as the von Neumann architecture, illustrated in block diagram form in FIG. 1. In the von Neumann architecture, a main central processing unit (CPU) 10 is used to sequence its own operations using a program stored in a memory 12. The memory 12, referred to herein as “main memory”, also contains the data on which the CPU 10 operates. In modern computer systems, a hierarchy of cache memories is usually built into the system to reduce the amount of traffic between the CPU 10 and the main memory 12.
The von Neumann approach is adequate for low to medium performance applications, particularly when some system functions can be accelerated by special purpose hardware (e.g., 3D graphics accelerator, digital signal processor (DSP), video encoder or decoder, audio or music processor, etc.). However, the approach of adding accelerator hardware is limited by the bandwidth of the link from the CPU/memory part of the system to the accelerator. The approach may be further limited if the bandwidth is shared by more than one accelerator. Thus, the processing demands of large data sets, such as those commonly associated with large images, are not served well by the von Neumann architecture. Similarly, as the processing becomes more complex and the data larger, the processing demands will not be met even with the conventional accelerator approach.
It should be noted, however, that the von Neumann architecture has some advantages. For example, the architecture contains a homogeneous memory structure allowing large memories to be built from many smaller standard units. In addition, because the processing is centralized, it does not matter where the data (or program) resides in the memory. Finally, the linear execution model is easy to control and exploit. Today's operating systems control the allocation of system memory and other resources using these properties. The problem is how to improve processing performance in a conventional operating system environment where multiple applications share and partition the system resources, and in particular, the main memory.
One solution is to utilize active memory devices, as illustrated in FIG. 2, in the computer system. Put simply, active memory is memory that can do more than store data; it can process it too. To the CPU 10 the active memory looks normal except that it can be told to do something with the data contents without the data being transferred to the CPU or another part of the system (via the system bus). This is achieved by distributing processing elements (PEs) in an array 14 throughout the memory structure, which can all operate on their own local pieces of memory in parallel. In addition, the PEs 16 within the PE array 14 typically communicate with each other, as illustrated in FIG. 3, to exchange data. Thus, active memory encourages a somewhat different view of the computer architecture, i.e., “memory centered” or viewed from the data rather than the processor.
In a computer system having active memory, such as illustrated in FIG. 2, the work of the CPU 10 is reduced to the operating system tasks, such as scheduling processes and allocating system resources and time. Most of the data processing is performed within the memory 12. By having a very large number of connections between the main memory 12 and the processing resources, i.e., the PE array 14, the bandwidth for moving data in and out of memory is greatly increased. A large number of parallel processors can be connected to the memory 12 and can operate on their own area of memory independently. Together these two features can provide very high performance.
There are several different topologies for parallel processors. One example topology is commonly referred to as SIMD (single instruction, multiple data). The SIMD topology contains many processors, all executing the same stream of instructions simultaneously, but on their own (locally stored) data. The active memory approach is typified by SIMD massively parallel processor (MPP) architectures. In the SIMD MPP, a very large number (usually a thousand or more) of relatively simple PEs are closely connected to a memory and organized so that each PE has access to its own piece of memory. All of the PEs execute the same instruction together, but on different data. The instruction stream is generated by a controlling sequencer or processor.
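The SIMD execution model described above can be illustrated with a minimal software sketch. This is only a model of the lock-step behavior, not a description of any particular hardware; the names `run_simd`, `program`, and `pe_data` are hypothetical and chosen for illustration, and each PE's local memory is reduced to a single integer.

```python
def run_simd(program, pe_data):
    """Model a sequencer broadcasting one instruction stream to every PE.

    program: list of one-argument functions (the shared instruction stream).
    pe_data: list holding one locally stored value per PE (an assumption;
             a real PE would own a larger piece of memory).
    """
    state = list(pe_data)
    for instruction in program:
        # Every PE applies the same instruction, each to its own data.
        state = [instruction(value) for value in state]
    return state


# Hypothetical two-instruction program: add one, then double.
program = [lambda x: x + 1, lambda x: x * 2]
result = run_simd(program, [0, 1, 2, 3])
```

Control cost in this model is one loop over the instruction stream regardless of how many PEs there are, which mirrors the low control overhead attributed to SIMD designs.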
The SIMD MPP has the advantage that the control overheads of the system are kept to a minimum, while maximizing the processing and memory access bandwidths. SIMD MPPs, therefore, have the potential to provide very high performance very efficiently. Moreover, the hardware consists of many fairly simple repeating elements. Since the PEs are quite small in comparison to a reduced instruction set computer (RISC), they are quick to implement into a system design and their benefit with respect to optimization is multiplied by the number of processing elements. In addition, because the PEs are simple, it is possible to clock them fast without resorting to deep pipelines.
In one exemplary massively parallel processor array, each PE 16 in the PE array 14 uses only a single pin to connect to the memory 12. Thus, a one bit wide data connection is provided. When this is done, data is stored “bit serially” so that successive bits of a binary value are stored at successive locations in the memory 12. This storage format is referred to as “vertical” storage. Data read from and written to each PE will therefore be read and stored, respectively, “vertically” in successive locations in the memory 12, as illustrated in FIG. 4. For example, if each PE 16a-16n in a row 22 of PE array 14 is an eight bit PE, i.e., it operates on eight bits of data at a time, the data in the memory will be stored in eight successive vertical locations as illustrated. As noted above, each PE is connected to memory 12 by a one bit wide data connection 24. Thus, data from PE 16c will be stored in a byte sized area 20 of memory 12 in successive locations in area 20, i.e., it will be stored vertically as illustrated by arrow 30. The storage of data bit serially has a number of benefits. First, the number of data wires per PE 16 to the memory 12 is kept to a minimum. Second, it allows for variable precision arithmetic to be more easily and efficiently implemented. For example, ten, twelve, or fourteen bit numbers can be stored and processed efficiently. Third, in some cases, the difference in speed of the memory access versus the PE cycle time can be matched by serializing the data access.
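The vertical storage format can be made concrete with a short sketch. It assumes that each memory row holds one bit position for every PE and that bits are stored least significant bit first; neither detail is stated in the text, and the function names and sample data are hypothetical.

```python
def store_vertical(values, bits=8):
    """Return `bits` memory rows; row b holds bit b of every PE's value.

    Column i of the result belongs to PE i, so each value runs
    "vertically" down successive rows (LSB first, an assumption).
    """
    return [[(v >> b) & 1 for v in values] for b in range(bits)]


def load_vertical(rows, pe_index):
    """Reassemble one PE's value by reading its column, one bit per row."""
    value = 0
    for b, row in enumerate(rows):
        value |= row[pe_index] << b
    return value


# Hypothetical data: one eight bit value per PE.
pe_values = [0x01, 0x80, 0xA5, 0xFF]
memory_rows = store_vertical(pe_values)

# Variable precision only changes the number of rows used,
# e.g. a ten bit value occupies ten successive rows.
ten_bit_rows = store_vertical([1000], bits=10)
```

Because precision only affects how many rows a value spans, the same one bit wide connection per PE serves eight, ten, twelve, or fourteen bit data without any change to the wiring.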
There are some drawbacks, however, with storing the data from the PE array 14 bit serially. For example, in most applications, a chip containing a SIMD MPP array 14 and its associated memory 12 will have some form of off-chip interface which allows an external device, such as, for example, CPU 10 as illustrated in FIG. 2, to access the on-chip memory 12. CPU 10 sees data stored word-wide, i.e., “horizontally” as illustrated by arrow 32 in FIG. 4, referred to as normal mode. Thus, for an external device to access data stored vertically, the data must be reorganized, i.e., converted, to the normal mode before being transferred from the memory to the external device, or converted by the external device before it can be used.
Converting between the two formats, i.e., normal and vertical, can be performed within the PE array 14 or within the external device that needs access to the data, but it would be more efficient to store the data in a single format, thus avoiding having to store it in one format and convert it to another. Preferably, the single format would be the normal format used by the external devices.
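The conversion between the normal and vertical formats amounts to transposing a matrix of bits. The following sketch shows what such a software conversion might look like; it is an illustration of the overhead being discussed, not a method taken from the text. It assumes that each memory word is held as an integer, that PE i occupies bit position i of a vertical row, and that values are stored least significant bit first. All names are hypothetical.

```python
def normal_to_vertical(words, bits=8):
    """Transpose normal-mode words into vertical rows.

    words[i] is the value belonging to PE i. In the result, row b packs
    bit b of every word, with PE i at bit position i (an assumption).
    """
    rows = []
    for b in range(bits):
        row = 0
        for i, word in enumerate(words):
            row |= ((word >> b) & 1) << i
        rows.append(row)
    return rows


def vertical_to_normal(rows, num_pes):
    """Inverse transpose: rebuild one normal-mode word per PE."""
    words = []
    for i in range(num_pes):
        word = 0
        for b, row in enumerate(rows):
            word |= ((row >> i) & 1) << b
        words.append(word)
    return words


# Hypothetical data: eight PEs, eight bits each.
words = [0x01, 0x80, 0xA5, 0xFF, 0x00, 0x3C, 0x7E, 0x12]
rows = normal_to_vertical(words)
restored = vertical_to_normal(rows, num_pes=len(words))
```

In this sketch, every transfer touches each bit of each word once in each direction. That per-bit work is the cost that storing the data in a single format would avoid.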
Thus, there exists a need for a connection between a PE array and main memory in an MPP such that software data conversion is not required, and data can be stored in a normal mode or vertical mode in the memory.