A simple computer generally includes a central processing unit (CPU) and a main memory. The CPU implements a sequence of operations encoded in a stored program. The program and the data on which the CPU acts are typically stored in the main memory. The processing of the program and the allocation of main memory and other resources are controlled by an operating system. In operating systems where multiple applications may share and partition resources, the processing performance of the computer can be improved through use of active memory.
Active memory is memory that processes data as well as storing it. It can be instructed to operate on its contents without transferring its contents to the CPU or to any other part of the system. This is typically achieved by distributing parallel processors throughout the memory. Each parallel processor is connected to the memory and operates on it independently. Most of the data processing can be performed within the active memory and the work of the CPU is thus reduced to the operating system tasks of scheduling processes and allocating system resources.
A block of active memory typically consists of the following: a block of memory, e.g. dynamic random access memory (DRAM), an interconnection block and a memory processor (processing element array). The interconnection block provides a path that allows data to flow between the block of memory and the processing element array. The processing element array typically includes multiple identical processing elements controlled by a sequencer. Processing elements are generally small in area, have a low degree of hardware complexity, and are quick to implement, which leaves more scope for optimisation. Processing elements are usually designed to balance performance and cost. A simple, more general-purpose processing element will result in a higher level of performance than a more complex processing element because it can easily be coupled to many identical processing elements. Further, because of its simplicity, the processing element will clock at a faster rate.
In any computer system, it is important that data can be made available to the processor as quickly as possible. In a parallel processor, the organisation of data in the processing element array is an important part of the execution of many algorithms. Hence, the provision of an efficient means of moving data from one processing element to another is an important consideration in the design of the processing element array.
In the past, several different methods of connecting processing elements have been used in a variety of geometric arrangements, including hypercubes, butterfly networks, one-dimensional strings/rings and two-dimensional meshes. In a two-dimensional mesh, the processing elements are arranged in rows and columns, with each processing element being connected to its four neighbouring processing elements in the rows above and below and the columns on either side (directions herein referred to as north, south, east and west).
In current systems, movement of data between the processing elements generally occurs as a parallel operation, i.e. every processing element sends data to and receives data from other processing elements at the same time. Data can be viewed as being shifted in one of four directions (north, south, east and west) along all of the processing elements in the rows or columns.
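The parallel shift described above can be sketched in a few lines of Python. This is an illustrative model only, not taken from any particular hardware: the function name `shift`, the list-of-lists representation of the array, and the `fill` value written into vacated positions are all assumptions, and north is taken to mean the row above (lower row index).

```python
# Hypothetical model of a lock-step shift on a two-dimensional mesh.
# Each inner list is one row of processing elements; each entry is the
# single value held by that element.

# Row/column offsets for each direction (assumption: north = row above).
OFFSETS = {
    "north": (-1, 0),
    "south": (1, 0),
    "east": (0, 1),
    "west": (0, -1),
}


def shift(grid, direction, fill=0):
    """Move every element's value one step in `direction` at the same time.

    Values that would leave the array are dropped, and positions with no
    incoming value receive `fill` (both are modelling assumptions).
    """
    rows, cols = len(grid), len(grid[0])
    dr, dc = OFFSETS[direction]
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                out[nr][nc] = grid[r][c]
    return out
```

Because every element takes part in the same shift, a single call moves the whole array; there is no way in this model to move one value while leaving the others in place.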
In addition, there may also be a column of edge registers located along an east or west side of the processing element array and a row of edge registers located along a north or south side, each register being connected to the processing elements at both ends of every row or column. The edge registers permit data to be shifted into or out of the processing element array as data is shifted along the rows or columns.
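One way to picture the role of the edge registers is the following sketch of an eastward shift. It assumes, for illustration only, one edge register per row that supplies the value entering at the west end and captures the value leaving at the east end in the same step; the function name `shift_east_with_edge` and the data layout are hypothetical.

```python
def shift_east_with_edge(grid, edge):
    """Shift every row one step east through a column of edge registers.

    Assumption: edge[r] is the register attached to both ends of row r.
    Its old value enters the row at the west end, and the value leaving
    the east end of the row becomes its new value.
    Returns the new grid and the new edge-register column.
    """
    new_grid = []
    new_edge = []
    for row, reg in zip(grid, edge):
        new_grid.append([reg] + row[:-1])  # register value enters from the west
        new_edge.append(row[-1])           # east-most value leaves the array
    return new_grid, new_edge
```

For example, shifting the rows `[1, 2, 3]` and `[4, 5, 6]` east with edge values `[10, 20]` gives rows `[10, 1, 2]` and `[20, 4, 5]`, with `[3, 6]` left in the edge registers.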
One problem with current systems and methods of shifting data between processing elements in a processing element array is that every processing element has to send and receive data at the same time. Thus, movement of data around the processing element array can generally only occur in a limited number of different transformations, and movement of data between two processing elements that are not neighbours has to take place over a number of shift operations. It is therefore desirable to reduce the number of shift operations that are required to move data between non-neighbouring processing elements.
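The cost of reaching a non-neighbouring processing element can be made concrete with a small sketch. Assuming each shift operation moves data by exactly one position north, south, east or west, and assuming no wrap-around at the array edges, the minimum number of shifts between two elements is the sum of their row and column separations. The function name `shifts_needed` and the `(row, column)` coordinates are hypothetical.

```python
def shifts_needed(src, dst):
    """Minimum number of single-step shifts to move a value from src to dst.

    src and dst are (row, column) positions in the mesh. Assumes one
    position per shift and no wrap-around, so the count is the sum of
    the row distance and the column distance.
    """
    (r0, c0), (r1, c1) = src, dst
    return abs(r1 - r0) + abs(c1 - c0)
```

Under these assumptions, moving a value from position (0, 0) to position (2, 3) takes five shift operations, whereas neighbouring elements need only one.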
Accordingly, it is an object of the present invention to provide a more efficient means of moving data from one processing element to another.
It is a further object of the present invention to provide a more flexible parallel processor in which data can be moved easily between non-neighbouring processing elements in a single operation.