A simple computer generally includes a central processing unit (CPU) and a main memory. The CPU implements a sequence of operations encoded in a stored program. The program and the data on which the CPU acts are typically stored in the main memory. The processing of the program and the allocation of main memory and other resources are controlled by an operating system. In operating systems where multiple applications may share and partition resources, the processing performance of the computer can be improved through use of active memory.
Active memory is memory that processes data as well as storing it. It can be instructed to operate on its contents without transferring its contents to the CPU or to any other part of the system. This is typically achieved by distributing parallel processors throughout the memory. Each parallel processor is connected to the memory and operates on it independently of the others. Most of the data processing is performed within the active memory and the work of the CPU is thus reduced to the operating system tasks of scheduling processes and allocating system resources.
A block of active memory typically consists of the following: a block of memory, e.g. dynamic random access memory (DRAM), an interconnection block, and a memory processor (processing element array). The interconnection block provides a path that allows data to flow between the block of memory and the processing element array. The processing element array typically includes multiple identical processing elements controlled by a sequencer. Processing elements are generally small in area, have a low degree of hardware complexity, and are quick to implement, which leads to increased optimisation. Processing elements are usually designed to balance performance and cost. A simple, more general-purpose processing element will result in a higher level of performance than a more complex processing element because it can be easily replicated to generate many identical processing elements. Further, because of its simplicity, the processing element will clock at a faster rate.
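By way of illustration only, the arrangement described above may be modelled in software as follows. This is a minimal sketch, not an implementation of any particular active memory: the names `ProcessingElement`, `ActiveMemoryBlock`, `from_memory`, `sequence` and `read_back` are hypothetical, and the interconnection block is assumed to be a simple static partition of memory words across the processing elements.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProcessingElement:
    """One simple processing element with its own slice of the memory block."""
    local_memory: List[int]

    def execute(self, op: Callable[[int], int]) -> None:
        # Operate on the stored words in place; nothing is sent to a CPU.
        self.local_memory = [op(word) for word in self.local_memory]


@dataclass
class ActiveMemoryBlock:
    """A memory block served by an array of identical processing elements."""
    elements: List[ProcessingElement] = field(default_factory=list)

    @classmethod
    def from_memory(cls, memory: List[int], num_elements: int) -> "ActiveMemoryBlock":
        # Assumption: the interconnection block is modelled as an even,
        # static partition of the memory words across the elements.
        if num_elements <= 0 or len(memory) % num_elements != 0:
            raise ValueError("memory must divide evenly across the elements")
        size = len(memory) // num_elements
        return cls([ProcessingElement(memory[i * size:(i + 1) * size])
                    for i in range(num_elements)])

    def sequence(self, op: Callable[[int], int]) -> None:
        # The sequencer broadcasts the same operation to every element.
        # (Real elements would run concurrently; this loop is sequential.)
        for element in self.elements:
            element.execute(op)

    def read_back(self) -> List[int]:
        # Gather the words from all elements in their original order.
        return [word for element in self.elements for word in element.local_memory]


block = ActiveMemoryBlock.from_memory(list(range(8)), num_elements=4)
block.sequence(lambda word: word * 2)
print(block.read_back())
```

In this model the host only issues the operation and reads back the result; all of the arithmetic is carried out by the processing elements on their own portions of the memory.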
In any computer system, it is important that data is processed efficiently in order to maximise the speed of the processor. In a parallel processor containing a plurality of processing elements, it is important to maximise the speed of movement of data from an input to the processing element through processing logic to an output of the processing element.
Moreover, it is important to ensure that data generated by one part of the processing element is ready for use by another part or by another processing element as and when it is required.
In a parallel processor having a plurality of processing elements, data is often transferred between the individual processing elements, in addition to being transferred between a particular processing element and its memory or host CPU. This further increases the complexity of inputting data to and outputting data from the processing element and can further reduce the speed of the processing element.
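The cost of transfers between processing elements can be illustrated with a small model. The following is a sketch under stated assumptions: the processing elements are assumed to be arranged in a line, each holding one register and able to pass its value only to its right-hand neighbour in a single step. The functions `shift_between_elements` and `running_totals` are hypothetical names introduced for this example.

```python
from typing import List


def shift_between_elements(registers: List[int], fill: int = 0) -> List[int]:
    """One transfer step: every element passes its value to its right-hand neighbour."""
    if not registers:
        return []
    # The leftmost element has no neighbour to receive from, so it gets `fill`.
    return [fill] + registers[:-1]


def running_totals(registers: List[int]) -> List[int]:
    """Each element accumulates the values held by itself and all elements to its left."""
    totals = list(registers)
    moving = list(registers)
    # Assumed linear topology: n elements need n - 1 neighbour transfers
    # before the value of the first element has reached the last one.
    for _ in range(len(registers) - 1):
        moving = shift_between_elements(moving)
        totals = [total + received for total, received in zip(totals, moving)]
    return totals


print(running_totals([1, 2, 3, 4]))
```

Every pass through the loop is a transfer step during which each processing element must both output and input a value, which is why the efficiency of these transfers bears directly on the overall speed of the processing element.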
Accordingly, it is an object of the present invention to provide efficient scheduling and transfer of data within the processing element.
It is a further object of the present invention to provide a more flexible processing element, within which data can be efficiently transferred between components of the processing element.
It is yet a further object of the present invention to provide faster transfer out of the processing element of results of processing operations occurring therein.