This invention broadly relates to parallel processing in the field of computer technology, and more particularly concerns systems, devices and methods for transferring data in an efficient manner for a parallel computer such as a Single Instruction Multiple Data (SIMD) data processor.
Parallel processing is increasingly used to meet the computing demands of the most challenging scientific and engineering problems, since the computing performance required by such problems is usually several orders of magnitude higher than that delivered by general-purpose serial computers.
Whilst different parallel computer architectures support differing modes of operation, in very general terms, the core elements of a parallel processor include a network of processing elements (PEs) each having one or more data memories and operand registers, with each of the PEs being interconnected through an interconnection network (IN).
One of the most extensively researched approaches to parallel processing concerns Array Processors, which are commonly embodied as processors in which a single instruction stream operates on multiple data streams (known as Single Instruction Multiple Data or SIMD processors). The basic processing units of an SIMD processor are an array of processing elements (PEs), memory elements (M), a control unit (CU), and an interconnection network (IN). In operation, the CU fetches and decodes a sequence of instructions from a program, then synchronises all the PEs by broadcasting control signals to them. In turn, the PEs, operating under the control of a common instruction stream, simultaneously execute the same instructions, each on the different data that it fetches from its own memory. The interconnection network facilitates data communication among processing units and memory. Thus the key to parallelism in SIMD processors is that one instruction operates on several operands simultaneously rather than on a single one.
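By way of illustration only, the broadcast behaviour described above can be sketched in software. The following is a minimal sketch, assuming each PE holds a single integer operand in its own memory; the names `simd_broadcast`, `pe_memories` and `program` are hypothetical and do not appear in any referenced document.

```python
from typing import Callable, List


def simd_broadcast(instruction: Callable[[int], int],
                   local_operands: List[int]) -> List[int]:
    """Model one SIMD step: the same instruction is applied by every PE
    to the operand held in that PE's own memory."""
    return [instruction(operand) for operand in local_operands]


# Hypothetical example: four PEs, each holding a different operand.
pe_memories = [1, 2, 3, 4]

# The control unit issues a common instruction stream (assumed here to be
# "add 1" followed by "multiply by 2").
program = [lambda x: x + 1, lambda x: x * 2]

for instr in program:
    pe_memories = simd_broadcast(instr, pe_memories)

print(pe_memories)  # [4, 6, 8, 10]
```

The list comprehension stands in for hardware in which all PEs act in the same cycle; a software loop is sequential, so the sketch models only the result, not the timing.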
An example of such a data processor architecture is disclosed in International Patent Application No. PCT/CA02/00299, Publication No. WO 02/071246, to Atsana Semiconductor Corporation, the entire contents of which are incorporated herein by reference. An example of a data processor disclosed in this document is shown in FIG. 1. The apparatus comprises a memory block 1 and a two-dimensional array of processor elements (PEs) 3, each of which can be coupled to the memory via a switching element (SE) 5. Each processor element may comprise a single bit processor element. A computational unit (CU) comprises a predetermined number of processor elements, generally formed from a row of contiguous PEs, as shown in FIG. 2. Data from the memory block can be downloaded into each CU sequentially, and data processing may be performed by each CU sequentially, row by row, or simultaneously (i.e. in parallel) once data has been downloaded into all CUs.
One of the main advantages of the memory time-multiplex CU architecture is that the vertical transfer of data between the processor elements (or CUs) can be performed very efficiently. For example, referring to FIG. 1, it would take only one cycle to load data from one row of PEs into another row of PEs, such as from row n into row 0. In a previous architecture it would take n−1 cycles (assuming it is possible to write from a CU register to a neighbouring CU register through the switching element(s) 5).
Another advantage of this architecture is that a deeper memory can be used (e.g. 1024 rows or greater) because the memory requirement per CU can be shared, and communication through the Switching Element is minimized, allowing more time for memory accesses. For example, one implementation may require 4 kbytes/CU, and therefore if the memory is shared between four CUs, this would mean that a 16 kbyte deep memory could be used.
As multiple CUs share the same memory space, CU accesses to memory must be pipelined. This means that each row of CUs is loaded with data from the memory in successive cycles so that, for example, the row 0 CUs are loaded with data in one cycle, followed by the row 1 CUs in the next cycle, followed by the row 2 CUs in the next cycle, and so on to row n. This is illustrated in the timing diagram 28 in FIG. 2, which shows successive data download cycles from the memory, where “DATA 0” is the cycle in which data is downloaded from the memory into the first Computational Unit CU0, and so on. As mentioned above, there are two different ways of processing the data row from memory: the first is to pipeline operations, and the second is to wait until all the memory reads are complete and have the CUs operate simultaneously. The reason that this architecture improves timing between the memory and the CU is that the data output from memory only goes through a single switching stage, and the interconnect between that switching stage and its nearest neighbour is very short. A problem with this architecture is that the latency of the device is increased relative to an unpipelined structure in which each processor is arranged in a one-dimensional array and has its own dedicated section of memory so that all processors perform memory reads in parallel.
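The latency penalty described above can be illustrated with a simple cycle count. The following is a minimal sketch, assuming one memory read per cycle in the shared-memory case and a single parallel read cycle in the dedicated-memory case; the function names `load_schedule` and `cycles_until_all_loaded` are hypothetical and are introduced for illustration only.

```python
from typing import Dict


def load_schedule(num_rows: int) -> Dict[str, int]:
    """Return the cycle in which each row of CUs receives its data when
    memory reads are pipelined (row 0 in cycle 0, row 1 in cycle 1, ...)."""
    return {f"CU{row}": row for row in range(num_rows)}


def cycles_until_all_loaded(num_rows: int, pipelined: bool = True) -> int:
    """Cycles needed before every row holds its data.

    Assumption: a shared memory serves one row per cycle, whereas
    dedicated per-processor memories are all read in one cycle.
    """
    return num_rows if pipelined else 1


# Hypothetical example with four rows of CUs sharing one memory.
schedule = load_schedule(4)
shared_latency = cycles_until_all_loaded(4, pipelined=True)
dedicated_latency = cycles_until_all_loaded(4, pipelined=False)

print(schedule)
print(shared_latency, dedicated_latency)
```

Under these assumptions, in the mode where the CUs wait and then operate simultaneously, processing cannot begin until the last row has been loaded, so the start-up latency grows with the number of rows sharing the memory.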