The present invention concerns improvements relating to data addressing techniques and more particularly, though not exclusively, to a method of performing address calculation for a data-parallel SIMD processor that provides enhanced performance.
Single Instruction Multiple Data (SIMD) processors are a class of data-parallel processors in which a plurality of processing elements making up the processor perform the same operation at the same time but on different data. SIMD processors have a relatively large array of identical processing elements coupled together with a single control unit. Each processing element acts as a small basic computer in that it has an Arithmetic Logic Unit (ALU), a plurality of data registers and a local data memory. The control unit simultaneously controls the operations carried out on each processing element. These operations basically consist of reading the data from the data memory, carrying out an operation on the data in the ALU, which may involve manipulation of the data using the data registers, and then writing the result back to memory.
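By way of illustration only, the read, operate and write-back cycle described above can be modelled in a few lines of Python. This is a minimal sketch and not a description of any particular hardware: the names ProcessingElement and broadcast, the single register per element and the two-word local memory are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessingElement:
    """Hypothetical model of one processing element: local memory plus one register."""
    local_memory: List[int]
    register: int = 0


def broadcast(pes: List[ProcessingElement], read_addr: int,
              op: Callable[[int], int], write_addr: int) -> None:
    """Model of the control unit issuing one instruction sequence to every element.

    Each phase is applied to all elements before the next phase starts,
    mimicking lockstep execution (the Python loops themselves run serially).
    """
    for pe in pes:                      # phase 1: read from local data memory
        pe.register = pe.local_memory[read_addr]
    for pe in pes:                      # phase 2: same ALU operation, different data
        pe.register = op(pe.register)
    for pe in pes:                      # phase 3: write the result back to memory
        pe.local_memory[write_addr] = pe.register


# Four elements, each holding a different data item at local address 0.
pes = [ProcessingElement(local_memory=[value, 0]) for value in (1, 2, 3, 4)]
broadcast(pes, read_addr=0, op=lambda x: x * 2, write_addr=1)
results = [pe.local_memory[1] for pe in pes]
```

In this sketch every element receives the same local address and the same operation; only the data held in each local memory differs.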
Existing data-parallel SIMD processors typically process data sets loaded into the data memories of the processing elements from an external data source such as a video memory. The loaded data can then be processed in parallel by the SIMD processor. Typically, the data sets in question are highly regular data, for example two-dimensional images (such as pixel arrays) or three-dimensional volume data (such as voxel arrays). Given the inherent regularity of the data sets, the transfer of data into the distributed data memories of the data-parallel processor array can normally be accomplished by a conventional DMA (Direct Memory Access) unit, or a dedicated unit with similar (although enhanced) memory data handling functionality. Examples of this type of memory can be seen in U.S. Pat. Nos. 4,835,729 and 5,581,773.
When more complex address transformations are required, for example when the addresses of the data items are non-contiguous, a conventional CPU (Central Processing Unit), with its complex address calculation support, actually offers superior performance over the DMA unit solution. This performance differential is maintained even if the data processing part of the problem is inherently data-parallel. Accordingly, one performance enhancing solution is to employ a ‘conventional’ processor to precompute the addresses and prefetch the data vectors for the SIMD parallel processor to process.
The disadvantage of this prior art solution is that it is non-optimal: it does not take advantage of the inherent parallel nature of the data-parallel processor to compute the address vectors itself. Moreover, if a conventional processor is used to compute the addresses, it may itself become the bottleneck if it cannot keep pace with the enormous demand for data from the parallel processor or, conversely, if the address computational task is too onerous.
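The prior art arrangement discussed above, in which a conventional processor precomputes the addresses and prefetches the data vector, can be sketched as follows. This is an illustrative model only: the function names, the index table used to produce non-contiguous addresses and the memory contents are hypothetical and are not taken from any cited system.

```python
from typing import Callable, List


def precompute_addresses(count: int, transform: Callable[[int], int]) -> List[int]:
    """Host-side step: one address is computed per loop iteration, serially."""
    return [transform(i) for i in range(count)]


def prefetch(memory: List[int], addresses: List[int]) -> List[int]:
    """Host-side step: fetch each addressed item to build the data vector."""
    return [memory[address] for address in addresses]


def simd_process(vector: List[int], op: Callable[[int], int]) -> List[int]:
    """Stand-in for the data-parallel processor applying one operation to all items."""
    return [op(item) for item in vector]


# Hypothetical external memory of 16 words and a data-dependent index table,
# so the resulting addresses are non-contiguous and follow no fixed stride.
memory = [10 * address for address in range(16)]
index_table = [7, 2, 11, 2, 0, 15]

addresses = precompute_addresses(len(index_table), lambda i: index_table[i])
vector = prefetch(memory, addresses)
processed = simd_process(vector, lambda x: x + 1)
```

The two host-side functions run one item at a time, which is where the bottleneck noted above would arise in this model: the parallel stage cannot start until the serial address and fetch loops have produced the whole vector.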
A specific class of SIMD processor is the ASP (Associative String Processor). This technology utilizes a hierarchy of memory transfers to secure the transfer of data from a conventional memory to the data-parallel processor. This is shown in FIG. 1, where the hierarchy of system memory transfer elements of an ASP 10 is shown. Here a tertiary memory 12 acts as a conventional memory and this is coupled in turn to a Secondary Data Store (SDS) 14, a Primary Data Store (PDS) 16 and finally a data-parallel processor 18. The SDS 14 has the ability to store regular (contiguous) data and to supply that regular data to the PDS 16 in an optimized manner, as has been described above in relation to a DMA data transfer. However, the SDS 14 also stores non-contiguous data, which is supplied in a non-optimal manner. The PDS 16 is tightly coupled to the data-parallel processor 18 and is analogous to a data cache.
More specifically, referring to FIG. 2, a memory access controller 28 of a conventional ASP 10 comprises at its heart a Secondary Data Movement Controller (SDMC) 30 which generates addresses and coordinates data transfers to and from the data-parallel processor 18 and external memories. One such external memory is the SDS 14, and data is supplied to and stored in the SDS 14 via a secondary data memory bus 34. Also, the SDMC 30 is coupled to the tertiary memory 12 via a tertiary memory interface 32. The secondary data memory bus 34 is controlled by a Secondary Memory Bus Interface and Arbiter (bus arbiter) 36 which handles requests for data generated by the SDMC 30 and the transfer of data to the SDMC 30 from the SDS 14. As has been described previously, the SDS 14 is used for storing contiguous data such as video data, and the regularity of this data enables the SDMC 30 to carry out block data transfers from the SDS 14 to itself and then on to the data-parallel processor 18. In this regard, the SDMC 30 has the role of a sophisticated DMA unit, responsible for movement of regular data sets (i.e. 2D arrays of pixels or 3D arrays of voxels).
The data transfer facilities also include a secondary data transfer interface 38 and a primary data memory interface 40 for converting data into a suitable format for transmission between the PDS 16 and the SDMC 30. Accordingly, a coordinating processor generates a request for a block of data, which involves calculating the addresses of the required data. This request is transmitted to the SDMC 30 and the resulting data is obtained from the SDS 14. Typically, contiguous data in the SDS 14 is transferred relatively quickly, using the SDMC 30 as a DMA controller. Otherwise, discrete addressing of memory is used to fetch the required data, which is relatively slow. This is particularly the case when the data-parallel processor 18 requires non-contiguous (non-sequential) addresses that cannot be used in a block data transfer operation.
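The difference between a block data transfer and discrete addressing can be illustrated with a simplified cost model. The sketch below assumes, purely for the purpose of the example, that each run of consecutive addresses can be serviced by a single block transfer request, while every break in the sequence forces a new request. The function name and the address values are hypothetical, and the model does not represent the actual timing of the SDMC 30 or the bus arbiter 36.

```python
from typing import List


def count_transfer_requests(addresses: List[int]) -> int:
    """Count requests under the assumed model: one request per run of consecutive addresses."""
    if not addresses:
        return 0
    requests = 1
    for previous, current in zip(addresses, addresses[1:]):
        if current != previous + 1:   # sequence broken: a new request is needed
            requests += 1
    return requests


# Eight contiguous words, as in a row of pixels: a single block transfer suffices.
contiguous = list(range(64, 72))
# Six non-sequential addresses: each item must be addressed discretely.
scattered = [7, 2, 11, 3, 0, 15]

block_requests = count_transfer_requests(contiguous)
discrete_requests = count_transfer_requests(scattered)
```

Under this assumed model the contiguous case needs one request regardless of its length, whereas the number of requests for the non-sequential case grows with the number of items fetched.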
It is desired to overcome or at least substantially reduce at least some of the above described problems/disadvantages.