Processors for parallel data processing have long been known. A characteristic of common parallel processor architectures is the provision of a plurality of processing units, by which data can be processed in parallel. Such an architecture, and a method associated with its processing units, are described, for example, in German Letters of Disclosure DE 198 35 216. That document describes data in a data memory being split into data groups, each data group having a plurality of elements that are stored under one and the same address. Each element of a data group is assigned to a processing unit. All data elements of a group are read out of the data memory simultaneously and distributed as input data to one or more processing units, where they are processed in parallel under clock control. The parallel processing units are connected together via a communication unit. A processing unit comprises at least one process unit and one storage unit, arranged in a strip. Each strip is generally adjacent to at least one additional strip of like structure.
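The arrangement described above can be illustrated with a minimal software sketch. It is not taken from DE 198 35 216; the names NUM_STRIPS, data_memory and simd_step, the strip count of four, and the sample values are assumptions chosen only for illustration. Each address of the data memory holds one data group, and each element of the group is handed to its own strip, where the same operation is applied.

```python
from typing import Callable, List

NUM_STRIPS = 4  # assumed number of parallel strips (illustrative only)

# Hypothetical data memory: each address holds one data group,
# and each data group holds one element per strip.
data_memory: List[List[int]] = [
    [1, 2, 3, 4],      # data group stored under address 0
    [10, 20, 30, 40],  # data group stored under address 1
]


def simd_step(address: int, op: Callable[[int], int]) -> List[int]:
    """Read one data group and apply the same operation in every strip.

    The list comprehension runs sequentially in software; it only models
    the strips, which in hardware would operate at the same time.
    """
    group = data_memory[address]  # all elements are read under one address
    assert len(group) == NUM_STRIPS
    return [op(element) for element in group]  # one element per strip


doubled = simd_step(0, lambda x: 2 * x)
```

In this sketch the result of each strip stays local to that strip; no value crosses from one strip to another.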
Such processors may be referred to as Single Instruction Multiple Data (SIMD) vector processors. In SIMD processors, the respective data elements are processed in the parallel data paths (i.e., strips) as described above. Depending upon the program to be processed, the partial results may be written to the group memory as corresponding data elements or as data groups. Under some circumstances, however, it may be necessary to bring together processed data from parallel data paths. For example, in the performance of an algorithm on the vector processor, it may be necessary to link data calculated locally in a plurality of strips, or alternatively in all strips, into a global intermediate result. For this purpose, in the prior art, the partial results of the strips have been combined with the aid of a program over a plurality of clock cycles in order to obtain the desired intermediate result. If this global intermediate result is required for subsequent calculations of the algorithm, calculation of the end result is delayed.
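The cost of the program-based approach can be made concrete with a short sketch. It assumes, purely for illustration, that the global intermediate result is a sum and that the program folds in one partial result per clock cycle; the function name program_reduce and the cycle model are hypothetical and do not describe any particular prior-art processor.

```python
from typing import List, Tuple


def program_reduce(partials: List[int]) -> Tuple[int, int]:
    """Combine per-strip partial results one at a time.

    Returns the global sum and the number of modelled clock cycles,
    assuming one addition is performed per cycle.
    """
    total = partials[0]
    cycles = 0
    for value in partials[1:]:
        total += value  # one partial result folded in per modelled cycle
        cycles += 1
    return total, cycles


# Eight strips, each holding one locally calculated partial result.
global_sum, cycles_used = program_reduce([3, 1, 4, 1, 5, 9, 2, 6])
```

Under these assumptions, eight strips require seven cycles before the global result is available, and any calculation that depends on that result must wait for all of them.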
Consideration is now being given to improved parallel processing methods and arrangements. Desirable processing methods and arrangements achieve higher processing speeds, for example, by incorporating processor functionality that permits local data from individual strips to be linked without a great expenditure of time.