The invention concerns a procedure for parallel data processing, in which data are read out from a data memory and are conveyed to processing units for parallel processing via a communications unit.
The invention also concerns a processor arrangement for parallel data processing, with a data memory and parallel data-processing units, which are connected to each other via a communications unit.
Processors with parallel data processing are known, such as those described in the book by J. Hennessy, D. Patterson: Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers Inc., 1990. There, in order to achieve parallel processing, the processing units of an architecture are provided in multiple. To use these parallel processing units simultaneously, data must be conveyed to them in parallel. For that purpose, it may be necessary to provide multiple data units. This requires either a multiprocessor system or a data memory that is shared by several processing units.
In these known arrangements, it is a disadvantage that, in a design with multiple processing units, one must either provide multiple data memories or else one must insert sets of intermediate registers which are capable of re-ordering their contents by means of a fully linked communications network. This in turn requires either structuring several address-generating units, e.g., memory ports, as well as the structuring of a connecting network between these memories, or else requires a full connecting network between the sets of registers.
For instance, circuits from Texas Instruments Inc., designated C80 and C82, are known, which have several address-generating units, memory ports, and a comprehensive connecting network. They feature four and two digital signal processor cores, respectively, as well as a reduced-instruction-set computer (RISC), all of which are connected to a memory by means of a crossbar network.
Furthermore, from C. Hansen, "Microunity's Media Processor Architecture," IEEE Micro, pp. 34-38, August 1997, a processor provided with a complete connecting network is known, in which the executing units are connected directly with the set of registers. Communications are carried out between two sets of registers and are provided by the general network.
It is an object of the invention to raise the degree of parallelization of a processor architecture without increasing the number of memories and/or the width of the connecting network.
This object is achieved in that the data, divided into data groups with several elements, are stored under one and the same address. A processing unit is allocated to each element of a data group, in that the element can be connected directly with the allocated processing unit, bypassing the communications unit. Simultaneously and in parallel, a data group is read out from the data memory, is divided up among several processing units, and is processed in parallel in these processing units.
For a data group to be read out from the data memory, it suffices that the address of this data group be called up. Individual addressing of the data elements can then be omitted. Next, each data group can be conveyed either directly to the allocated processing unit or else can be distributed to other processing units with the aid of the communications unit. If the processing units have computed several data items, these can in turn be written into the data memory or else distributed via the communications unit.
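The read-out described above can be illustrated with a minimal software sketch. It is not the claimed circuit; the names `GroupMemory`, `process_group`, and the `routing` parameter are hypothetical and chosen only for illustration. The sketch assumes one processing unit per element, with `routing[i]` naming the element that unit `i` receives: the identity routing stands for the direct path, and any other routing stands for distribution via the communications unit.

```python
from typing import Callable, Dict, List, Optional


class GroupMemory:
    """Memory in which a single address holds a whole data group."""

    def __init__(self) -> None:
        self._cells: Dict[int, List[int]] = {}

    def write_group(self, address: int, group: List[int]) -> None:
        self._cells[address] = list(group)

    def read_group(self, address: int) -> List[int]:
        # One address fetches every element; no per-element addressing.
        return list(self._cells[address])


def process_group(
    memory: GroupMemory,
    address: int,
    operations: List[Callable[[int], int]],
    routing: Optional[List[int]] = None,
) -> List[int]:
    """Read one group and let each processing unit handle one element."""
    group = memory.read_group(address)
    if len(operations) != len(group):
        raise ValueError("one processing unit per element is assumed")
    if routing is None:
        # Direct path: element i goes straight to processing unit i.
        routing = list(range(len(group)))
    # Any non-identity routing models the communications unit.
    inputs = [group[source] for source in routing]
    return [op(value) for op, value in zip(operations, inputs)]


memory = GroupMemory()
memory.write_group(0x10, [1, 2, 3, 4])
double = [lambda x: x * 2] * 4

direct = process_group(memory, 0x10, double)
broadcast = process_group(memory, 0x10, double, routing=[0, 0, 0, 0])
memory.write_group(0x11, direct)  # results written back as a new group
```

In this sketch the results are written back under a second address as one group, which mirrors the option of returning computed data to the data memory.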
The decision as to whether to convey the data group directly to the processing unit or to distribute it via the communications unit to other processing units depends on the object to be achieved. It is clear, however, that because the data groups can be applied directly to the allocated processing units, the load on the communications unit is reduced, so that its width, and hence its cost, can be reduced.
One embodiment of the invention provides that data can be directly shifted from the processing units to mutually adjacent processing units. This direct shifting also causes a further decrease in the load of the communications unit, which promotes the reduction in its width.
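The direct shifting between adjacent processing units can be sketched as follows. The function name `shift_to_neighbors` and its parameters are hypothetical. The sketch assumes a non-cyclic arrangement in which the outermost unit receives a fill value; the text does not state how the end units behave.

```python
from typing import List, Optional


def shift_to_neighbors(
    values: List[int], step: int = 1, fill: Optional[int] = None
) -> List[Optional[int]]:
    """Pass the value of unit i to unit i + step without a shared network.

    Assumption: no wrap-around; units without a source receive `fill`.
    """
    count = len(values)
    shifted: List[Optional[int]] = [fill] * count
    for index, value in enumerate(values):
        target = index + step
        if 0 <= target < count:
            shifted[target] = value
    return shifted


to_the_right = shift_to_neighbors([10, 20, 30, 40])
to_the_left = shift_to_neighbors([10, 20, 30, 40], step=-1)
```

Because every transfer here involves only neighbouring units, none of it would need to pass through the communications unit.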
In an embodiment of a method according to the invention, it is provided that elements of a data group are distributed via the communications unit to one or several processing units.
Before processing the elements of a data group in the processing units, the elements can be delayed by one step, i.e., until the arrival of a new element in a following step. In this fashion, one can finish the treatment of the data from the preceding step in the processing units such that finished data are then available at the output of the processing units when the processing computation of the current elements of the data group is concluded. In the meantime, the data at the processing unit, i.e., the results of the preceding step, can be used by other processing units or else can be written back into the data memory.
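One possible reading of this one-step delay is a holding register in front of each processing unit, sketched below. The class `DelayedUnit` and its attribute names are hypothetical, and the exact timing in the actual circuit may differ; the sketch only shows that the result of the preceding element stays available while a new element arrives.

```python
from typing import Callable, Optional


class DelayedUnit:
    """Processing unit whose input is held for one step before processing."""

    def __init__(self, operation: Callable[[int], int]) -> None:
        self._operation = operation
        self._pending: Optional[int] = None  # element held from the last step
        self.output: Optional[int] = None    # result visible to other units

    def step(self, element: int) -> Optional[int]:
        # Process the element that arrived in the preceding step, if any.
        if self._pending is not None:
            self.output = self._operation(self._pending)
        # Hold the new element until the next step.
        self._pending = element
        return self.output


unit = DelayedUnit(lambda x: x + 1)
outputs = [unit.step(element) for element in [1, 2, 3]]
```

Between two calls to `step`, the attribute `output` holds the result of the preceding step, which is the value that other units could read or that could be written back to memory.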
With respect to the arrangement, the object is achieved by an architecture in which the data memory is designed as a group memory. In the group memory, at least one data group with several elements is stored under one address. The communications unit is designed as an overall communications unit. This means that the overall communications unit features a width which is less than the number of elements in one data group. The data memory is linked directly with the overall communications unit. To each element of the data group is allocated a part of the overall communications unit and a processing unit, which consists of a number of process units and a number of memory units.
This allocated part of the overall communications unit and of the processing unit are arranged in a strip. This strip is adjacent to other strips of the same structure. In the width of one element of a data group within a strip, the data memory is directly connected with the memory elements of an allocated processing unit.
By means of this arrangement, it is possible to either feed data elements of a data group directly to an allocated processing unit, or else to distribute them via the communications unit to other processing units. This is also supported geometrically by the stripwise allocation of elements of the data group, part of the overall communications unit and processing unit; this makes it possible to design the communications unit as an overall communications unit with a reduced width, compared to the prior art.
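The effect of the direct path on the required width can be illustrated with a small sketch. The function `route_group` is hypothetical, and it assumes that the width of the overall communications unit equals the number of elements that can cross between strips in one step; the text does not define the width in these terms.

```python
from typing import List


def route_group(group: List[int], sources: List[int], overall_width: int) -> List[int]:
    """Deliver to strip i the element named by sources[i].

    Transfers with sources[i] == i use the direct link of the strip.
    All other transfers use the overall communications unit, which is
    assumed to carry at most `overall_width` elements per step.
    """
    if len(sources) != len(group):
        raise ValueError("one source index per strip is required")
    if not 0 <= overall_width < len(group):
        raise ValueError("width must be below the number of elements per group")
    crossing = [strip for strip, source in enumerate(sources) if source != strip]
    if len(crossing) > overall_width:
        raise ValueError("too many cross-strip transfers for this width")
    return [group[source] for source in sources]


all_direct = route_group([5, 6, 7, 8], [0, 1, 2, 3], overall_width=0)
one_crossing = route_group([5, 6, 7, 8], [1, 1, 2, 3], overall_width=1)
```

Under this assumption, a workload that uses only direct links runs with a width of zero, and each additional cross-strip transfer per step adds one to the width needed.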
In those cases in which there are very frequent communications between the data memory and the allocated processing unit, such communication can be carried out via a direct link. The greater the proportion of these direct connections, the greater the possible reduction in the width of the overall communication unit. In a further embodiment, local communications units are provided and arranged between the processing units of adjacent strips. These local communications units feature a width which is at least one (1) and is at most equal to twice the number of memory elements in a processing unit.
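The width limits stated in the text can be collected in a short check. The function `widths_are_valid` and its parameter names are hypothetical; the bounds themselves are taken from the description: the overall width is below the number of elements in a data group, and a local width, where present, lies between one and twice the number of memory elements in a processing unit.

```python
from typing import Optional


def widths_are_valid(
    group_size: int,
    overall_width: int,
    memory_elements: int,
    local_width: Optional[int] = None,
) -> bool:
    """Check the width bounds given in the description."""
    # Overall communications unit: narrower than one data group.
    overall_ok = 0 <= overall_width < group_size
    # Local communications unit (optional): 1 .. 2 * memory elements.
    local_ok = local_width is None or 1 <= local_width <= 2 * memory_elements
    return overall_ok and local_ok


example = widths_are_valid(group_size=8, overall_width=2, memory_elements=4, local_width=8)
```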
By way of the local communications units, data of mutually adjacent processing units, in particular processing results, can be exchanged without having to use the overall communications unit for that purpose. This makes an additional contribution to the load reduction of the overall communications unit, and makes it possible to design this overall communications unit in a narrower fashion.
In a further embodiment of the invention, it is provided that the memory elements are designed as registers.
In a simple structure of the circuit arrangement according to the invention, it is provided that the width of the overall communications unit is equal to 0. In that case, the data memory is designed as a group memory, in that at least one data group with several elements is stored under one address. To each element of the data group is allocated a processing unit, consisting of a number of process units and a number of memory units. The latter are arranged in a strip which is adjacent to other strips with the same structure. The data memory is directly connected with the memory elements of an allocated processing unit, in the width of one element of one data group, within the strip.