The present invention relates to data processors and in particular to processors having a plurality of processor units capable of operating in parallel.
A bus is a collection of wires through which data is transmitted from one part of a computer to another. An internal bus connects the internal computer components to the CPU and main memory, and an expansion bus enables expansion boards to access the CPU and memory. The size of a bus, known as its width, is important because it determines how much data can be transmitted at one time. For example, a 16-bit bus can transmit 16 bits of data at a time, whereas a 32-bit bus can transmit 32 bits of data at a time. A point-to-point bus directly connects two communicating components, carrying signals from a specific source to a specific destination, e.g., a computer and a printer connected by a ribbon cable. A broadcast bus is used to communicate with several devices; because every device connected to the bus receives every signal that is broadcast, the address of the device intended to receive the signal must be broadcast as well. Generally, a data bus is used for transferring data, an address bus is used for identifying where the data is going, and a control bus is used for carrying control signals such as read or write.
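The relationship between bus width and transfer count described above can be illustrated with a short calculation. This is a minimal sketch only; the function name `transfers_needed` and the 64-bit payload are assumptions chosen for illustration and do not appear in the specification.

```python
def transfers_needed(payload_bits: int, bus_width_bits: int) -> int:
    """Return how many bus transfers are needed to move payload_bits
    over a bus that carries bus_width_bits per transfer.

    Hypothetical helper for illustration only.
    """
    if payload_bits < 0 or bus_width_bits <= 0:
        raise ValueError("payload must be >= 0 and bus width must be > 0")
    # Integer ceiling division: a partially filled transfer still costs one transfer.
    return (payload_bits + bus_width_bits - 1) // bus_width_bits


# Moving the same 64-bit payload over buses of different widths.
sixteen_bit = transfers_needed(64, 16)     # 16-bit bus
thirty_two_bit = transfers_needed(64, 32)  # 32-bit bus
```

Under these assumptions, doubling the bus width halves the number of transfers needed for the same payload, at the cost of twice as many wires to route.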
A general goal for computers is ever faster operation. One solution has been to develop individual processor units with higher operating speeds. Another solution has been to develop computers with multiple processor units operating in parallel. Compared to a computer with a single processor, however, parallel computers have not achieved the increase in operating speed that might be expected. As the number of parallel processor units has increased, the interplay between the parallel processors has become much more complex and the marginal increase in operating speed has fallen.
SIMD (Single Instruction, Multiple Data), which represents one of the styles of parallel processing, is a set of operations for efficiently handling large quantities of data in parallel, as in a vector processor or array processor. The most important architectural aspect of SIMD is the organization of the processor array. One such architecture is the processing-element-to-processing-element organization. In this configuration, N processing elements are connected via an interconnection network. Each processing element (PE) is a processor with local memory. The PEs execute the instructions that are distributed to them by an array control unit (ACU) via a broadcast bus. A second SIMD architecture is the processor-to-memory organization. In this configuration, a bidirectional interconnection network connects the N processors and M memory modules. The processors are controlled by the ACU via the broadcast bus. Data is exchanged between processors via the interconnection network and the memory modules. Again, data transfers between the memories and the I/O interface are handled via the I/O bus, and a result bus is used.
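The processing-element-to-processing-element organization described above can be modeled in a few lines. This is a simplified sketch, not a description of any particular hardware: the class names `ProcessingElement` and `ArrayControlUnit` and the method names are hypothetical, the interconnection network is omitted, and the broadcast bus is modeled as a loop that hands the same instruction to every PE.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ProcessingElement:
    """A PE: a processor with its own local memory (hypothetical model)."""
    local_memory: List[int]

    def execute(self, instruction: Callable[[int], int]) -> None:
        # Apply the broadcast instruction to every word of this PE's local data.
        self.local_memory = [instruction(word) for word in self.local_memory]


class ArrayControlUnit:
    """An ACU that distributes one instruction to all PEs (hypothetical model)."""

    def __init__(self, elements: List[ProcessingElement]) -> None:
        self.elements = elements

    def broadcast(self, instruction: Callable[[int], int]) -> None:
        # Single instruction, multiple data: every PE receives the same
        # instruction and applies it to its own local memory.
        for pe in self.elements:
            pe.execute(instruction)


# Two PEs holding different data receive the same "add 10" instruction.
pes = [ProcessingElement([1, 2]), ProcessingElement([3, 4])]
acu = ArrayControlUnit(pes)
acu.broadcast(lambda word: word + 10)
memories = [pe.local_memory for pe in pes]
```

In real hardware the PEs would operate simultaneously; the sequential loop here only stands in for the broadcast bus delivering one instruction to all of them.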
One of the impediments to high-speed parallel processing has been long routing time. Typically, a broadcast bus is simply routed to all of the processing units. In conventional data processors having 8-bit computational units, the design has been simplified by running a single-bit broadcast line and loading the 8 bits one at a time (a serial transfer of bits). Another way to avoid a broadcast bus is to keep all of the constants in each computational unit's own memory space. The problem with this approach is that if there are many constants, a great deal of memory space is “wasted” storing the constant values when it could be used for something else. The disadvantage of the prior art is that more routing lines (wires) are required to get the broadcast bits to the computational units. Using this method of broadcast will make the memory bus marginally busier. Secondly, if the constants are all saved in each computational unit's (CU's) memory space, then a great deal of memory space will need to be reserved for constant operations and will be unavailable for processing. Finally, performance will not be as good as having the full constant word broadcast to each computational unit; however, compared to a bitwise constant broadcast, there will still be a significant improvement in performance.
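The trade-offs discussed above can be made concrete with a rough model. The following sketch compares a single-bit serial broadcast of an 8-bit constant with a full-word broadcast, and estimates the memory consumed when every computational unit keeps its own copy of the constants. All function names are hypothetical, the least-significant-bit-first ordering is an assumption, and the cycle counts are idealized (one bit or one word per cycle) rather than taken from any actual design.

```python
WORD_BITS = 8  # width of each computational unit, per the 8-bit example in the text


def serial_broadcast(constant: int, num_units: int):
    """Load an 8-bit constant into every unit one bit per cycle over a
    single broadcast wire (least significant bit first, by assumption).

    Returns (registers, cycles).
    """
    registers = [0] * num_units
    cycles = 0
    for bit_index in range(WORD_BITS):
        bit = (constant >> bit_index) & 1
        # Every unit sees the same bit on the shared single-bit line.
        for unit in range(num_units):
            registers[unit] |= bit << bit_index
        cycles += 1
    return registers, cycles


def word_broadcast(constant: int, num_units: int):
    """Load the full constant word into every unit in one idealized cycle,
    which requires WORD_BITS broadcast wires instead of one."""
    return [constant & ((1 << WORD_BITS) - 1)] * num_units, 1


def local_constant_storage_bits(num_constants: int, num_units: int) -> int:
    """Total memory bits consumed if every unit stores its own copy of
    every constant instead of receiving it by broadcast."""
    return num_constants * WORD_BITS * num_units


serial_regs, serial_cycles = serial_broadcast(0xA5, num_units=4)
word_regs, word_cycles = word_broadcast(0xA5, num_units=4)
storage = local_constant_storage_bits(num_constants=16, num_units=64)
```

Under these assumptions, the serial scheme needs one wire and eight cycles per constant, the full-word scheme needs eight wires and one cycle, and local storage needs no broadcast wires but duplicates every constant in every unit.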
The present invention was created in order to reduce routing congestion in the design and at the same time increase performance. Reusing the memory buses eliminates the need for a dedicated broadcast bus connected to the computational units.