In a symmetrical multiprocessing system, there are three main components: the processing units with their caches; the input/output (I/O) devices with their direct memory access (DMA) engines; and the distributed system memory. The processing units execute instructions. The I/O devices handle the physical transmission of data to and from memory using the DMA engines. The processing units also control the I/O devices by issuing commands from an instruction stream. The distributed system memory stores data for use by these other components. As the number of processing units and the system memory size increase, the processing systems need to be housed in separate chips.
The separate chips need to be able to communicate with each other in order to transfer data between all the components in the system. Also, in order to keep the processing units' caches coherent, each device in the system needs to see each command issued. The processing units' caches keep copies of data from system memory in order to allow the processing units fast access to the data. The coherent architecture allows caches to hold shared copies of data (the data is unmodified and the same as in system memory), or exclusive copies of data so that the processing unit can update the data (the data in the cache is the most up-to-date version). In order to keep each of the processing units' caches valid, each command in the system has to be seen by each device so that out-of-date copies of data can be invalidated and not used for future processing. Eventually, the modified copy of data will be written back to system memory and the entire process can start over again.
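The shared and exclusive copies described above can be illustrated with a minimal sketch. The names `State`, `Cache`, `snoop`, and `broadcast`, and the command kinds `read_shared` and `read_exclusive`, are hypothetical and do not come from any particular system. The downgrade of an exclusive copy when another unit reads the line is an assumption of this sketch, and the write-back to system memory is not modelled.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"      # no usable copy of the line
    SHARED = "shared"        # unmodified, same as system memory
    EXCLUSIVE = "exclusive"  # this cache may update the line


class Cache:
    def __init__(self, name):
        self.name = name
        self.lines = {}  # address -> State

    def state(self, address):
        return self.lines.get(address, State.INVALID)

    def snoop(self, issuer, kind, address):
        """Observe one command issued somewhere in the system."""
        if issuer is self:
            # Our own command: take the copy we asked for.
            if kind == "read_shared":
                self.lines[address] = State.SHARED
            else:
                self.lines[address] = State.EXCLUSIVE
        elif kind == "read_exclusive":
            # Another unit will modify the line, so our copy is out of date.
            self.lines.pop(address, None)
        elif kind == "read_shared" and self.state(address) is State.EXCLUSIVE:
            # Assumption: an exclusive holder downgrades to shared.
            self.lines[address] = State.SHARED


def broadcast(caches, issuer, kind, address):
    # Every device sees every command, including the issuer.
    for cache in caches:
        cache.snoop(issuer, kind, address)
```

In this sketch, two caches that each issue `read_shared` for the same address both end up in the shared state; a later `read_exclusive` from one of them invalidates the copy held by the other.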
In order to simplify the design of the various components, all commands are sent to an address concentrator which makes sure no two commands to the same address are allowed in the system at the same time. If two commands to the same address were allowed in the system at the same time, the various components would have to keep track of each address they had acknowledged and compare it against the new address to see if they were already in the middle of a transfer for that address. If they were in the middle of a transfer, they must retry the second command so it can complete after the current transfer is completed. Also, if two or more processing units were trying to get exclusive access to a cache line, they could “fight” for the ownership and reduce system performance. By having the address concentrator ensure no two commands to the same address are active at the same time, the logic needed in each system component is reduced.
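The serialization rule described above can be sketched as follows. The class name `AddressConcentrator` and its methods `submit` and `complete` are hypothetical. Holding a conflicting command in a first-in, first-out queue is an assumption of this sketch; the text only requires that two commands to the same address are never active at once.

```python
from collections import deque


class AddressConcentrator:
    def __init__(self):
        self.active = set()     # addresses with a command in flight
        self.waiting = deque()  # held-back (command_id, address), arrival order

    def submit(self, command_id, address):
        """Return True if the command is released to the system now."""
        if address in self.active:
            # Same-address command already in flight: hold this one back.
            self.waiting.append((command_id, address))
            return False
        self.active.add(address)
        return True

    def complete(self, address):
        """Retire the active command for address; return ids released as a result."""
        self.active.discard(address)
        released = []
        still_waiting = deque()
        for command_id, addr in self.waiting:
            if addr not in self.active:
                # Oldest waiter for a free address goes next.
                self.active.add(addr)
                released.append(command_id)
            else:
                still_waiting.append((command_id, addr))
        self.waiting = still_waiting
        return released
```

Because the check lives in one place, the individual units in this model never need to compare an incoming address against their own in-flight transfers.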
Current systems implement the address concentrator as either a separate chip in the system, as seen in FIG. 1, or as a component in one of the chips, as seen in FIG. 2. Each approach has its advantages and disadvantages.
The separate chip case of FIG. 1 has the advantages that each processing chip in the system has direct access to the address concentrator (AC) and the amount of time to get to the AC is consistent from each chip. The disadvantages of the separate chip are the added cost of the extra chip in the system and the added pins on each processing chip to access the AC chip. Also, the single AC must be able to keep up with four chips' worth of commands, so the processing speed requirements of the AC chip are increased.
In FIG. 1, the system includes four chips 10a, 10b, 10c, and 10d, each of which contains one or more processors 12. In this configuration, a separate chip 14 is provided which performs the AC function. The separate chip 14 is connected to each of the chips 10a-10d using unique data wires, and command information flows between the chip 14 and the chips 10a-10d as shown diagrammatically. When a new command is issued, the processor chip will forward the command to the AC chip 14 and this chip will perform the address concentration function for the system. When the AC function determines it is time for the command to be sent, it will forward the command to each chip 10a-10d and each will send this command to all the internal units. Each unit will respond to the command, and the partial responses will be sent back to the AC chip 14. This AC chip will then combine all partial responses, build a combined response and send this to each of the four chips 10a-10d. Once each unit on each chip has seen the combined response, the data can be moved from the source to the destination and all cache states can be updated. All commands must flow through AC chip 14 and, therefore, the AC chip has to be designed to keep up with four chips' worth of commands.
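The command flow of FIG. 1 (reflected command, partial responses, combined response) can be sketched as below. The names `Chip`, `combine`, and `run_command` are hypothetical, as are the response values `"retry"`, `"hit"`, and `"miss"` and the priority rule used to merge them; the text does not specify how partial responses are encoded or combined.

```python
class Chip:
    def __init__(self, name, units):
        self.name = name
        # Hypothetical: each unit is a callable mapping a command
        # to a partial response string.
        self.units = units

    def reflect(self, command):
        # Send the reflected command to every internal unit, collect replies.
        return [unit(command) for unit in self.units]


def combine(partials):
    # Assumed priority rule: any "retry" wins, then any "hit", else "miss".
    for outcome in ("retry", "hit"):
        if outcome in partials:
            return outcome
    return "miss"


def run_command(chips, command):
    """Model the AC chip: reflect, gather partial responses, send combined."""
    partials = []
    for chip in chips:
        partials.extend(chip.reflect(command))
    combined = combine(partials)
    # Every chip receives the same combined response.
    return {chip.name: combined for chip in chips}
```

Every command in this model passes through `run_command`, which mirrors the point that the single AC has to keep up with the commands of all four chips.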
Placing a single address concentrator in one of the processing chips, as in FIG. 2, has the advantage of reduced system cost because neither a separate chip nor additional pins are needed (access to the AC uses existing connections between chips). The disadvantages of the single AC are the burden of keeping pace with four command streams and the added processing speed needed. Also, the delay to the AC varies with each chip in the system. P0 has direct access to the AC, but P1 is one chip hop away and P2 is two chip hops away. Each chip hop can take several clocks to complete the transfer. Therefore, depending on where the AC is located and where the commands originate, the time to service the commands is not consistent and varies across the system.
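The variation in delay can be made concrete with a small calculation. The function names, the `clocks_per_hop` parameter, and the `bidirectional` flag are all hypothetical; the text states only that each hop takes several clocks and does not say whether commands may travel in both directions around the ring, so both cases are shown.

```python
def hops_to_ac(chip, ac_chip=0, num_chips=4, bidirectional=True):
    """Number of chip hops from `chip` to the chip holding the AC."""
    one_way = (chip - ac_chip) % num_chips
    if not bidirectional:
        # Assumption: commands travel in one direction only.
        return one_way
    # Assumption: commands may take the shorter way around the ring.
    return min(one_way, num_chips - one_way)


def command_latency(chip, clocks_per_hop, **ring):
    """Clocks spent just reaching the AC (hypothetical cost per hop)."""
    return hops_to_ac(chip, **ring) * clocks_per_hop
```

With the AC on chip 0 of a four-chip system, chip 0 needs no hops, chip 1 needs one, and chip 2 needs two under either assumption, which matches the P0, P1, and P2 cases in the text.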
In FIG. 2, the system includes four chips 10a, 10b, 10c, and 10d. Each chip contains one or more processors 12. Chip 10a also contains a logic block 14a constituting the address concentrator function for the system. The four chips are connected in a ring fashion, and both commands and data travel on the ring buses. In this example, chip 10b sends the command 20 to the AC function in chip 10a. When the AC function determines that the command can be sent to the system, it will send a reflected command to each chip. This is accomplished by sending the reflected command 22 to chip 10b, which will send it to each of its internal units and also forward the reflected command 22 to chip 10c. Chip 10c will do likewise, sending the reflected command to all its internal units and also forwarding the reflected command 22 to chip 10d. Chip 10d will send the reflected command to all its internal units, but will not need to forward the reflected command to chip 10a because that is where the reflected command started. Each chip will gather its partial response 24 and forward it to chip 10a. The AC function in chip 10a will build the combined response 26 and send it to each chip in the system. When all units on each chip have seen the combined response, the data source can send the data to the destination and all cache states can be updated. Again, all commands must flow through this AC function in chip 10a, and it must be able to keep up with four chips' worth of commands. Also, depending on where the command originates, the latency to resolve the command varies because of the time it takes to get the command to the AC function in chip 10a.
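The forwarding of the reflected command around the ring of FIG. 2 can be sketched as a list of chip-to-chip hops. The function name `reflected_command_hops` is hypothetical, and the sketch assumes the ring order follows the order of the list passed in.

```python
def reflected_command_hops(chips, ac_index=0):
    """Return the (sender, receiver) hops a reflected command makes.

    The command starts at the chip holding the AC and is passed from
    chip to chip; the last chip does not forward it back to the start.
    """
    n = len(chips)
    hops = []
    current = ac_index
    for _ in range(n - 1):
        nxt = (current + 1) % n
        hops.append((chips[current], chips[nxt]))
        current = nxt
    return hops
```

For the four chips of FIG. 2 with the AC in chip 10a, this yields three hops: 10a to 10b, 10b to 10c, and 10c to 10d.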