1. Technical Field
The present application relates generally to an improved data processing system and method. More specifically, the present application is directed to a system and method for reducing store latency in symmetrical multiprocessor systems.
2. Description of Related Art
In symmetrical multiprocessing (SMP) systems, there are three basic components: the processing units with their caches, input/output (I/O) devices with their direct memory access (DMA) engines, and a distributed system memory. The processing units execute instructions while the I/O devices handle the physical transmission of data to and from memory using their DMA engines. The processing units also control the I/O devices by issuing commands from an instruction stream. The distributed system memory stores data for use by these other components.
As technology advances, SMP systems use a greater number of processing units and have increased system memory sizes. As a result, the modern SMP system utilizes a plurality of separate integrated circuit (IC) chips to provide these resources. These separate IC chips need to be able to communicate with each other in order to transfer data between all the components in the SMP system. Moreover, in order to keep the processing units' caches coherent, each IC chip in the SMP system needs to be able to see each command issued by processing units of each of the other IC chips.
The processing units' caches keep copies of data from system memory in order to allow the processing unit fast access to the data. A coherent architecture allows caches to have shared copies of data. Alternatively, the coherent architecture allows caches to have exclusive copies of data so that the corresponding processing unit can update the data. With exclusive copies of data, the data in the processing unit's cache is the most up to date version of the data since that processing unit is the only one permitted to modify the data. In order to keep each of the processing units' caches valid, each command in the SMP system has to be seen by each IC chip so that out of date copies of data can be invalidated and not used for future processing. Eventually, the modified copy of data in a processor's cache will be written back to system memory and the entire process can start over again.
In order to simplify the design of the various components, all commands are sent to an arbiter which ensures that no two commands to the same address are permitted to be active and access that address at the same time. If the architecture allowed two commands to the same address to be active in the SMP system, each component of the SMP system would have to keep track of every address it had acknowledged and compare those addresses against each new address to determine whether it was already in the middle of a transfer for that address. If the component was in the middle of a transfer, the second command would need to be retried so that it completes only after the current transfer has finished. Moreover, if two or more processing units were attempting to obtain exclusive access to a cache line, the processing units may "fight" for ownership, thereby reducing system performance. By having the arbiter ensure that no two commands to the same address are active at the same time, the logic needed in each system component is reduced.
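The arbiter's serialization rule described above can be sketched as follows. This is an illustrative model only, not the actual arbiter implementation; the class and method names, and the assumed 128-byte cache-line size, are hypothetical.

```python
# Hypothetical sketch of the arbiter's rule that no two commands to the
# same cache-line address may be active at once. Names and the line size
# are illustrative assumptions, not taken from the described architecture.

CACHE_LINE_SIZE = 128  # assumed line size; real systems vary

class Arbiter:
    def __init__(self):
        # cache-line indices that currently have an active command
        self.active_lines = set()

    def try_issue(self, address):
        """Return True if the command may proceed, False if it must be retried."""
        line = address // CACHE_LINE_SIZE
        if line in self.active_lines:
            return False  # another command to this line is active: retry later
        self.active_lines.add(line)
        return True

    def complete(self, address):
        """Called when a command's transfer finishes, freeing the line."""
        self.active_lines.discard(address // CACHE_LINE_SIZE)
```

Because admission is decided per cache line rather than per byte address, two commands to different addresses within the same line are still serialized, which matches the ownership granularity of the caches.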
FIG. 1 is an exemplary diagram illustrating a known architecture in which the arbiter is provided as a separate IC chip in the SMP system. As shown in FIG. 1, the SMP system 100 includes four IC chips 110, 112, 114, and 116. Each IC chip 110-116 contains one or more processing units (PUs) 120-127, a corresponding L2 cache 130-136, a local memory 140-144, and an input/output (I/O) unit 150-156. In this architecture, a separate IC chip 160 is provided which performs the arbiter operations. This separate IC chip 160 is connected to each of the four IC chips 110-116 using unique data wires.
Command information flows between the arbiter IC chip 160 and the IC chips 110-116, as shown diagrammatically in FIG. 1. That is, each individual IC chip 110-116 may individually communicate directly with the arbiter IC chip 160. Moreover, each IC chip 110-116 communicates with its two neighboring IC chips in a ring fashion. For example, IC chip 110 may send commands/data to IC chip 112, IC chip 112 may send commands/data to IC chip 114, IC chip 114 may send commands/data to IC chip 116, and IC chip 116 may send commands/data to IC chip 110.
When a new command is issued by a PU of an IC chip 110-116, the IC chip 110-116 will forward the command to the arbiter IC chip 160 which performs arbitration functions for the SMP system 100. When the arbiter IC chip 160 determines it is time for the command to be sent, it forwards the command to each IC chip 110-116 which in turn each forward the command to their internal PUs. Each PU responds to the command to indicate it has seen the command and to inform the arbiter IC chip 160 as to whether the PU is too busy to process the command and the command should be retried, whether the PU has ownership of the portion of data corresponding to the command and the command must be retried, or whether the command is okay to go forward. These responses, i.e. partial responses, are sent back to the arbiter IC chip 160. The arbiter IC chip 160 then combines the partial responses and builds a combined response that is sent to each of the four IC chips 110-116. Once each PU on each IC chip 110-116 has seen the combined response and the combined response is determined to be "good" (i.e. not retried), the data may be moved to the cache of the destination IC chip 110-116. In addition, the IC chip of the PU issuing the command, and all cache states of the IC chips 110-116, may be updated.
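The combining step described above can be sketched as a simple reduction over the partial responses. This is a hedged illustration of the rule stated in the text (any objection forces a retry, otherwise the command is "good"); the enum and function names are assumptions, not part of the described system.

```python
# Illustrative sketch of combining per-PU partial responses into one
# combined response. Names are hypothetical; the rule follows the text:
# any retry-type partial response forces a "retry" combined response.
from enum import Enum

class Partial(Enum):
    ACK = "ack"            # PU saw the command and has no objection
    RETRY_BUSY = "busy"    # PU too busy to process; command must be retried
    RETRY_OWNED = "owned"  # PU owns the data; command must be retried

def combine(partials):
    """Build the combined response: 'retry' if any PU objected, else 'good'."""
    if any(p is not Partial.ACK for p in partials):
        return "retry"
    return "good"
```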
One problem in these multiple node SMP systems is that a first node may need data that is stored in a second node's memory or cache and the first node may not have any idea where the necessary data is located. Therefore, there must be a method of communication between the nodes in the SMP system. The arbiter controls the communication between the nodes in this manner.
FIG. 2 is an exemplary diagram illustrating a conventional example of a cache miss or direct memory access (DMA) operation through a four node SMP system, such as that shown in FIG. 1 above, in accordance with a known architecture. As shown in FIG. 2, in order to modify data content within a cache line of one of the local caches 230-236 of one of the nodes 210-216, a cache controller of a node 210-216 needs to first get ownership of the cache line before the data modification can occur. The requirement to obtain ownership of the cache line is a technique for ensuring that only one process may manipulate data in a cache line at one time. As a result, the integrity of the data in the shared cache is maintained.
Typically, there are five steps, or command phases, to modify data in a “shared” cache line, i.e. a cache line that stores data that is currently located in more than one local cache in the SMP system. These five steps or command phases will now be described in detail.
The first phase is an initial ownership request (referred to as a “Dclaim”) which results from a cache hit to a “shared” cache line in the requesting node, for example. The Dclaim is sent to the bus arbiter 260, which handles the system bus operations. The Dclaim is sent with a transaction tag which is a unique code identifying the transaction.
The second phase is a reflected command, wherein the arbiter broadcasts the request to bus agents (not shown) of all nodes 210-216 in the SMP system. The reflected command is produced by the bus arbiter 260 and includes the transaction tag of the Dclaim.
The third phase involves the bus agents 270-276 of the nodes 210-216 “snooping” the reflected command, checking their associated local caches 230-236 and system memories 240-246 for the requested data, and providing a snoop reply with the requestor's transaction tag. The snoop replies specify the results of searching the caches 230-236 and system memory 240-246 of the nodes 210-216.
The fourth phase involves the bus arbiter 260 receiving the snoop replies, also referred to herein as partial responses, from the nodes 210-216 in the SMP system and generating a combined result of all the snoop replies. The bus arbiter 260 combines all the snoop replies from the bus agents 270-276 and broadcasts a combined response back to all of the bus agents 270-276 with the requestor's transaction tag. This combined response informs the nodes 210-216 how to proceed with the original ownership request.
The fifth phase is the data transfer phase. The node with the data, e.g., node1 212, is able to send the data to the requesting node, e.g., node0 210, using information from the original reflected command and the combined response.
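The five phases above can be sketched end-to-end for a single Dclaim. This is a hedged, simplified model: the function and reply names are illustrative assumptions, and the snoop behavior of each node is passed in as a callback standing in for the bus agents.

```python
# Hedged sketch of the five command phases for one Dclaim. `snoop_fns`
# maps a node id to a stand-in snoop function returning 'null', 'retry',
# or 'intervention' for the reflected command. All names are assumptions.

def run_dclaim(requester, snoop_fns):
    """Return (combined_response, data_source_node_or_None)."""
    tag = ("Dclaim", requester)          # phase 1: request with transaction tag
    # phases 2-3: reflect the command and collect each node's snoop reply
    replies = {node: snoop(tag) for node, snoop in snoop_fns.items()}
    # phase 4: combine the replies; any busy node forces a system-wide retry
    if "retry" in replies.values():
        return ("retry", None)
    # phase 5: an intervening node (if any) supplies the data
    sources = [n for n, r in replies.items() if r == "intervention"]
    return ("good", sources[0] if sources else None)
```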
For example, assume that node0 210 has a store command which hits a “shared” cache line in the cache 230 of node0 210. In accordance with the known architecture and methodology, node0 210 sends an initial ownership request (1), i.e. a Dclaim, to the bus arbiter 260 with the memory address range of the requested data and a transaction tag. The bus arbiter 260 sends out a reflected command (2) to the nodes 210-216. Each of nodes 210-216 then snoop (search) their caches 230-236 and system memory 240-246 for the requested data corresponding to the requested memory address range.
After the nodes 210-216 have snooped their caches 230-236 and system memory 240-246, they send out a snoop reply (3). In the depicted example, node0 210 may send a snoop reply (3) that indicates a null response because it is the requesting node and does not have the requested data, as determined by the requested address range. Likewise, node1 212 may send a snoop reply (3) that indicates a null response because it also does not have the requested data.
Node2 214 is busy and cannot snoop its cache 234. Thus, node2 214 sends a snoop reply (3) with a retry being identified, e.g., through setting a retry bit, meaning that the original ownership request needs to be resent at a later time.
Node3 216 has the accurate, updated data and sends a snoop reply (3) with intervention identified, such as by setting an intervention bit. The intervention bit signifies that node3 216 has the most up-to-date data for the requested address range. Node3 216 may know whether or not it has the most up-to-date data for the requested address range based on a setting of a cache state identifier that indicates the status of the data. The cache state identifier may indicate whether the data is modified, invalid, exclusive, etc.
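The mapping from a line's cache state identifier to the snoop reply a node sends might be sketched as below. This is an assumed, MESI-style illustration of the states named in the text (modified, invalid, exclusive, etc.); the actual protocol and state encoding are not specified by the source.

```python
# Illustrative, MESI-style sketch (an assumption, not the source's exact
# protocol) of how a node's cache state for the requested line could
# determine its snoop reply to a reflected command.
from enum import Enum

class CacheState(Enum):
    MODIFIED = "M"   # this cache holds the only, most up-to-date copy
    EXCLUSIVE = "E"  # sole unmodified copy
    SHARED = "S"     # copy also held elsewhere
    INVALID = "I"    # no valid copy

def snoop_reply(state, busy):
    """Return 'retry', 'intervention', or 'null' for a reflected command."""
    if busy:
        return "retry"  # cannot snoop now; the request must be resent
    if state in (CacheState.MODIFIED, CacheState.EXCLUSIVE):
        return "intervention"  # this node has the authoritative copy
    return "null"
```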
The bus arbiter 260 collects the snoop replies (3) from all of the nodes 210-216. The arbiter 260 sees that a retry bit has been set and orders a combined response of "retry" (4), which indicates that the request must start over because one node 214 was busy and unable to snoop its cache 234. When node0 210 sees a "retry" combined response (4), it sends its original ownership request out to the bus again and the process starts over.
Inefficiencies are present in the known architecture due to processing multiple ownership requests for the same shared cache line. The arbiter operates to resolve multiple requests for the same cache line (which may or may not be multiple requests for the same address range, since the address ranges specified are typically smaller than an entire cache line) such that only one ownership request becomes the "winner" and the other ownership requests become "losers" that must be retried, i.e. the ownership request must be reissued by the requesting node. The "winner" sends out another request, i.e. a Kill request, to invalidate the cache line in the other caches of the other nodes, which starts from the first phase mentioned above. This Kill request needs to be honored, by operation of the second through fourth phases discussed above, before the data modification may be performed.
The "losers" will keep repeating the first through fourth phases discussed above, reissuing the ownership request until the winner's Kill request is completed and all other copies of the cache line are invalid. The losers will then change the ownership request type to a "Read With Intent To Modify" (RWITM) which starts again from the first phase and proceeds through to the fifth phase.
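A loser's behavior, as described above, amounts to a retry loop that switches request types once its own copy has been invalidated. The following is a hedged sketch of that sequence; `issue` stands in for the full five-phase bus operation, and all names are illustrative assumptions.

```python
# Hedged sketch of a losing node's sequence: reissue the Dclaim while its
# shared copy is still valid, then switch to RWITM once the winner's Kill
# has invalidated that copy. `issue(cmd)` returns 'retry' or 'good' and
# stands in for the five-phase bus sequence; names are assumptions.

def loser_sequence(issue, line_valid):
    """Return the list of request types issued, ending with a good RWITM."""
    attempts = []
    while True:
        if not line_valid():
            # The winner's Kill invalidated our copy: a Dclaim no longer
            # suffices, so the data itself must be requested with RWITM.
            attempts.append("RWITM")
            if issue("RWITM") == "good":
                return attempts
        else:
            attempts.append("Dclaim")
            issue("Dclaim")  # retried while the winner's request is active
```

The repeated round trips in this loop, one full five-phase sequence per retry, are the latency source the following sections address.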
These operations associated with the Kill request take a considerable amount of time to get resolved, especially in large symmetrical multiprocessor systems. As a result, these operations affect the overall system performance. Thus, it would be beneficial to have a protocol that can more efficiently resolve multiple requests to modify shared data in a multiprocessor system.