Symmetric multiprocessing (SMP) systems employ many parallel-operating central processing units (CPUs) which independently perform tasks under the direction of a single operating system. One type of SMP system is based upon a plurality of CPUs employing high-bandwidth point-to-point links (rather than a conventional shared-bus architecture) to provide direct connectivity between the CPUs and input/output (I/O) devices, memory units and/or other CPUs.
When tasks of a running application are being performed by a plurality of the SMP CPUs, individual CPUs may perform an operation that determines a value of information. The information may be stored temporarily in the CPU's cache memory or the like, which is not accessible to other CPUs, or stored back into a memory unit to which the CPU is attached.
In one exemplary situation, the “owning” CPU is the CPU that is responsible for the most currently available information. For example, during execution of an operation, a first CPU may retrieve a value of information from a memory unit and become the owner of the information. The first CPU (now the owning CPU) may then determine a new value of the information, and write the new information to its cache memory. When other CPUs perform related operations that use the same information, it is important that those CPUs use the most current information that is available. However, the most current information may not necessarily be available to the other CPUs unless the owning CPU has stored the new information into a commonly-accessible memory unit. If the information has been determined and cached by the owning CPU, but not yet transferred to the memory unit, other CPUs must discover and retrieve the information stored in the cache memory of the owning CPU if their operation is to use the most currently available value of the information.
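By way of illustration only, the situation described above, in which the owning CPU holds a newer value in its cache than the value held in the memory unit, may be sketched as follows. The class name OwningCPU, its methods and the address 0x40 are hypothetical and are chosen solely for this sketch; they do not describe any particular hardware.

```python
class OwningCPU:
    """Hypothetical model of a CPU with a private write-back cache."""

    def __init__(self, memory):
        self.memory = memory  # commonly-accessible memory unit (dict: addr -> value)
        self.cache = {}       # private cache, not visible to other CPUs

    def load(self, addr):
        # Retrieve the value from the memory unit and keep a cached copy.
        self.cache[addr] = self.memory[addr]
        return self.cache[addr]

    def store(self, addr, value):
        # The new value is written to the cache only; memory is now stale.
        self.cache[addr] = value

    def write_back(self, addr):
        # Transfer the cached value to the commonly-accessible memory unit.
        self.memory[addr] = self.cache[addr]


memory = {0x40: 1}
cpu = OwningCPU(memory)
cpu.store(0x40, cpu.load(0x40) + 1)
stale = memory[0x40]      # other CPUs reading memory would still see the old value
cpu.write_back(0x40)
current = memory[0x40]    # the new value is now commonly accessible
```

Until write_back is called in this sketch, a CPU that reads only the memory unit would operate on the old value, which is the condition that snoop operations are intended to address.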
Snoop operations are employed by CPUs to “snoop out” the most currently available information that may be residing, for example, in another CPU's cache memory. Accordingly, if another CPU is operating on the information and saving the information into its cache, the CPU performing the snoop recognizes that the most current information that is available resides in the cache of the other CPU that currently owns the information. The CPU performing the snoop then retrieves the new value of the information residing in the cache of the owning CPU. Accordingly, the CPU performs its task using the most current available information. In some situations, the CPU performing the task may modify and save the information, thereby becoming the “owning” CPU.
In other situations, a CPU owns a copy of the information without having modified the data. The information has access rights that instruct the CPU what can be done with the data. In some cases the access rights are “shared” or “read-only” meaning that the CPU that owns the information cannot modify the information. To modify the information would first require obtaining the right to modify from the directory. It is typical that this type of data is owned by several different CPUs. In that case, before the information is modified, each copy of the data must be recalled or invalidated before one CPU is allowed to modify the data. In other cases, the access rights are “exclusive” or “private” meaning that the CPU that owns the information has the only copy of the information and has the right to modify the data without alerting the system. In all of these cases, the directory is responsible for tracking which of the CPUs may own a piece of information and the access rights to the piece of information so that coherency may be maintained. The access rights are well understood by those skilled in the art as the “MESI” protocol (Modified-Exclusive-Shared-Invalid).
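For illustration, the tracking of access rights by a directory may be sketched as follows, assuming a much-simplified model in which a single directory records the MESI state held by each CPU for each piece of information. The names Access, Directory, read, request_modify and state are hypothetical, and write-back of modified data is omitted.

```python
from enum import Enum


class Access(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


class Directory:
    """Hypothetical directory: per address, which CPUs hold a copy and with what rights."""

    def __init__(self):
        self.holders = {}  # addr -> {cpu name: Access}

    def read(self, cpu, addr):
        """Grant a copy: exclusive if no other CPU holds one, shared otherwise."""
        line = self.holders.setdefault(addr, {})
        if cpu in line:
            return line[cpu]  # the CPU already holds a copy; rights are unchanged
        if not line:
            line[cpu] = Access.EXCLUSIVE
        else:
            # Existing holders are downgraded to shared (a real system would
            # first write back any modified copy; that step is omitted here).
            for other in line:
                line[other] = Access.SHARED
            line[cpu] = Access.SHARED
        return line[cpu]

    def request_modify(self, cpu, addr):
        """Invalidate every other copy, then grant the right to modify.

        Returns the sorted list of CPUs whose copies were invalidated.
        """
        line = self.holders.setdefault(addr, {})
        invalidated = sorted(c for c in line if c != cpu)
        for c in invalidated:
            del line[c]
        line[cpu] = Access.MODIFIED
        return invalidated

    def state(self, cpu, addr):
        """A CPU with no recorded copy is reported as holding the line invalid."""
        return self.holders.get(addr, {}).get(cpu, Access.INVALID)
```

In this sketch, a CPU holding a shared copy cannot simply write; it calls request_modify, and the directory removes every other copy before the modified state is granted, mirroring the recall or invalidation step described above.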
When a plurality of SMP CPUs coordinate information in the CPU caches or other memory units, such that tasks are performed with the most currently available information, the SMP system is said to be operating in a coherent manner (coherent operation). However, when many CPUs are simultaneously performing operations in an SMP system, overall operating speed of the system is degraded, since time is required for the various CPUs to perform their snooping operations, and since time is required for the subsequent retrieval of current information from remote caches and/or memories when necessary.
FIG. 1 is an illustrative exemplary prior art SMP system employing a plurality of CPUs and input/output (I/O) devices. During fabrication, clusters of CPUs may be fabricated onto a single die for convenience and efficiency. The exemplary CPU clusters A and B each have four CPUs for illustrative purposes.
CPU cluster A has four CPUs (A-1 through A-4). Similarly, CPU cluster B has four CPUs (B-1 through B-4). Each CPU has its own cache. The CPUs (A-1 through A-4, and B-1 through B-4) employ high-bandwidth, point-to-point links 106 to couple to the other CPUs of the cluster, to directories (DIR), to memory units (illustrated as dual in-line memory modules, DIMMs), and/or I/O devices (not shown).
In the simplified exemplary SMP system 102 of FIG. 1, two CPU clusters A and B are communicatively coupled together via crossbar system 104. During the fabrication process, the CPU clusters, DIMM(s) and directories may be installed on a common board. A plurality of such modular boards may be installed in a chassis, and coupled to the crossbar system 104 to facilitate communications among the various components.
As an individual CPU performs an operation that determines a new value of information, it stores a working version of the determined new information into its cache. The owning CPU, at some point during the operation, may store the determined new information into its respective DIMM, or into another DIMM, depending upon the circumstances of the operation being performed by the CPU. Accordingly, CPU A-1 may store information directly into its cache or DIMMs A1-1 through A1-i. Other CPUs, similarly illustrated, have their own caches and are also coupled to their own DIMMs. For example, CPU B-3 may store information into its cache and/or into DIMMs B3-1 through B3-i. 
The CPUs are coupled to the external directories (DIR), via the CPU's high-bandwidth, point-to-point links 106. The directories are memory-based devices that are responsible for tracking information that is cached by CPUs in other CPU clusters. For example, DIR A-3 tracks information in DIMMs associated with the CPUs of CPU cluster A that has been exported from CPU cluster A via DIR A-3. When a CPU sends out a snoop request, the snoop request is sent to its directory and to the other CPUs of the cluster. When the most currently available information resides outside of the CPU cluster, the directory more quickly coordinates the determination of where the most current information is stored.
The directories are coupled together through crossbar system 104, via connections 112. Crossbar system 104 is a plurality of individual crossbars 110 coupled to each other, via connections 114, to facilitate direct device-to-device communications over the network fabric of the crossbar system 104.
For example, CPU A-3 may be ready to perform an operation using a particular value of information. However, before performing the operation, CPU A-3 would issue a snoop request to determine if a more current value of the information is available from the cache of a CPU in its cluster, from the cache of a CPU in a remote cluster, or in a remote DIMM. For this simplified illustrative example, assume that CPU B-3 (the current “owning” CPU) has just completed an operation on the information of interest, and has stored the new value of the information in its cache. Coherent operation would require that CPU A-3 operate on the most currently available information, here residing in the cache of CPU B-3. The snoop request from CPU A-3 eventually indicates that the most currently available information resides in the cache of CPU B-3. The newest value of the information is then retrieved and communicated to CPU A-3.
Furthermore, the DIR A-3 notes that CPU A-3 has most recently retrieved the information for further processing, and that other CPUs that later use this information should get the new value of information from CPU A-3 (depending upon the location of the information, which may be residing in its cache or one of its DIMMs).
The process of snooping and retrieving the most current information involves communicating a snoop from CPU A-3 to the other CPUs in the cluster and to the directories to which it is coupled, here at least DIR A-3. DIR A-3, which is keeping track of the state of information of CPUs in cluster A, may indicate that the most current value of the information resides outside of CPU cluster A as a result of the snoop process between directories, described below.
In the simplified example above, DIR A-3 tracked that the information has previously been communicated out of CPU cluster A, via DIR A-3, to one of the CPUs in CPU cluster B. DIR A-3 sends a snoop to an appropriate subset of, or all of, the other system directories to find out where the most current value of the information is. DIR B-3, in this simplified example, responds that a CPU on its cluster has the most current information (whether it is in the cache of one of the CPUs in CPU cluster B, or whether it has been saved in one of the DIMMs coupled to CPU cluster B), and communicates the information back to DIR A-3. Then, the information received from DIR B-3 is reconciled with information from the CPUs in CPU cluster A, such that the most currently available location and/or value of the information is determined.
Furthermore, DIR A-3 notes that the information has been returned to CPU cluster A. Accordingly, if another remote CPU subsequently requires the information, upon receiving a snoop and/or request from a DIR associated with the requesting CPU, DIR A-3 knows that the most currently available information resides somewhere within CPU cluster A. DIR A-3 can then snoop the CPUs of CPU cluster A to determine the location of the information, which can then be retrieved and communicated to the requesting CPU. (Once the information is communicated to the requesting CPU, the directory associated with the requesting CPU then notes that its CPUs have the information for the next time the information is needed.)
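The example above may be sketched, for illustration only, as the following simplified model. It assumes that the requesting CPU resides in the cluster whose DIMMs hold the information, that each cluster has a single directory, and that ownership migrates to the requester upon retrieval. The names Cluster, snoop_local, exported and fetch_current are hypothetical, as is the address 0x40.

```python
class Cluster:
    """Hypothetical CPU cluster with per-CPU caches and one directory record."""

    def __init__(self, name, cpu_names):
        self.name = name
        self.caches = {cpu: {} for cpu in cpu_names}  # cpu -> {addr: value}
        self.exported = set()  # addresses the directory saw leave this cluster

    def snoop_local(self, addr):
        """Return (cpu, value) for a CPU in this cluster caching addr, else None."""
        for cpu, cache in self.caches.items():
            if addr in cache:
                return cpu, cache[addr]
        return None


def fetch_current(clusters, memory, requester, addr):
    """Find the most current value of addr for requester = (cluster, cpu).

    Returns (value, trace), where trace lists the steps that were taken.
    """
    req_cluster, req_cpu = requester
    local = clusters[req_cluster]
    trace = [f"{req_cpu} snoops cluster {req_cluster}"]
    hit = local.snoop_local(addr)
    source = local
    if hit is None and addr in local.exported:
        # The directory recorded an export, so the other directories are snooped.
        for name, remote in clusters.items():
            if name == req_cluster:
                continue
            trace.append(f"DIR {req_cluster} snoops DIR {name}")
            hit = remote.snoop_local(addr)
            if hit is not None:
                source = remote
                break
    if hit is None:
        trace.append("read from memory")
        value = memory[addr]
    else:
        owner, value = hit
        trace.append(f"retrieved from cache of {owner}")
        del source.caches[owner][addr]  # ownership moves to the requester
    local.caches[req_cpu][addr] = value
    local.exported.discard(addr)  # directory notes the information has returned
    return value, trace


a = Cluster("A", ["A-1", "A-2", "A-3", "A-4"])
b = Cluster("B", ["B-1", "B-2", "B-3", "B-4"])
memory = {0x40: 1}
b.caches["B-3"][0x40] = 2  # CPU B-3 owns a newer value than memory
a.exported.add(0x40)       # the directory of cluster A tracked the export
value, trace = fetch_current({"A": a, "B": b}, memory, ("A", "A-3"), 0x40)
```

Under these assumptions, the trace for the requester A-3 contains a local snoop, a directory-to-directory snoop and a retrieval from the cache of CPU B-3, after which the export record for cluster A is cleared.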
The task of identifying the location of the information becomes even more complex when there are many CPU clusters in the SMP system. For example, consider an SMP system having ten or more CPU clusters. In some instances, with respect to the above-described example, other remote CPUs may later retrieve the information from CPU B-3 (before CPU A-3), thereby becoming the new owning CPU. When, as described in the example above, CPU A-3 begins the process of retrieving the information, many snoops across the entire SMP system will be required to identify the location of the most current value of the information (i.e., to determine the identity of the remote owning CPU). If the information has moved among several other CPUs, the process becomes even more complex and time consuming. Further complications may arise when different CPUs in different CPU clusters own a copy of the information with shared access rights. For a CPU to modify that information, all copies of the information must first be recalled or invalidated before new ownership is granted and the modification is performed.
CPU clusters A and B, DIMM units 108, crossbar system 104, DIR A-3 and DIR B-3 are discrete components. Also, the crossbar system 104 is itself a plurality of individual crossbars 110 coupled to each other to facilitate direct point-to-point communications. Accordingly, it is appreciated that the above-described process of snooping and then retrieving information requires a relatively significant amount of time, in part, due to the relatively large number of “hops” from one device to another.
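The number of hops may be illustrated with the following sketch, which counts the fewest device-to-device links between two components using a breadth-first search. The link list is an assumption made solely for this illustration (in particular, the crossbar names XBAR-1 and XBAR-2 and the direct coupling of each CPU to its directory are hypothetical) and is not taken from FIG. 1.

```python
from collections import deque


def hop_count(links, src, dst):
    """Return the fewest hops from src to dst over bidirectional links, or None."""
    adjacency = {}
    for u, v in links:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == dst:
            return hops
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return None  # dst is not reachable from src


# Hypothetical path between the two CPUs of the example above.
links = [
    ("CPU A-3", "DIR A-3"),
    ("DIR A-3", "XBAR-1"),
    ("XBAR-1", "XBAR-2"),
    ("XBAR-2", "DIR B-3"),
    ("DIR B-3", "CPU B-3"),
]
one_way = hop_count(links, "CPU A-3", "CPU B-3")
round_trip = 2 * one_way
```

Under this assumed topology, a snoop from CPU A-3 to CPU B-3 traverses five links in each direction, so a snoop followed by the return of the information involves ten hops, each through a discrete component.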
Furthermore, it is appreciated that the topology of the above-described SMP system 102 is very simplistic. Many different topologies for connecting components of a CPU cluster are known. For example, the topology of CPU cluster A is illustrated differently from the topology of CPU cluster B to indicate the diversity of possible CPU clusters. Also, I/O devices may be included and/or may replace CPUs of any of the clusters. SMP systems may employ many CPU clusters. Such CPU clusters may have more than, or fewer than, the four CPUs illustrated in CPU clusters A and B. Furthermore, many topologies of crossbar system 104 are possible and, accordingly, cannot be feasibly illustrated in FIG. 1 to any degree of detail. The coupling of the directories (DIR) to their respective CPUs, and the coupling to the crossbar system 104, may also vary. Accordingly, the simplified exemplary SMP system 102 of FIG. 1 is a generic SMP system that is representative of any one of the possible SMP topologies in use.
As the complexity of an SMP system increases, such as when the CPU clusters employ more CPUs, or when an SMP system employs more CPU clusters, or when more directories are used to facilitate snoop operations, more and more time will be spent for snooping operations, and the associated retrieval of cached information, such that coherency of the SMP system is maintained. Accordingly, it is desirable to improve operating time by streamlining the snooping process and the associated retrieval of cached information.