The present invention relates to the field of cache coherency in a multiprocessor environment, and more particularly to a multiprocessor system supporting issuing and receiving requests of multiple coherency granules.
A multiprocessor system may comprise multiple processors coupled to a common shared system memory. Each processor may comprise one or more levels of cache memory (cache memory subsystem). The multiprocessor system may further comprise a system bus coupling the processing elements to each other and to the system memory. A cache memory subsystem may refer to one or more levels of a relatively small, high-speed memory that is associated with a particular processor and stores a copy of information from one or more portions of the system memory. The cache memory subsystem is physically distinct from the system memory.
A given cache memory subsystem may be organized as a collection of spatially mapped, fixed size storage region pools commonly referred to as "sets." Each of these storage region pools typically comprises one or more storage regions of fixed granularity. These storage regions may be freely associated with any equally granular storage region (storage granule) in the system as long as the storage region spatially maps to the set containing the storage region pool. The position of the storage region within the pool may be referred to as the "way." The intersection of each set and way contains a cache line. The size of the storage granule may be referred to as the "cache line size." A unique tag may be derived from an address of a given storage granule to indicate its residency in a given set/way position.
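As an illustration of how a set index and a tag might be derived from an address, the following Python sketch splits a byte address into a tag, a set index, and an offset within the cache line. The function name split_address and the example parameters (32-byte cache lines, 128 sets) are hypothetical and are not taken from this description.

```python
def split_address(addr, line_size=32, num_sets=128):
    """Split a byte address into (tag, set_index, offset).

    line_size and num_sets are illustrative values chosen for this sketch.
    """
    offset = addr % line_size            # byte position within the cache line
    line_number = addr // line_size      # which storage granule the address falls in
    set_index = line_number % num_sets   # the set the granule spatially maps to
    tag = line_number // num_sets        # identifies the granule among those mapping to the set
    return tag, set_index, offset


def join_address(tag, set_index, offset, line_size=32, num_sets=128):
    """Inverse of split_address, useful for checking the decomposition."""
    return (tag * num_sets + set_index) * line_size + offset
```

Under these assumed parameters, address 0x12345 falls at byte 5 of the granule that maps to set 26 with tag 18. Which way within the set holds the line is a placement decision that the address alone does not determine.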
When a processor generates a read request and the requested data resides in its cache memory subsystem, e.g., L1 cache, then a cache read hit takes place. The processor may then obtain the data from the cache memory subsystem without having to access the system memory. If the data is not in the cache memory subsystem, then a cache read miss occurs. The memory request may be forwarded to the system and the data may subsequently be retrieved from the system memory as would normally be done if the cache did not exist. On a cache miss, the data that is retrieved from the system memory may be provided to the processor and may also be written into the cache memory subsystem due to the statistical likelihood that this data will be requested again by that processor. Likewise, if a processor generates a write request, the write data may be written to the cache memory subsystem without having to access the system memory over the system bus.
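The hit and miss behavior described above can be sketched in a few lines of Python. The class name SimpleCache, its fields, and the use of dictionaries to stand in for the cache and the system memory are assumptions made for illustration only; set/way organization and replacement are deliberately not modeled.

```python
class SimpleCache:
    """Toy cache in front of a backing memory (dict: address -> value)."""

    def __init__(self, memory):
        self.memory = memory   # stands in for the system memory
        self.lines = {}        # copies held by this cache
        self.hits = 0
        self.misses = 0

    def read(self, addr):
        if addr in self.lines:         # cache read hit: no system memory access
            self.hits += 1
            return self.lines[addr]
        self.misses += 1               # cache read miss: fetch from system memory
        value = self.memory[addr]
        self.lines[addr] = value       # keep a copy, since it is likely to be read again
        return value

    def write(self, addr, value):
        # The write is absorbed by the cache; system memory is not accessed,
        # so the cached copy and the memory copy may now differ.
        self.lines[addr] = value
```

After a write, the cache holds a value that system memory does not, which is the situation that makes coherency maintenance necessary.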
Hence, data may be stored in multiple locations, e.g., the cache memory subsystem of a particular processor as well as system memory. If another processor altered the contents of a system memory location that is duplicated in a first processor's cache memory subsystem, the cache memory subsystem may be said to hold "stale" or invalid data. Problems may result if the first processor inadvertently referenced this stale data on a subsequent read. Therefore, it may be desirable to ensure that data is consistent between the system memory and caches. This may commonly be referred to as "maintaining cache coherency." In order to maintain cache coherency, it may be necessary to monitor the system bus when the processor does not control the bus to see if another processor accesses system memory. This method of monitoring the bus is referred to in the art as "snooping."
Each processor's cache memory subsystem may comprise a snooping logic unit configured to monitor the bus for the addresses requested by other processors. Each snooping logic unit may further be configured to determine if a copy of an address requested by another processor is within the cache memory subsystem associated with the snooping logic unit. The snooping logic unit may make this determination using a protocol commonly referred to as Modified, Exclusive, Shared and Invalid (MESI). In the MESI protocol, an indication of a coherency state is stored in association with each unit of storage in the cache memory subsystem. This unit of storage is referred to as a coherency granule and is typically the size of a cache line. Each coherency granule may have one of four states, modified (M), exclusive (E), shared (S), or invalid (I), which may be indicated by two or more bits in the cache directory. The modified state may indicate that a coherency granule is valid only in the cache memory subsystem containing the modified or updated coherency granule and that the value of the updated coherency granule has not been written to system memory. When a coherency granule is indicated as exclusive, the coherency granule is resident in only the cache memory subsystem having the coherency granule in the exclusive state. However, the data in the exclusive state is consistent with system memory. If a coherency granule is marked as shared, the coherency granule is resident in the associated cache memory subsystem and may be in at least one other cache memory subsystem in addition to the system memory. If the coherency granule is marked as shared, all of the copies of the coherency granule in all cache memory subsystems so marked are consistent with the system memory. Finally, the invalid state may indicate that the data and the address tag associated with the coherency granule are both invalid and thus are not contained within that cache memory subsystem.
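The four states and the properties attributed to them above may be summarized in a short Python sketch. The enumeration State and the three helper functions are hypothetical names introduced for illustration; state transitions are not modeled because they are not specified in this description.

```python
from enum import Enum


class State(Enum):
    MODIFIED = "M"
    EXCLUSIVE = "E"
    SHARED = "S"
    INVALID = "I"


def is_resident(state):
    """A granule in any state other than invalid is held by this cache."""
    return state is not State.INVALID


def is_consistent_with_memory(state):
    """Exclusive and shared copies match system memory; a modified copy does not."""
    return state in (State.EXCLUSIVE, State.SHARED)


def has_unwritten_update(state):
    """Only a modified granule holds data not yet written to system memory."""
    return state is State.MODIFIED
```

In terms of the snooping discussed here, is_resident corresponds to a snoop hit, and has_unwritten_update corresponds to the case in which the snooped cache holds updated data.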
Typically, in a multiprocessor system, the cache memory subsystems associated with the various processors may comprise a plurality of cache line sizes. Such a system may be considered a heterogeneous multiprocessor system. In such a system, the size of the coherency granule for the system is considered to be the size of the smallest coherency granule for any entity within the system. Thus, when a processor with a relatively larger cache line size performs a read or write operation for a cache line in the system, the operation may be associated with a plurality of coherency granules in the system. Similarly, a system may contain some non-processor entities, such as an I/O device or a DMA (Direct Memory Access) controller. Such non-processor entities may also perform operations in the system, which are associated with a particular block of memory. The size of the operation may vary and may consist of a plurality of coherency granules within the system.
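The relationship between an operation and the coherency granules it touches can be made concrete with a small Python sketch. The function names and the example line sizes of 32, 64, and 128 bytes are assumptions chosen for illustration.

```python
def system_granule_size(line_sizes):
    """The system coherency granule is the smallest granule of any entity."""
    return min(line_sizes)


def granules_in_operation(start, length, granule):
    """Count the coherency granules touched by `length` bytes starting at `start`."""
    first = start // granule                  # index of the first granule touched
    last = (start + length - 1) // granule    # index of the last granule touched
    return last - first + 1
```

With the assumed line sizes, the system coherency granule is 32 bytes, so a line fill by the processor with 128-byte cache lines is associated with four coherency granules. An unaligned 32-byte transfer by a DMA controller starting at byte 16 touches two.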
When an operation is associated with a plurality of coherency granules, then as part of the operation the snooping logic associated with each processor may examine the coherency status of each of these coherency granules and respond accordingly. This may be accomplished by performing the operation as a series of independent requests where each request may consist of a single coherency granule. By issuing separate requests for each coherency granule involved in the operation, several additional bus cycles may be used and additional power may be consumed. These additional bus cycles and additional power may be associated with the independent requests themselves and the responses by the slaves to those independent requests. The additional bus cycles and additional power may also be associated with the independent snooping operations that may be performed by the snooping logic associated with each of the processors in the system. Alternatively, the system may perform the multi-coherency granule operation as a single request, but the snooping logic associated with each processor in the system may provide a single snoop response for the entire operation. The system in turn may have to wait for the snooping logic associated with each processor in the system to complete all of the snoop operations associated with the request before proceeding to initiate the transfer of data between the master entity making the request and the slave device for which the request is targeted. Again this procedure involves additional delay in performing the operation thereby inefficiently using the bandwidth available to the system.
It would therefore be desirable to develop a heterogeneous multiprocessor environment that supports the issuing and receiving of a single request that references multiple coherency granules. It would further be desirable to develop a heterogeneous multiprocessor environment that allows the snooping logic associated with each processor in the system to provide the snoop response for only a portion of the requested coherency granules at a time such that the system makes forward progress on the operation with less delay thereby improving the bandwidth of the system and reducing overall power.
The problems outlined above may at least in part be solved in some embodiments by a bus interface logic unit coupled between a slave, e.g., memory, and a plurality of masters, e.g., processors, configured to issue a request to a snooping logic unit in each cache in the multiprocessor system that a multiple coherency granule request is available for snooping. A coherency granule may refer to the smallest cache line size of a cache in the multiprocessor system. Each snooping logic unit may be configured to snoop a different number of coherency granules at a time. Once the bus interface logic unit has received a collection of sets of indications indicating that one or more coherency granules in the multiple coherency granule request has been snooped by each snooping logic unit in the multiprocessor system and that the data at the addresses for the one or more coherency granules has not been updated, then the bus interface logic unit may allow the data at the addresses of those one or more coherency granules to be transferred between the requesting master and the slave device. By transferring data between the requesting master and the slave device prior to receiving a set of indications regarding the other coherency granules in the multiple coherency granule request, the multiprocessor system may make forward progress on the operation of the multiple coherency granule request with less delay thereby improving the bandwidth of the system and reducing overall power.
In one embodiment of the present invention, a method for performing a read request comprising a plurality of coherency granules may comprise the step of a bus interface logic unit receiving a request from a master, e.g., processor (commonly referred to as a master request), to read a block of data comprising a plurality of coherency granules in a slave, e.g., memory. The bus interface logic unit may be coupled to each master and may serve as an interface between a bus and each master. The bus may further be coupled to the slave. A coherency granule may refer to the smallest cache line size of a cache in a multiprocessor system.
The bus interface logic unit may issue a request (commonly referred to as a snoop request) to a snooping logic unit in each cache in the multiprocessor system indicating that a valid request is available for snooping. The bus interface logic unit may further issue a request to the slave to retrieve the data requested by the master. The bus interface logic unit may then receive the requested data from the slave.
Once a snooping logic unit has been informed that a valid request is available for snooping, the snooping logic unit may perform the snooping method on one or more of the coherency granules of the master request. Each snooping logic unit may be associated with a different sized cache line. That is, each snooping logic unit may be capable of snooping a different number of coherency granules at a time. Since snooping logic units may snoop a different number of coherency granules at a time, the data requested by the master may be transferred to that master by the bus interface logic unit in stages. That is, the bus interface logic unit may transfer one or more coherency granules of the data requested to the master at a time once each snooping logic unit has provided indications that the one or more coherency granules may be transferred, as described in greater detail below. It is noted that even though the following describes steps performed by a particular snooping logic unit, the description is applicable to each snooping logic unit of the multiprocessor system.
As stated above, a snooping logic unit may perform the snooping method on one or more of the coherency granules of the master request. The number of coherency granules that may be snooped at one time by a snooping logic unit may be dependent upon the particular snooping logic unit. Once the one or more coherency granules have been snooped, the bus interface logic unit may receive an acknowledgment from the snooping logic unit that the snooping logic unit performed the snooping on the one or more coherency granules via a multiple bit bus. Each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. The bus interface logic unit may further receive an indication from the snooping logic unit as to whether the one or more coherency granules snooped were a hit in the cache associated with the snooping logic unit via a multiple bit bus. Each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. The bus interface logic unit may further receive an indication from the snooping logic unit as to whether the data associated with the addresses of the one or more coherency granules that were a hit in the cache associated with the snooping logic unit have been updated in that cache via a multiple bit bus. Again, each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. These indications may collectively be called a "collection of sets of indications" where each set of indications, i.e., each corresponding bit in each bus, is associated with a particular coherency granule in the multi-coherency granule request.
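The three multiple bit buses described above may be modeled as bit vectors, one bit per coherency granule. The following Python sketch is one possible way a bus interface logic unit could combine the indications from several snooping logic units. The names SnoopResponse, granule_set, and combine, and the convention that bit i corresponds to granule i, are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SnoopResponse:
    ack: int      # bit i set: granule i has been snooped by this unit
    hit: int      # bit i set: granule i was a hit in this unit's cache
    updated: int  # bit i set: granule i was a hit and holds updated data


def granule_set(resp, i):
    """Return the set of indications (ack, hit, updated) for granule i."""
    return ((resp.ack >> i) & 1, (resp.hit >> i) & 1, (resp.updated >> i) & 1)


def combine(responses, num_granules):
    """Return (clean, dirty) bit masks over all snooping logic units.

    clean: granules snooped by every unit and updated in no cache.
    dirty: granules snooped by every unit and updated in some cache.
    """
    mask = (1 << num_granules) - 1
    all_acked = mask
    any_updated = 0
    for r in responses:
        all_acked &= r.ack                 # a granule is ready only if every unit acked it
        any_updated |= r.updated & r.ack   # updated bits count only for snooped granules
    return all_acked & ~any_updated & mask, all_acked & any_updated
```

For example, if one unit has acknowledged only granules 0 and 1 of a four granule request while another has acknowledged all four and reports granule 2 as updated, only granules 0 and 1 are ready to be transferred. Granule 2 is reported as dirty only once the slower unit has also acknowledged it.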
A determination may be made by the bus interface logic unit as to whether any of the data at the addresses of the coherency granules snooped had been updated in a cache in the system. If the data in a cache at the address of the coherency granules snooped had not been updated, then the bus interface logic unit may transmit to the master the data associated with those of the one or more coherency granules snooped that were not updated.
If the data at the address of a coherency granule snooped has been updated, then the bus interface logic unit may receive the updated data from the snooping logic unit associated with the cache containing the updated data.
In one embodiment, upon receiving the updated data, the bus interface logic unit may write the received updated data to the slave thereby updating the slave to maintain memory coherency within the multiprocessor system. The bus interface logic unit may then read the updated data from the slave and transfer the updated data to the master.
In another embodiment, upon receiving the updated data, the bus interface logic unit may instead directly transfer the received updated data to the requesting master. The bus interface logic unit may then subsequently or concurrently write the updated data to the slave.
A determination may then be made as to whether there are more coherency granules to snoop. If there are more coherency granules to snoop then the snooping logic unit may snoop one or more coherency granules of the request as described above. As stated above, each snooping logic unit may be configured to snoop at a different rate than the other snooping logic units thereby completing the snooping of all of the coherency granules of the request at a different time than the other snooping logic units. It is noted that the bus interface logic unit may be configured to only transfer the non-updated or updated data associated with those coherency granules that have been snooped by each snooping logic unit in the multiprocessor system. Subsequently, the requested data may be transferred to the master in a staggered manner.
If there are no more coherency granules to snoop, then the method is terminated.
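The read method summarized above can be sketched as a small simulation in Python. This is a simplified model under stated assumptions, not a definitive implementation: the function name staged_read is hypothetical, each snooping logic unit is assumed to snoop granules in order at a fixed rate of at least one granule per step, each cache is represented as a dictionary mapping a granule index to its updated data, and the embodiment in which updated data is first written to the slave is the one modeled.

```python
def staged_read(memory, caches, rates):
    """Simulate a multi-granule read; returns (delivered, stages).

    memory: list of granule data held by the slave (modified in place).
    caches: one dict per snooping unit, {granule_index: updated_data}.
    rates:  granules each snooping unit snoops per step (each >= 1).
    """
    num = len(memory)
    snooped = [0] * len(caches)    # granules completed by each snooping unit
    delivered = [None] * num       # data handed to the requesting master
    stages = []                    # granule indices transferred in each stage
    done = 0
    while done < num:
        for k, rate in enumerate(rates):
            snooped[k] = min(num, snooped[k] + rate)
        ready = min(snooped)       # granules snooped by every unit so far
        stage = []
        for g in range(done, ready):
            for cache in caches:
                if g in cache:             # updated copy found by a snooper
                    memory[g] = cache[g]   # write the updated data to the slave
            delivered[g] = memory[g]       # then transfer the granule to the master
            stage.append(g)
        if stage:
            stages.append(stage)
        done = ready
    return delivered, stages
```

With two snooping logic units that snoop two and four granules per step, a four granule read completes in two stages of two granules each, which is the staggered transfer described above.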
In one embodiment of the present invention, a method for performing a write request comprising a plurality of coherency granules may comprise the step of a bus interface logic unit receiving a request from a master, e.g., processor (commonly referred to as a master request), to write a block of data comprising a plurality of coherency granules to a slave, e.g., memory. The bus interface logic unit may be coupled to each master and may serve as an interface between a bus and each master. The bus may further be coupled to the slave. A coherency granule may refer to the smallest cache line size of a cache in a multiprocessor system.
The bus interface logic unit may issue a request (commonly referred to as a snoop request) to a snooping logic unit in each cache in the multiprocessor system indicating that a valid request is available for snooping. The bus interface logic unit may receive data to be written to the slave from the master.
Once a snooping logic unit has been informed that a valid request is available for snooping, the snooping logic unit may perform the snooping method on one or more of the coherency granules of the master request. As stated above, each snooping logic unit may be associated with a different sized cache line. That is, each snooping logic unit may be capable of snooping a different number of coherency granules at a time. Since snooping logic units may snoop a different number of coherency granules at a time, the data received from the master may be transferred to the slave by the bus interface logic unit in stages. That is, the bus interface logic unit may transfer one or more coherency granules of the data received from the master at a time once each snooping logic unit has provided indications that the one or more coherency granules may be transferred, as described in greater detail below. It is noted that even though the following describes steps performed by a particular snooping logic unit, the description is applicable to each snooping logic unit of the multiprocessor system.
As stated above, a snooping logic unit may perform the snooping method on one or more of the coherency granules of the master request. The number of coherency granules that may be snooped at one time by a snooping logic unit may be dependent upon the particular snooping logic unit. Once the one or more coherency granules have been snooped, the bus interface logic unit may receive an acknowledgment from the snooping logic unit that the snooping logic unit performed the snooping on the one or more coherency granules via a multiple bit bus. Each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. The bus interface logic unit may further receive an indication from the snooping logic unit as to whether the one or more coherency granules snooped were a hit in the cache associated with the snooping logic unit via a multiple bit bus. Each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. The bus interface logic unit may further receive an indication from the snooping logic unit as to whether the data associated with the addresses of the one or more coherency granules that were a hit in the cache associated with the snooping logic unit have been updated via a multiple bit bus. Again, each bit in the bus may be associated with a particular coherency granule in the multi-coherency granule request. These indications may collectively be called a "collection of sets of indications" where each set of indications, i.e., each corresponding bit in each bus, is associated with a particular coherency granule in the multi-coherency granule request.
A determination may be made by the bus interface logic unit as to whether any of the data at the addresses of the coherency granules snooped had been updated in a cache in the system. If the data in a cache at the address of the coherency granules snooped had not been updated, then the bus interface logic unit may transfer to the slave the data, as received from the master, associated with those coherency granules not updated.
Alternatively, if the data in the cache at the address of the coherency granules snooped had been updated, then the bus interface logic unit may first allow the updated data to be copied from the associated cache and written to the slave. The bus interface logic unit may then transmit to the slave the data, as received from the requesting master, associated with those coherency granules that have been updated, overwriting the data copied from the associated cache and thereby maintaining memory coherency.
A determination may then be made as to whether there are more coherency granules to snoop. If there are more coherency granules to snoop, then the snooping logic unit may snoop one or more coherency granules as described above. As stated above, each snooping logic unit may be configured to snoop at a different rate than the other snooping logic units thereby completing the snooping of all of the coherency granules of the request at a different time than the other snooping logic units. It is noted that the bus interface logic unit may be configured to only transfer the data received from the master to the slave associated with those coherency granules that have been snooped by each snooping logic unit in the multiprocessor system. Subsequently, the data requested to be written by the master may be written to the slave in a staggered manner.
If there are no more coherency granules to snoop, then the method is terminated.
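The write method summarized above can be sketched in the same style. Again this is a simplified model under stated assumptions: the function name staged_write is hypothetical, each snooping logic unit is assumed to snoop granules in order at a fixed rate of at least one granule per step, and each cache is represented as a dictionary mapping a granule index to its updated data.

```python
def staged_write(memory, master_data, caches, rates):
    """Simulate a multi-granule write; returns a log of slave accesses.

    memory:      list of granule data held by the slave (modified in place).
    master_data: list of granule data received from the requesting master.
    caches:      one dict per snooping unit, {granule_index: updated_data}.
    rates:       granules each snooping unit snoops per step (each >= 1).
    """
    num = len(master_data)
    snooped = [0] * len(caches)    # granules completed by each snooping unit
    log = []
    done = 0
    while done < num:
        for k, rate in enumerate(rates):
            snooped[k] = min(num, snooped[k] + rate)
        ready = min(snooped)       # granules snooped by every unit so far
        for g in range(done, ready):
            for cache in caches:
                if g in cache:             # updated copy: copy it back to the slave first
                    memory[g] = cache[g]
                    log.append(("copyback", g))
            memory[g] = master_data[g]     # then write the master's data over it
            log.append(("write", g))
        done = ready
    return log
```

In this model the copyback always precedes the master's write for the same granule, so the slave ends up holding the data supplied by the requesting master.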
The foregoing has outlined rather broadly the features and technical advantages of one or more embodiments of the present invention in order that the detailed description of the invention that follows may be better understood. Additional features and advantages of the invention will be described hereinafter which form the subject of the claims of the invention.