1. Field of the Invention
The present invention relates to the field of caches. More particularly, the invention relates to hybrid caching architectures and protocols for multi-processor computer systems.
2. Description of the Related Art
Multi-processor multi-cache computer systems with cache-coherent memories can be based on several cache architectures such as Non-Uniform Memory Architecture (NUMA) and Cache-Only Memory Architecture (COMA). In both examples, cache-coherence protocols are also needed if coherency is to be maintained between the respective caches.
FIGS. 1A, 1B, and 1C-1D are a block diagram, an address map and two flowcharts, respectively, illustrating a cache-coherent NUMA (CC-NUMA) computer system 100. As shown in FIG. 1A, CC-NUMA system 100 includes a plurality of sub-systems 110, 120, . . . 180 coupled to each other by a global interconnect 190. Each sub-system includes at least one processor, a corresponding memory management unit (MMU), a corresponding second level cache (L2$), a main memory, a global interface, a directory and a local interconnect. For example, sub-system 110 includes processors 111a, 111b, . . . 111i, MMUs 112a, 112b, . . . 112i, L2$s 113a, 113b, . . . 113i, main memory 114, global interface 115, directory 116 and local interconnect 119. Note that since sub-systems 110, 120, . . . 180 are similar in structure, the following description of sub-system 110 is also applicable to sub-systems 120, . . . 180.
Processors 111a, 111b, . . . 111i are coupled to MMUs 112a, 112b, . . . 112i, respectively. In turn, MMUs 112a, 112b, . . . 112i are coupled to L2$s 113a, 113b, . . . 113i respectively. L2$s 113a, 113b, . . . 113i, main memory 114 and global interface 115 are coupled to each other via local interconnect 119. Directory 116 is coupled to global interface 115.
Referring now to the memory map of FIG. 1B, the total physical address space of CC-NUMA computer system 100 is distributed among main memories 114, 124, . . . 184. Thus, partitioning of the (global) addressable space (GA) is static and is determined before system configuration time, i.e., before the execution of application software. Accordingly, the first time a sub-system 110 needs to read or write to an address location outside its pre-assigned address space, the data has to be fetched from one of sub-systems 120, . . . 180.
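By way of illustration only, the static partitioning described above can be sketched as a simple address computation. The sub-system count, the memory size per sub-system, the assumption that each main memory covers one contiguous address range, and the function names are all hypothetical and are not taken from FIG. 1B.

```python
# Hypothetical parameters: the count and size below are illustrative only.
NUM_SUBSYSTEMS = 8
MEM_PER_SUBSYSTEM = 1 << 30  # assumed bytes of main memory per sub-system


def home_subsystem(ga: int) -> int:
    """Return the index of the sub-system whose main memory holds address ga.

    Assumes each sub-system owns one contiguous, equally sized range.
    """
    if not 0 <= ga < NUM_SUBSYSTEMS * MEM_PER_SUBSYSTEM:
        raise ValueError("global address outside the physical address space")
    return ga // MEM_PER_SUBSYSTEM


def is_local(ga: int, requester: int) -> bool:
    """True if the requester is also the home sub-system for ga."""
    return home_subsystem(ga) == requester
```

Because the mapping is fixed before any application runs, a request for an address whose home is another sub-system must always be fetched remotely the first time.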
In this example, global interface 115 is responsible for tracking the status of data associated with the address space of main memory 114. The status information of each memory location is stored as a memory tag (MTAG) in directory 116. In addition, since global interface 115 is also responsible for maintaining global cache coherency, global interface 115 includes a hardware and/or software implemented cache coherency mechanism for maintaining coherency between the respective caches and main memories of sub-systems 110, 120, . . . 180.
A typical read request, e.g., a Read_To_Share (RTS), by processor 111a of sub-system 110 occurs in the following manner as illustrated by the flowcharts of FIGS. 1C and 1D. First, processor 111a presents a virtual address (VA) to MMU 112a which converts the VA into a global address (GA) and presents the GA to L2$ 113a (step 1110). If there is a valid copy of the data line of interest in L2$ 113a, e.g., a shared (S) or owned (O) copy, then L2$ 113a provides the data to processor 111a via MMU 112a, thereby completing the read request (steps 1120, 1125).
Conversely, if L2$ 113a does not have a valid copy, then L2$ 113a presents the GA to the local interconnect 119 of the requesting sub-system 110 (step 1130). If the GA is not part of the requesting sub-system 110's local address space, i.e., the requesting sub-system is not the home sub-system, then the request is forwarded to the appropriate home sub-system, e.g., sub-system 120 (step 1145).
Referring now to FIG. 1D, in the above cases where the data cannot be found in the L2$ of the requesting sub-system 110, the home directory (116 or 126) is updated to reflect this response, for example by marking requesting sub-system 110 as a sharer of the data (step 1148).
Next, if requesting sub-system 110 is also the home sub-system, the corresponding MTAG in directory 116 is checked for an appropriate MTAG state, e.g., modified (M), owned (O) or shared (S) for a read (step 1150). If the MTAG state is inappropriate for the read request, or if requesting sub-system 110 is not the home sub-system, directory 126 is checked for an appropriate MTAG state. The directory of the home sub-system has information about which sub-system(s) have valid copies of the data line and which sub-system is the owner of the data line. Note also that the home sub-system may or may not be the owner sub-system. Note further that if the requesting sub-system is also the home sub-system, then the MTAG states will provide an indication of whether the transaction is permitted, i.e., the home directory does not need to be involved in the particular transaction.
If the home sub-system (110 or 120) is determined to have a valid copy of the data line, then the home sub-system provides the data to requesting sub-system 110 (step 1162). In the case where requesting sub-system 110 is also the home sub-system, only an internal data transfer is required. Alternatively, where home sub-system 120 is not the requesting sub-system, global interface 125 of home sub-system 120 responds by retrieving the data line from main memory 124 and sending the requested data line to global interface 115 of requesting sub-system 110 via global interconnect 190.
Conversely, if the home sub-system (110 or 120) does not have a valid copy of the data line, i.e., the home sub-system is not the owner sub-system, then the read request with the GA is forwarded to the global interface of the sub-system that is the owner of the data line of interest, e.g., global interface 185 of owner sub-system 180 (step 1155). Global interface 185 responds by retrieving the data line from one of the L2$s of owner sub-system 180, e.g., owner L2$ 183a, and sending the requested data line to global interface 115 of requesting sub-system 110 via global interconnect 190 (step 1164).
Upon receiving the data line, global interface 115 forwards the data line to L2$ 113a which provides the data to requesting processor 111a (step 1180). The data line can be cached in L2$ 113a off the critical path for subsequent retrieval by processor 111a (step 1190).
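The read flow of FIGS. 1C and 1D can be summarized, purely as a sketch, by the following decision function. The class `DirectoryEntry`, the function `numa_read`, the returned labels, and the treatment of M, O and S as the states valid for a read are illustrative assumptions; the model also ignores the MTAG check of step 1150 and treats the home sub-system as holding a valid copy exactly when it is the owner.

```python
from dataclasses import dataclass, field

VALID_FOR_READ = {"M", "O", "S"}  # assumed set of states that satisfy a read


@dataclass
class DirectoryEntry:
    """Hypothetical home directory entry for one data line."""
    owner: int                                 # sub-system owning the line
    sharers: set = field(default_factory=set)  # sub-systems marked as sharers


def numa_read(requester: int, home: int, l2_state: str,
              entry: DirectoryEntry) -> str:
    """Return a label naming where the data line comes from for an RTS."""
    if l2_state in VALID_FOR_READ:
        return "local L2$"                 # steps 1120, 1125
    entry.sharers.add(requester)           # step 1148: mark the new sharer
    if entry.owner == home:                # home holds a valid copy
        if home == requester:
            return "local main memory"     # internal transfer only
        return "home main memory"          # step 1162
    return "owner L2$"                     # steps 1155, 1164
```

For example, a miss in L2$ 113a on a line whose home is sub-system 120 and whose owner is sub-system 180 would take the last branch.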
When a location in an L2$, e.g., L2$ 113a, is needed for storing another data value, the old cache line needs to be replaced. In this implementation, cache lines having an S state are replaced "silently", i.e., they do not generate any new transactions in computer system 100. In other words, a sub-system remains marked in the home directory of the cache line as a sharer of the replaced cache line with respect to the rest of system 100. Conversely, replacement of cache lines having either O or M state will generate a write-back (WB) transaction to the main memory of the sub-system responsible for this GA. As such, the directory associated with the responsible sub-system is updated to reflect this change.
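As a minimal sketch of this replacement rule, the following function returns the transaction, if any, generated on eviction. The function name and the representation of the home directory's sharer list as a Python set are assumptions made for illustration.

```python
def numa_evict(state: str, node: int, sharers: set):
    """Return the transaction generated when `node` evicts an L2$ line.

    `sharers` stands in for the home directory's sharer list. It is
    deliberately left untouched for an S line: the replacement is silent,
    so the node remains marked as a sharer.
    """
    if state in ("O", "M"):
        return "WB"   # write the line back to the responsible main memory
    return None       # S lines are dropped without any transaction
```

Note that after a silent replacement the directory may list a sharer that no longer holds the line.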
In sum, the architecture of CC-NUMA system 100 is better suited for executing software programs using small data structures which require a small number of the available cache lines in L2$ 113a. This is because the small data structures can remain entirely in L2$ 113a while they are repeatedly accessed. Unfortunately, CC-NUMA system 100 is unable to cache large data structures which are too large to be stored entirely in L2$ 113a, causing a thrashing problem whereby portions of large data structures are repeatedly cached and discarded.
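The thrashing effect can be illustrated with a small simulation. The sketch below assumes a fully associative cache with least-recently-used (LRU) replacement and a program that repeatedly sweeps a data structure from start to end; neither the replacement policy nor the access pattern is specified in the description above, and the sizes used are arbitrary.

```python
from collections import OrderedDict


def hit_rate(num_lines: int, cache_lines: int, passes: int) -> float:
    """Sweep a structure of num_lines cache lines `passes` times through an
    LRU cache holding cache_lines lines, and return the fraction of hits."""
    cache = OrderedDict()
    hits = accesses = 0
    for _ in range(passes):
        for line in range(num_lines):
            accesses += 1
            if line in cache:
                hits += 1
                cache.move_to_end(line)        # mark as most recently used
            else:
                if len(cache) >= cache_lines:
                    cache.popitem(last=False)  # evict least recently used
                cache[line] = True
    return hits / accesses
```

Under these assumptions, a 64-line structure swept ten times through a 128-line cache misses only on the first pass, whereas a 256-line structure never hits because each line is evicted before it is reused.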
FIGS. 2A, 2B, 2C, 2D and 2E illustrate a simple COMA (S-COMA) computer system 200 which is capable of caching large data structures in their entirety since S-COMA system 200 allocates its cache memory space a page at a time. As shown in the block diagram of FIG. 2A, S-COMA system 200 includes a plurality of sub-systems 210, 220, . . . 280 coupled to each other by a global interconnect 290. Each sub-system includes at least one processor, a corresponding memory management unit (MMU), a corresponding second level cache (L2$), a cache memory, a global interface, an address translator, a directory and a local interconnect. For example, sub-system 210 includes processors 211a, 211b, . . . 211i, MMUs 212a, 212b, . . . 212i, L2$s 213a, 213b, . . . 213i, cache memory 214, global interface 215, directory 216, address translator 217 and local interconnect 219. Note that since sub-systems 210, 220, . . . 280 are similar in structure, the following description of sub-system 210 is also applicable to sub-systems 220, . . . 280.
Processors 211a, 211b, . . . 211i are coupled to MMUs 212a, 212b, . . . 212i, respectively. In turn, MMUs 212a, 212b, . . . 212i are coupled to L2$s 213a, 213b, . . . 213i respectively. L2$s 213a, 213b, . . . 213i, cache memory 214 and global interface 215 are coupled to each other via local interconnect 219. Directory 216 is coupled to global interface 215. Address translator 217 is located between global interface 215 and global interconnect 290.
Referring now to the memory maps of FIGS. 2B and 2C, responsibility for tracking the status of total addressable space of S-COMA system 200 is distributed among the respective home directories of sub-systems 210, 220, . . . 280. Partitioning of the cache memories of S-COMA computer system 200 is dynamic, i.e., cache memories 214, 224, . . . 284 function as attraction memory (AM) wherein cache memory space is allocated in page-sized portions during execution of software as the need arises. Note that cache lines within each (allocated) page are individually accessible.
Hence, by allocating memory space in entire pages in cache memories 214, 224, . . . 284, S-COMA computer system 200 avoids the above-described capacity and associativity problem associated with caching large data structures. By simply replacing main memories 114, 124, . . . 184 with similarly-sized page-oriented cache memories 214, 224, . . . 284, large data structures can now be cached entirely in sub-system 210.
In this example, global interface 215 is responsible for tracking the status of data stored in cache memory 214 of sub-system 210, with the status information stored as memory tags (MTAGs) in a corresponding location within directory 216. In addition, since global interface 215 is also responsible for maintaining global cache coherency, global interface 215 includes a hardware and/or software implemented cache coherence mechanism for maintaining coherency between cache memory 214 of sub-system 210 and the caches of other sub-systems 220, . . . 280. Address translator 217 is responsible for translating local physical addresses (LPAs) into global addresses (GAs) for outbound data accesses and GAs to LPAs for incoming data accesses.
In this implementation, the first time a sub-system, e.g., sub-system 210, accesses a particular page, address translator 217 is unable to provide a valid translation from VA to PA for sub-system 210, resulting in a software trap. A trap handler of sub-system 210 selects an unused page in cache memory 214 to hold data lines of the page. MTAGs of directory 216 associated with the page are initialized to an "invalid" state, and address translator 217 is also initialized to provide translations to/from this page's local physical address (LPA) from/to the unique global address (GA) which is used to refer to this page throughout system 200.
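The first-access handling described above may be sketched as follows. The class `AttractionMemory`, its method `handle_trap`, the use of dictionaries for the address translator tables, and the use of the string "I" for the invalid state are hypothetical choices made for illustration; page replacement is not modeled.

```python
class AttractionMemory:
    """Hypothetical model of one sub-system's page-allocated cache memory."""

    def __init__(self, num_pages: int, lines_per_page: int):
        self.free_pages = list(range(num_pages))  # unused local pages
        self.lines_per_page = lines_per_page
        self.ga_to_lpa = {}  # address translator: GA page -> local page
        self.lpa_to_ga = {}  # address translator: local page -> GA page
        self.mtags = {}      # directory: local page -> per-line MTAG states

    def handle_trap(self, ga_page: int) -> int:
        """Allocate a local page for ga_page on its first access."""
        if ga_page in self.ga_to_lpa:
            return self.ga_to_lpa[ga_page]  # already mapped, nothing to do
        if not self.free_pages:
            raise MemoryError("no free page; a page must be replaced first")
        lpa_page = self.free_pages.pop(0)   # select an unused page
        # Every line of the new page starts out invalid.
        self.mtags[lpa_page] = ["I"] * self.lines_per_page
        # Initialize translations in both directions.
        self.ga_to_lpa[ga_page] = lpa_page
        self.lpa_to_ga[lpa_page] = ga_page
        return lpa_page
```

Only the page frame is allocated at this point; individual data lines are fetched later, on demand, as their MTAGs leave the invalid state.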
A typical read request, e.g., a read-to-share (RTS) request, by processor 211a of sub-system 210 occurs in the following manner as illustrated by the flowcharts of FIGS. 2D and 2E. First, processor 211a presents a virtual address (VA) to MMU 212a which converts the VA into an LPA and presents the LPA to L2$ 213a (step 2110). If there is a valid copy of the data line of interest in L2$ 213a, e.g., a shared (S), owned (O) or modified (M) copy, then L2$ 213a provides the data to processor 211a, and the read request is completed (steps 2120, 2125).
Conversely, if L2$ 213a does not have a valid copy, then L2$ 213a presents the LPA to global interface 215 (step 2130). Global interface 215 accesses MTAGs of directory 216 to determine if a valid copy of the data line can be found in cache memory 214 (step 2132).
If such a valid copy exists, the data line is retrieved from cache memory 214 (step 2134). The data line is then provided to L2$ 213a which provides the data to processor 211a via MMU 212a, thereby completing the read request (step 2136).
However, if a valid copy of the data line of interest cannot be located in either L2$ 213a or cache memory 214, then address translator 217 of requesting sub-system 210 converts the LPA to a GA, before sending the data request via global interconnect 290 to the home sub-system whose address space includes the GA of the data line of interest, e.g., sub-system 220 (step 2142). Next, address translator 227 of home sub-system 220 converts the GA into an LPA (step 2144), and looks up the appropriate directory entry to determine if there is a valid copy of the data line in home cache memory 224 (step 2150). This GA to LPA translation in home sub-system 220 can be a trivial function such as stripping an appropriate number of most significant bits (MSBs).
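The trivial GA to LPA translation at the home sub-system can be sketched as a bit mask. The 30-bit LPA width, and the assumption that the stripped MSBs identify the home sub-system, are illustrative only.

```python
LPA_BITS = 30  # assumed width of a local physical address


def home_ga_to_lpa(ga: int) -> int:
    """Strip the most significant bits of a GA to obtain the home LPA."""
    return ga & ((1 << LPA_BITS) - 1)


def home_node_of(ga: int) -> int:
    """Assumed encoding: the stripped MSBs identify the home sub-system."""
    return ga >> LPA_BITS
```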
Referring now to FIG. 2E, in each of the above cases where the data line is not found in requesting sub-system 210, home sub-system 220 updates home directory 226, e.g., to reflect a new sharer of the data line (step 2148).
If a valid copy exists in home sub-system 220, global interface 225 responds by retrieving the data line from cache memory 224 or L2$ 223a, before sending the requested data line to global interface 215 of requesting sub-system 210 via global interconnect 290 (step 2162).
Conversely, if home sub-system 220 does not have a valid copy of the data line, then the read request with the GA is forwarded to the address translator of the owner sub-system, e.g., translator 287 (step 2152). Upon receiving the GA from home sub-system 220, address translator 287 of sub-system 280 converts the GA into an LPA for global interface 285 (step 2154). This GA to LPA translation in owner sub-system 280 is a non-trivial function. Next, global interface 285 of owner sub-system 280 responds by retrieving the data line from either cache memory 284 or one of L2$s 283a, 283b, . . . 283i, and sending the requested data line to global interface 215 of requesting sub-system 210 via global interconnect 290 (step 2164).
When the data line arrives at global interface 215, global interface 215 forwards the data line to L2$ 213a which then provides the data to requesting processor 211a (step 2180). The data line can be cached in L2$ 213a off the critical path for subsequent retrieval by processor 211a, thereby completing the read transaction (step 2190). Note that a GA to LPA translation is not required for returning data.
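As a rough sketch, the S-COMA read flow of FIGS. 2D and 2E can be traced as an ordered list of events. The function name, its boolean and string inputs, and the event labels are hypothetical, and the sketch omits the caching of the returned line.

```python
VALID = {"M", "O", "S"}  # states assumed to satisfy a read


def scoma_read(l2_state: str, mtag_state: str,
               home_has_valid_copy: bool) -> list:
    """Return the ordered events for an RTS by a requesting sub-system."""
    if l2_state in VALID:
        return ["L2$ hit"]                            # steps 2120, 2125
    if mtag_state in VALID:
        return ["L2$ miss", "cache memory hit"]       # steps 2132 to 2136
    events = [
        "L2$ miss",
        "cache memory miss",
        "LPA to GA at requester",                     # step 2142
        "GA to LPA at home",                          # step 2144
        "home directory updated",                     # step 2148
    ]
    if home_has_valid_copy:
        events.append("data from home")               # step 2162
    else:
        events.append("GA to LPA at owner")           # steps 2152, 2154
        events.append("data from owner")              # step 2164
    return events
```

The trace makes visible that a miss served by the owner involves three address translations on the request path and none on the return path.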
Occasionally, replacement of (entire) pages stored in cache memory 214 may be needed when cache memory 214 becomes full or is nearly full, in order to make room for allocating new page(s) on a read request. Ideally, sub-system 210 maintains an optimal number of free pages in cache memory 214 as a background task, i.e., off the critical timing path, ensuring that the attraction memory, i.e., cache memory 214, does not run out of storage space. Upon replacement, a determination of which cache lines of the to-be-replaced page contain valid data (either M, O or S state) is made by accessing the MTAGs stored in directory 216. A message is then sent to the responsible home directory informing the home directory that the cache line is to be replaced.
If the cache line has an M or O state, this transaction is similar to an owner sub-system's Write_Back (WB) transaction in CC-NUMA mode, which writes the data value to the home cache memory of the home sub-system. If the cache line has an S state, the replacement transaction does not transfer any data, but updates the corresponding directory to reflect the fact that the node performing the replacement, i.e., sub-system, no longer has a shared copy of the data line. Hence, in S-COMA system 200, replacement is not "silent" since the respective directory is continually updated to reflect any replacement(s) of the data line.
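A minimal sketch of this non-silent page replacement is given below. The function name, the dictionary layouts for the MTAGs and for the home directory's sharer lists, and the message labels are assumptions; directory updates for M and O lines are not modeled.

```python
def scoma_replace_page(mtags: dict, node: int, sharers: dict) -> list:
    """Evict one page and return the messages sent to home directories.

    mtags maps line index -> MTAG state for the page being replaced.
    sharers maps line index -> set of sharer nodes kept at the home
    directory (hypothetical layout).
    """
    messages = []
    for line, state in sorted(mtags.items()):
        if state in ("M", "O"):
            # Similar to a write-back: the data value goes to the home.
            messages.append((line, "write-back with data"))
        elif state == "S":
            # No data moves, but the home directory drops this sharer.
            sharers[line].discard(node)
            messages.append((line, "replacement notice, no data"))
        # Invalid lines hold no valid data and generate no message.
    return messages
```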
Although S-COMA system 200 is more efficient at caching larger data structures than CC-NUMA system 100, allocating entire pages of cache memory at a time in order to be able to accommodate large data structures is not a cost-effective solution for all access patterns. This is because caching entire pages to accommodate large data structures is inefficient when the data structures are sparse or when only a few elements of the structure are actually accessed.
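The inefficiency for sparse access can be quantified with a short calculation. The 8192-byte page and 64-byte cache line used below are assumed values chosen only to make the arithmetic concrete.

```python
def page_utilization(lines_touched: int, page_bytes: int = 8192,
                     line_bytes: int = 64) -> float:
    """Fraction of an allocated page that is actually accessed."""
    lines_per_page = page_bytes // line_bytes
    return lines_touched / lines_per_page
```

With these assumed sizes a page holds 128 lines, so a program touching only 4 lines of each page uses about 3% of the cache memory it has been allocated.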
Hence there is a need to provide a hybrid caching architecture together with a cache-coherent protocol for a multi-processor computer system that is flexible and efficient in caching both large and small data structures, whether sparse or packed.
Further, in order to fully exploit the capability of such a hybrid caching architecture, there is also a need for static and/or dynamic algorithms to efficiently select appropriate caching modes while executing programs with a wide variety of data structures and access patterns. Although specialized hardware event tracers for capturing class(es) of events, e.g., bus operations, over time can be used to optimize caching mode selection, they are expensive and difficult to implement. This is because event ordering and timing capture based on in-circuit emulation (ICE) technology typically involves complicated high-speed analog circuitry and probes. Accordingly, any cache mode selection algorithm(s) should be simple and yet effective, using event-based histograms which capture some of the same event information, e.g., cache misses, to give some indication of the appropriateness of COMA versus NUMA cache optimization.
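One possible form of such an event-based histogram is sketched below. This is not a selection algorithm taken from the description above: the per-page miss count, the fixed threshold, the page size and the function name are all hypothetical and serve only to show how a histogram of cache misses could give an indication of the appropriate mode.

```python
from collections import Counter


def suggest_modes(miss_addresses, page_bytes: int = 8192,
                  threshold: int = 16) -> dict:
    """Build a per-page histogram of cache misses and suggest a mode.

    Hypothetical rule: a page with at least `threshold` misses is densely
    used and is suggested for COMA; any other page is suggested for NUMA.
    """
    histogram = Counter(addr // page_bytes for addr in miss_addresses)
    return {
        page: ("COMA" if count >= threshold else "NUMA")
        for page, count in histogram.items()
    }
```

For instance, 32 misses spread over one page and a single miss on another page would yield a COMA suggestion for the first page and a NUMA suggestion for the second.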