1. Field of the Invention
The present invention generally relates to computer systems and, more particularly, to a method of optimizing architectural-level operations such as cache instructions.
2. Description of the Related Art
The basic structure of a conventional computer system 10 is shown in FIG. 1. Computer system 10 may have one or more processing units, two of which, 12a and 12b, are depicted, which are connected to various peripheral devices, including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. Also, instead of connecting I/O devices 14 directly to bus 20, they may be connected to a secondary (I/O) bus which is in turn connected through an I/O bridge to bus 20. The computer can have more than two processing units.
In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC.TM. processor marketed by International Business Machines Corporation. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high-speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit 12 can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC.TM. 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 must come through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels of serially connected caches.
A cache has many blocks or lines which individually store the various instructions and data values. An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (that is, they indicate the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming effective address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array. The cache 30 of FIG. 1 depicts such a cache entry array 32 and a cache directory 34.
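The directory lookup described above can be sketched as follows. This is a minimal illustration only: the line size, the number of lines, the direct-mapped organization, and all names (`CacheLine`, `split_address`, `lookup`) are assumptions made for the example and are not taken from FIG. 1.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

LINE_SIZE = 64   # bytes per cache line (assumed for illustration)
NUM_LINES = 8    # direct-mapped cache with 8 lines (assumed for illustration)


@dataclass
class CacheLine:
    tag: int = 0             # address tag field
    valid: bool = False      # simplified stand-in for the state bit field
    inclusive: bool = False  # simplified stand-in for the inclusivity bit field
    value: bytes = b""       # value field (instruction or data)


def split_address(addr: int) -> Tuple[int, int]:
    """Split an address into (tag, index); the tag is a subset of the address."""
    block = addr // LINE_SIZE
    return block // NUM_LINES, block % NUM_LINES


def lookup(lines: List[CacheLine], addr: int) -> Optional[CacheLine]:
    """Return the matching line on a hit, or None on a miss."""
    tag, index = split_address(addr)
    line = lines[index]
    if line.valid and line.tag == tag:
        return line  # tag compare matched: cache hit
    return None
```

Here the list of `tag` and `valid` fields plays the role of the directory, and the list of `value` fields plays the role of the cache entry array.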
When all of the blocks in a set for a given cache are full and that cache receives a request, whether a "read" or "write," to a memory location that maps into the full set, the cache must "evict" one of the blocks currently in the set. The cache chooses a block by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.) to be evicted. An LRU unit is depicted in FIG. 1. If the data in the chosen block is modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 cache, as depicted in the two-level architecture of FIG. 1). By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block is not modified, the block is simply abandoned and not written to the next lower level in the hierarchy. This process of removing a block from one level of the hierarchy is known as an "eviction." At the end of this process, the cache no longer holds a copy of the evicted block.
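The eviction behavior described above can be sketched for a single set with an LRU policy. This is a simplified model, not a description of any particular hardware: the class name `CacheSet`, the use of a dictionary as the next lower level, and the tag-keyed interface are all assumptions made for the example.

```python
from collections import OrderedDict


class CacheSet:
    """One set of a cache with LRU replacement (illustrative model only)."""

    def __init__(self, ways, lower_level):
        self.ways = ways              # number of blocks the set can hold
        self.lower = lower_level      # dict standing in for the next lower level
        self.blocks = OrderedDict()   # tag -> (value, modified); oldest first

    def access(self, tag, write_value=None):
        """Read (write_value is None) or write a block; return the evicted tag, if any."""
        evicted = None
        if tag in self.blocks:
            value, modified = self.blocks.pop(tag)
        else:
            if len(self.blocks) >= self.ways:
                evicted = self._evict()
            value, modified = self.lower.get(tag), False
        if write_value is not None:
            value, modified = write_value, True
        self.blocks[tag] = (value, modified)  # re-insert as most recently used
        return evicted

    def _evict(self):
        tag, (value, modified) = self.blocks.popitem(last=False)  # least recently used
        if modified:
            self.lower[tag] = value  # modified data is written to the lower level
        return tag                   # an unmodified block is simply abandoned
```

For instance, with a two-way set, writing block 1 and then reading blocks 2 and 3 evicts block 1 and writes its modified value to the lower level, whereas a later eviction of the unmodified block 2 writes nothing.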
A conventional cache has many queues: cacheable store queues 38 (which may include read and write queues for each of the cache directory, cache entry array, and other arrays, to fetch data coming in to reload this cache); a cache-inhibited store queue 40; a snoop queue 42 for monitoring requests to, e.g., intervene some data; and a cache operations queue 44 which handles cache instructions that execute control at an architectural level. For example, the PowerPC.TM. processor utilizes certain instructions that specially affect the cache, such as a flush instruction, a kill instruction, a clean instruction, and a touch instruction. These instructions are stored in cache operations queue 44.
Cache instructions allow software to manage the cache. Some of the instructions are supervisory level (performed only by the computer's operating system), and some are user level (performed by application programs). The flush instruction (data cache block flush--"dcbf") causes a cache block to be made available by invalidating the cache block if it contains an unmodified ("shared" or "exclusive") copy of a memory block or, if the cache block contains a modified copy of a memory block, then by first writing the modified value downward in the memory hierarchy (a "push"), and thereafter invalidating the block. The kill instruction (data cache block invalidate--"dcbi," instruction cache block invalidate--"icbi," or data cache block set to zero--"dcbz") is similar to the flush instruction except that a kill instruction immediately forces a cache block to an invalid state, so any modified block is killed without pushing it out of the cache. The clean instruction (data cache block store--"dcbst") causes a block that has been modified to be written to main memory; it affects only blocks which have been modified. The touch instruction (data cache block touch--"dcbt") provides a method for improving performance through the use of software-initiated prefetch hints.
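The difference between the flush, kill, and clean instructions can be sketched on a single block as follows. This models only the effects stated above; the names (`State`, `Block`, `flush`, `kill`, `clean`) are hypothetical, memory is represented by a dictionary, and the state left behind by a clean is an assumption, since the text does not specify it.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"
    MODIFIED = "modified"


class Block:
    def __init__(self, state, value):
        self.state = state
        self.value = value


def flush(block, memory, addr):
    """dcbf-like: push a modified value downward, then invalidate."""
    if block.state is State.MODIFIED:
        memory[addr] = block.value  # the "push"
    block.state = State.INVALID


def kill(block):
    """dcbi/icbi-like: invalidate immediately; a modified value is not pushed."""
    block.state = State.INVALID


def clean(block, memory, addr):
    """dcbst-like: write a modified block to memory; other blocks are unaffected."""
    if block.state is State.MODIFIED:
        memory[addr] = block.value
        # Assumption: the block remains valid after the write; the resulting
        # coherency state is not given in the text.
        block.state = State.EXCLUSIVE
```

In this sketch, a flush and a kill both leave the block invalid, but only the flush preserves a modified value in memory.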
All of the foregoing cache instructions operate on a block whose size is referred to as the processor coherency granule. For many computers, the processor coherency granule is 32 bytes, i.e., the processor can operate on a 32-byte sector in a cache block of the L1 cache. The system bus granule may, however, be larger, for example, 64 bytes or 128 bytes, i.e., the full size of the cache line that is transmitted from the L2 cache to the system bus is 64 bytes or 128 bytes. In other words, an instruction sent along the system bus references a 64-byte word or a 128-byte word, not just 32 bytes. Coherency sizes can vary further, for example, having three coherency sizes with a two-level cache (a processor coherency granule of 32 bytes, an L1 coherency granule of 64 bytes, and an L2/system bus coherency granule of 128 bytes).
This variation in coherency size along the memory hierarchy can lead to certain inefficiencies. For example, if a processor issues an "icbi" instruction to a particular 32-byte sector, that instruction will be sent along the system bus and be treated as a 64-byte instruction; then, if the processor immediately issues another "icbi" instruction for another 32-byte sector that was part of the same 64-byte word as the previous instruction, then traditional systems will send a second 64-byte "icbi" instruction to the same 64-byte word, even though a single system bus instruction would have sufficed to kill the two adjacent 32-byte sectors. Another problem can arise when two different processes or threads have issued instructions which result in redundant performance of the same cache instruction. For example, the cache operations queue may include two "icbi" instructions with the same operand, i.e., acting on exactly the same 32-byte cache sector. These instructions are then redundantly repeated.
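The redundancy described above can be illustrated by mapping each sector-level instruction to the bus-granule address at which it is issued. The sketch below only makes the redundancy visible; it is not a method for eliminating it. The granule sizes follow the example in the text, and the function names are hypothetical.

```python
PROCESSOR_GRANULE = 32  # bytes, as in the example above
BUS_GRANULE = 64        # bytes, as in the example above


def bus_operations(sector_addresses, bus_granule=BUS_GRANULE):
    """Return the bus-granule-aligned address used for each sector-level instruction."""
    return [addr - addr % bus_granule for addr in sector_addresses]


def distinct_bus_operations(sector_addresses, bus_granule=BUS_GRANULE):
    """Return the bus-level addresses that would suffice, in first-seen order."""
    seen = []
    for addr in bus_operations(sector_addresses, bus_granule):
        if addr not in seen:
            seen.append(addr)
    return seen
```

For two "icbi" instructions to adjacent sectors at addresses 0x1000 and 0x1020, `bus_operations` yields two operations on the same 64-byte address 0x1000, while `distinct_bus_operations` yields only one. The same holds for two instructions with an identical operand.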
Another problem relating to the coherency granularity is that a smaller granule increases the number of instructions that are required to complete certain large-scale procedures. For example, a procedure might be performing page-level operations such as copying several pages of memory (a page is a plurality of contiguous memory blocks). If a page were 4 kilobytes and the processor coherency granule were 32 bytes, then a processor performing a flush on an entire page would have to issue 128 "dcbf" instructions, but if the coherency granule were 64 bytes or more, then the number of instructions would be reduced proportionately. This result leads to performance degradation when a procedure is doing many page-level cache operations. Performance is further decreased as the number of processors increases since a processor issuing the cache instructions must wait for snoop responses from all of the other processors before it is sure that the instruction has been completed.
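The instruction counts in the preceding example follow directly from dividing the page size by the coherency granule. A small sketch of that arithmetic, with a hypothetical function name, is:

```python
def instructions_per_page(page_bytes, granule_bytes):
    """Number of cache instructions needed to cover one page at a given granule."""
    if page_bytes % granule_bytes != 0:
        raise ValueError("page size must be a multiple of the granule size")
    return page_bytes // granule_bytes


# A 4-kilobyte page at granules of 32, 64, and 128 bytes.
counts = {g: instructions_per_page(4096, g) for g in (32, 64, 128)}
```

With a 4-kilobyte page, a 32-byte granule requires 128 instructions, a 64-byte granule requires 64, and a 128-byte granule requires 32.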
Consider further how a second processor responds to the cache instructions issued by the first processor. If the processor coherency granule is 32 bytes and the system bus granule is 128 bytes, then when the first processor wants to flush a 32-byte sector, the second processor ends up snooping a 128-byte flush. So even though the first processor just wanted to flush a single 32-byte sector, four such sectors will have to be flushed in the cache of the second processor. This problem is exacerbated when page-level cache operations such as those described above are performed, which result in a large number (128) of such 128-byte snooped flushes. The instructions and ensuing responses create a significant amount of address traffic. It would, therefore, be desirable to devise a method of handling large-scale architectural operations, such as page-level cache instructions, which decreases bus traffic. It would be further advantageous if the method could result in a decreased number of cache instructions that have to be performed, regardless of variations in the coherency granule of the memory hierarchy, and in quicker execution of those instructions.
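The snoop amplification in the foregoing example can be worked through numerically. The sketch below assumes the sizes given in the text (a 4-kilobyte page, a 32-byte processor coherency granule, and a 128-byte system bus granule); the function name and dictionary keys are hypothetical.

```python
def snoop_summary(page_bytes=4096, processor_granule=32, bus_granule=128):
    """Summarize the snooped flushes caused by flushing one page sector by sector."""
    flushes = page_bytes // processor_granule        # one snooped flush per sector flush
    sectors_per_snoop = bus_granule // processor_granule
    distinct_lines = page_bytes // bus_granule       # bus-granule lines in the page
    return {
        "snooped_flushes": flushes,
        "sectors_per_snooped_flush": sectors_per_snoop,
        "distinct_bus_lines": distinct_lines,
        "snoops_per_line": flushes // distinct_lines,
    }
```

Under these assumptions the second processor snoops 128 flushes, each covering four 32-byte sectors, although the page contains only 32 distinct 128-byte lines, so each line is snooped four times.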