1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to a method of improving the performance of a cache used by a processor of a computer system, by reducing delays associated with parity checks and error correction codes.
2. Description of the Related Art
The basic structure of a conventional computer system 10 is shown in FIG. 1. Computer system 10 may have one or more processing units, two of which 12a and 12b are depicted. Processing units 12a and 12b are connected to various peripheral devices including input/output (I/O) devices 14 (such as a display monitor, keyboard, and permanent storage device), memory device 16 (such as random access memory or RAM) that is used by the processing units to carry out program instructions, and firmware 18 whose primary purpose is to seek out and load an operating system from one of the peripherals (usually the permanent memory device) whenever the computer is first turned on. Processing units 12a and 12b communicate with the peripheral devices by various means, including a generalized interconnect or bus 20. Computer system 10 may have many additional components which are not shown, such as serial and parallel ports for connection to, e.g., modems or printers. Those skilled in the art will further appreciate that there are other components that might be used in conjunction with those shown in the block diagram of FIG. 1; for example, a display adapter might be used to control a video display monitor, a memory controller can be used to access memory 16, etc. Also, instead of connecting I/O devices 14 directly to bus 20, they may be connected to a secondary (I/O) bus which is further connected to an I/O bridge to bus 20. The computer can have more than two processing units.
In a symmetric multi-processor (SMP) computer, all of the processing units are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. A typical architecture is shown in FIG. 1. A processing unit includes a processor core 22 having a plurality of registers and execution units, which carry out program instructions in order to operate the computer. An exemplary processing unit includes the PowerPC™ processor marketed by International Business Machines Corp. The processing unit can also have one or more caches, such as an instruction cache 24 and a data cache 26, which are implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from memory 16. These caches are referred to as "on-board" when they are integrally packaged with the processor core on a single integrated chip 28. Each cache is associated with a cache controller (not shown) that manages the transfer of data between the processor core and the cache memory.
A processing unit 12 can include additional caches, such as cache 30, which is referred to as a level 2 (L2) cache since it supports the on-board (level 1) caches 24 and 26. In other words, cache 30 acts as an intermediary between memory 16 and the on-board caches, and can store a much larger amount of information (instructions and data) than the on-board caches can, but at a longer access penalty. For example, cache 30 may be a chip having a storage capacity of 256 or 512 kilobytes, while the processor may be an IBM PowerPC™ 604-series processor having on-board caches with 64 kilobytes of total storage. Cache 30 is connected to bus 20, and all loading of information from memory 16 into processor core 22 usually comes through cache 30. Although FIG. 1 depicts only a two-level cache hierarchy, multi-level cache hierarchies can be provided where there are many levels of interconnected caches.
A cache has many "blocks" which individually store the various instructions and data values. The blocks in any cache are divided into groups of blocks called "sets" or "congruence classes." A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
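The mapping of a memory block to its congruence class can be illustrated with a minimal sketch. It assumes the common modulo mapping function (the text says only that the mapping is preset), and the function name `split_address`, the 64-byte block size, and the 128 sets are hypothetical values chosen for illustration.

```python
def split_address(addr, block_size=64, num_sets=128):
    """Split a byte address into (tag, set_index, offset).

    Assumes a simple modulo mapping; block_size and num_sets are
    illustrative values, not taken from the text.
    """
    offset = addr % block_size          # byte position within the block
    block_number = addr // block_size   # which memory block is addressed
    set_index = block_number % num_sets # the one congruence class it maps to
    tag = block_number // num_sets      # what the directory stores to identify it
    return tag, set_index, offset


# Several memory blocks map to the same set and are told apart by their tags.
print(split_address(0x12345))                  # (9, 13, 5)
print(split_address(0x12345 + 64 * 128))       # (10, 13, 5)
# With a single congruence class (fully associative), the tag is the whole block number.
print(split_address(0x12345, num_sets=1))      # (1165, 0, 5)
```

The associativity does not appear in the mapping itself; it only determines how many cache blocks within the selected set may hold the memory block.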
An exemplary cache line (block) includes an address tag field, a state bit field, an inclusivity bit field, and a value field for storing the actual instruction or data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (indicating the validity of the value stored in the cache). The address tag is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the tags within the address tag field indicates a cache "hit." The collection of all of the address tags in a cache is referred to as a directory (and sometimes includes the state bit and inclusivity bit fields), and the collection of all of the value fields is the cache entry array.
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a "read" or "write," to a memory location that maps into the full congruence class, the cache must "evict" one of the blocks currently in the class. The cache chooses a block to be evicted by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.). If the data in the chosen block is modified, that data is written to the next lower level in the memory hierarchy, which may be another cache (in the case of the L1 or on-board cache) or main memory (in the case of an L2 cache, as depicted in the two-level architecture of FIG. 1). By the principle of inclusion, the lower level of the hierarchy will already have a block available to hold the written modified data. However, if the data in the chosen block is not modified, the block is simply abandoned and not written to the next lower level in the hierarchy. This process of removing a block from one level of the hierarchy is known as an "eviction." At the end of this process, the cache no longer holds a copy of the evicted block.
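The eviction behavior described above can be sketched for a single congruence class. This is only an illustration under stated assumptions: the class name `CacheSet` is hypothetical, a true LRU policy is used (one of the several policies mentioned), and a plain dictionary stands in for the next lower level of the hierarchy.

```python
from collections import OrderedDict


class CacheSet:
    """One congruence class with LRU replacement and write-back on eviction."""

    def __init__(self, ways, lower_level):
        self.ways = ways
        self.lower = lower_level       # dict standing in for the next lower level
        self.blocks = OrderedDict()    # tag -> (value, modified); oldest entry first

    def access(self, tag, value=None):
        """Read when value is None, otherwise write. Returns the block's value."""
        if tag in self.blocks:                       # hit: refresh LRU position
            current, modified = self.blocks.pop(tag)
            if value is not None:
                current, modified = value, True
            self.blocks[tag] = (current, modified)
            return current
        if len(self.blocks) >= self.ways:            # miss in a full class: evict
            victim_tag, (victim_value, victim_modified) = self.blocks.popitem(last=False)
            if victim_modified:
                self.lower[victim_tag] = victim_value  # modified data is written back
            # an unmodified victim is simply abandoned
        if value is None:
            loaded = self.lower.get(tag)
            self.blocks[tag] = (loaded, False)
            return loaded
        self.blocks[tag] = (value, True)
        return value


lower = {1: "a", 2: "b", 3: "c"}
s = CacheSet(ways=2, lower_level=lower)
s.access(1)          # load block 1
s.access(2, "B")     # write block 2 (now modified)
s.access(3)          # evicts block 1, which is unmodified and just dropped
s.access(1)          # evicts block 2, whose modified value is written back
print(lower[2])      # B
```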
FIG. 2 illustrates the foregoing cache structure and eviction process. A cache 40 (L1 or a lower level) includes a cache directory 42, a cache entry array 44, an LRU array 46, and control logic 48 for selecting a block for eviction from a particular congruence class. The depicted cache 40 is 8-way set associative, and so each of the directory 42, cache entry array 44 and LRU array 46 has a specific set of eight blocks for a particular congruence class as indicated at 50. In other words, a specific member of the congruence class in cache directory 42 is associated with a specific member of the congruence class in cache entry array 44 and with a specific member of the congruence class in LRU array 46, as indicated by the "X" shown in congruence class 50.
A bit in a given cache block may contain an incorrect value, either due to a soft error (a random, transient condition caused by, e.g., stray radiation or electrostatic discharge) or to a hard error (a permanent condition, e.g., defective cell). One common cause of errors is a soft error resulting from alpha radiation emitted by the lead in the solder (C4) bumps used to form wire bonds with circuit leads. Most errors are single-bit errors, that is, only one bit in the field is incorrect. In order to deal with potential errors, each of the blocks in directory 42 is connected to the control logic via an error correction code (ECC) circuit 52 which can be used to reconstruct the proper data stream. Some ECCs can only be used to detect and correct single-bit errors, i.e., if two or more bits in a particular block are invalid then the ECC might not be able to determine what the proper data stream should actually be, but at least the failure can be detected. Other ECCs are more sophisticated and even allow detection or correction of double errors. These latter errors are costly to correct, but the design tradeoff is to halt the machine when double-bit errors occur. Although only directory 42 is shown with ECC circuits, these circuits can similarly be used with other arrays, such as cache entry array 44.
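As a small worked illustration of single-bit error correction, the sketch below uses the textbook Hamming(7,4) code. This particular code is an assumption made for illustration; the text does not state which code ECC circuit 52 implements, and the function names are hypothetical. A nonzero syndrome gives the position of the single flipped bit.

```python
DATA_POSITIONS = (3, 5, 6, 7)    # 1-based positions holding data bits
PARITY_POSITIONS = (1, 2, 4)     # 1-based positions holding check bits


def syndrome(codeword):
    """XOR of the 1-based positions of all set bits; 0 means no error detected."""
    s = 0
    for pos, bit in enumerate(codeword, start=1):
        if bit:
            s ^= pos
    return s


def encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword whose syndrome is 0."""
    codeword = [0] * 7
    for pos, bit in zip(DATA_POSITIONS, data_bits):
        codeword[pos - 1] = bit
    s = syndrome(codeword)
    for pos in PARITY_POSITIONS:
        codeword[pos - 1] = 1 if s & pos else 0
    return codeword


def correct(codeword):
    """Return (corrected codeword, syndrome); flips the bit the syndrome points at."""
    fixed = list(codeword)
    s = syndrome(fixed)
    if s:
        fixed[s - 1] ^= 1
    return fixed, s


stored = encode([1, 0, 1, 1])
damaged = list(stored)
damaged[4] ^= 1                      # a soft error flips the bit at position 5
repaired, where = correct(damaged)
print(repaired == stored, where)     # True 5
```

This basic code cannot tell a double-bit error from a single-bit one and would miscorrect it; detecting double errors requires an extended code with an additional overall parity bit.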
The outputs of ECC circuits 52, whose values correspond to (corrected) memory block addresses, are connected to respective comparators 54 each of which also receives the address of the requested memory block. If a valid copy of a requested memory block is in the congruence class 50, then one, and only one, of the comparators 54 will output an active signal. The outputs of comparators 54 are connected to a multiplexer 56 and also to an OR gate 58, whose output controls multiplexer 56. If a cache hit occurs (a requested address matches with an address in cache directory 42), then OR gate 58 activates multiplexer 56 to pass on a signal indicating which member of the congruence class matches the address. This signal controls another multiplexer 60 which receives inputs from each of the entries in cache entry array 44. In this manner, when a cache hit in the directory occurs, the corresponding value is passed through multiplexer 60 to a bus 62.
If a cache miss occurs, and if all of the blocks in the particular congruence class 50 already have valid copies of memory blocks, then one of the cache blocks in congruence class 50 must be selected for victimization. This selection is performed using the LRU bits for the congruence class in LRU array 46. For each cache block in the class, there are a plurality of LRU bits, for example, three LRU bits per block for an 8-way set associative cache. The LRU bits from each block in the class are provided as inputs to a decoder 64 having an 8-bit output to indicate which of the blocks is to be victimized. This output is coupled to multiplexer 56. In this manner, if OR gate 58 is not active, multiplexer 56 passes on an indication of the cache block to be used based on the outputs of decoder 64.
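The selection path of FIG. 2 for hits and misses can be modeled in a few lines. This is a behavioral sketch only: the function names `select_way` and `read_value` are hypothetical, the tags are assumed to have already passed through ECC circuits 52, and the decoding of the LRU bits by decoder 64 is collapsed into a single victim index `lru_victim`.

```python
def select_way(tags, requested_tag, lru_victim):
    """Return (hit, one_hot_select) for one congruence class."""
    compare = [t == requested_tag for t in tags]              # comparators 54
    hit = any(compare)                                        # OR gate 58
    decoder = [i == lru_victim for i in range(len(tags))]     # decoder 64 output
    select = compare if hit else decoder                      # multiplexer 56
    return hit, select


def read_value(entries, select):
    """Multiplexer 60: pass the entry whose select line is active."""
    return next(e for e, s in zip(entries, select) if s)


tags = [10, 11, 12, 13, 14, 15, 16, 17]            # 8-way class
entries = ["v0", "v1", "v2", "v3", "v4", "v5", "v6", "v7"]

hit, select = select_way(tags, requested_tag=13, lru_victim=5)
print(hit, read_value(entries, select))             # True v3

hit, select = select_way(tags, requested_tag=99, lru_victim=5)
print(hit, select.index(True))                      # False 5
```

On a miss the same select lines identify the block to be victimized rather than the block to be read.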
There are several disadvantages and limitations in the foregoing cache construction with respect to ECC circuit design. First, ECC circuits 52 are fairly complex and take up space on the chip. Second, they seriously slow down processing since they are in the critical (timing) path for retrieving the cached values (either from the directory or from the cache entry array). Third, the ECC circuits require extra bits to be included in the address tag. For example, a 32-bit address tag requires a 7-bit ECC, so the effective size of the tag is 39 bits. These extra ECC bits must be transmitted along with the other bits, taking up bus bandwidth, and also require that the cache array be physically larger. Again, for example, if a cache directory were 8-way associative and each address tag had 7 ECC bits, then a total of 56 ECC bits would be required for each congruence class. The problem is clearly aggravated in caches of very large sizes. Conversely stated, the presence of so many ECC bits makes it more difficult to place a given size cache on a smaller area of silicon. In light of the foregoing, it would be desirable to devise a cache construction allowing single-bit error correction and double-bit error detection, but which provides faster access to the directory or cache entry array. It would be further advantageous if the cache construction provided such error correction and detection without requiring the large number of ECC bits used in prior art schemes.
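The bit counts quoted above can be checked with a short calculation. The sketch assumes an extended Hamming code (single-error correction plus one overall parity bit for double-error detection), which is a standard construction but is not named in the text; the function name `secded_check_bits` is hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits for single-error correction, double-error detection.

    Assumes an extended Hamming code: the smallest r with
    2**r >= data_bits + r + 1, plus one overall parity bit.
    """
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1


tag_bits = 32
ways = 8
per_tag = secded_check_bits(tag_bits)
print(per_tag)               # 7 check bits per tag
print(tag_bits + per_tag)    # 39 bits effective tag size
print(ways * per_tag)        # 56 check bits per congruence class
```

Because the number of check bits grows roughly with the logarithm of the protected width, protecting each tag separately costs far more bits in total than the per-tag figure suggests.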
It is therefore one object of the present invention to provide an improved cache to be used by a processor of a computer system.
It is another object of the present invention to provide such a cache which allows for single-bit error correction and double-bit error detection using error correcting codes, but which does not introduce a significant delay in cache operation if correction of a value or tag is not required.
It is yet another object of the present invention to provide such a cache using error correcting codes, but having fewer ECC bits than conventional designs.
The foregoing objects are achieved in a method of checking for errors in a cache array used by a processor of a computer system, generally comprising the steps of mapping a plurality of memory blocks to one or more congruence classes of the array, loading a plurality of values into cache blocks of the congruence class(es), comparing a requested value associated with one of the memory blocks to the values loaded in the cache blocks, and determining, concurrently with said comparing step, whether the cache blocks collectively contain at least one error (such as a soft error caused by stray radiation). The method preferably includes the step of performing separate parity checks on each of the cache blocks. If a parity error is detected for any block, an error correction code is executed for the entire congruence class, i.e., only one set of ECC bits is used for the combined cache blocks forming the congruence class. Parity checkers are connected in parallel with comparators used for the comparison, to achieve error detection concurrent with the comparison. The output of the parity checkers can be connected to an OR gate whose output is further used to activate the ECC for the congruence class. If an error is detected and the ECC is executed, the cache operation is retried after ECC execution. The present invention can be applied to a cache directory containing address tags, or to a cache entry array containing the actual instruction and data values. This novel method allows the ECC to perform double-bit error detection as well, but a smaller number of error checking bits is required as compared with the prior art, due to the provision of a single ECC field for the entire congruence class. For example, in the previous prior art ECC scheme, an 8-way set associative cache with 32 bits of tag would require a total of 56 ECC bits; with the present invention, the number of required ECC bits is reduced to 10.
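The control flow summarized above (parity checks alongside the comparison, with the class-wide ECC invoked and the operation retried only when a parity error is seen) might be modeled as follows. This is a behavioral sketch, not the claimed circuit: the names `parity`, `lookup`, and `correct_class` are hypothetical, and the class-wide ECC is represented by a caller-supplied function because its encoding is not detailed here.

```python
def parity(word):
    """Even-parity bit of an integer: 1 if the number of set bits is odd."""
    return bin(word).count("1") & 1


def lookup(tags, parity_bits, requested_tag, correct_class):
    """Return (matching way or None, whether the ECC was run).

    correct_class is a stand-in for the single ECC covering the whole
    congruence class; it takes the tags and returns corrected tags.
    """
    for attempt in range(2):
        # In hardware these two steps operate in parallel.
        compare = [t == requested_tag for t in tags]                  # comparators
        errors = [parity(t) != p for t, p in zip(tags, parity_bits)]  # parity checkers
        if not any(errors):                                           # OR gate inactive
            way = compare.index(True) if any(compare) else None
            return way, attempt == 1
        tags = correct_class(tags)     # run the class-wide ECC, then retry
    raise RuntimeError("uncorrectable error in congruence class")


good = [0b1010, 0b0111, 0b0001, 0b1100]
stored_parity = [parity(t) for t in good]

# Error-free access: the ECC is never invoked.
print(lookup(good, stored_parity, 0b0111, lambda t: t))              # (1, False)

# A single flipped bit trips a parity checker; the ECC runs and the lookup is retried.
damaged = list(good)
damaged[1] ^= 0b0100
print(lookup(damaged, stored_parity, 0b0111, lambda t: list(good)))  # (1, True)
```

In the common error-free case the result depends only on the comparison and the parity checks, so the ECC stays out of the critical path.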
This construction not only leads to smaller cache array sizes, but also to faster overall operation by avoiding unnecessary ECC circuit operations.
The above as well as additional objectives, features, and advantages of the present invention will become apparent in the following detailed written description.