1. Field of the Invention
The present invention generally relates to computer systems, and more particularly to error detection and correction in a cache memory of a computer processing unit.
2. Description of the Related Art
The basic structure of a conventional symmetric multi-processor computer system 10 is shown in FIG. 1. Computer system 10 has one or more processing units arranged in one or more processor groups; in the depicted system, there are four processing units 12a, 12b, 12c and 12d in processor group 14. The processing units communicate with other components of system 10 via a system or fabric bus 16. Fabric bus 16 is connected to one or more service processors 18a, 18b, a system memory device 20, and various peripheral devices 22. A processor bridge 24 can optionally be used to interconnect additional processor groups. System 10 may also include firmware (not shown) which stores the system's basic input/output logic, and seeks out and loads an operating system from one of the peripherals whenever the computer system is first turned on (booted).
System memory device 20 (random access memory or RAM) stores program instructions and operand data used by the processing units, in a volatile (temporary) state. Peripherals 22 may be connected to fabric bus 16 via, e.g., a peripheral component interconnect (PCI) local bus using a PCI host bridge. A PCI bridge provides a low latency path through which processing units 12a, 12b, 12c and 12d may access PCI devices mapped anywhere within bus memory or I/O address spaces. The PCI host bridge interconnecting peripherals 22 also provides a high bandwidth path to allow the PCI devices to access RAM 20. Such PCI devices may include a network adapter, a small computer system interface (SCSI) adapter providing interconnection to a permanent storage device (i.e., a hard disk), and an expansion bus bridge such as an industry standard architecture (ISA) expansion bus for connection to input/output (I/O) devices including a keyboard, a graphics adapter connected to a display device, and a graphical pointing device (mouse) for use with the display device. The service processors can alternately reside in a modified PCI slot which includes a direct memory access (DMA) path.
In a symmetric multi-processor (SMP) computer, all of the processing units 12a, 12b, 12c and 12d are generally identical, that is, they all use a common set or subset of instructions and protocols to operate, and generally have the same architecture. As shown with processing unit 12a, each processing unit may include one or more processor cores 26a, 26b which carry out program instructions in order to operate the computer. An exemplary processing unit includes the POWER5™ processor marketed by International Business Machines Corp. which comprises a single integrated circuit superscalar microprocessor having various execution units, registers, buffers, memories, and other functional units, which are all formed by integrated circuitry. The processor cores may operate according to reduced instruction set computing (RISC) techniques, and may employ both pipelining and out-of-order execution of instructions to further improve the performance of the superscalar architecture.
Each processor core 26a, 26b includes an on-board (L1) cache (typically, separate instruction and data caches) implemented using high speed memory devices. Caches are commonly used to temporarily store values that might be repeatedly accessed by a processor, in order to speed up processing by avoiding the longer step of loading the values from system memory 20. A processing unit can include another cache such as a second level (L2) cache 28 which, along with a memory controller 30, supports both of the L1 caches that are respectively part of cores 26a and 26b. Additional cache levels may be provided, such as an L3 cache 32 which is accessible via fabric bus 16. Each cache level, from highest (L1) to lowest (L3), can successively store more information, but at a longer access penalty. For example, the on-board L1 caches in the processor cores might have a storage capacity of 128 kilobytes of memory, L2 cache 28 might have a storage capacity of 4 megabytes, and L3 cache 32 might have a storage capacity of 32 megabytes. To facilitate repair/replacement of defective processing unit components, each processing unit 12a, 12b, 12c, 12d may be constructed in the form of a replaceable circuit board, pluggable module, or similar field replaceable unit (FRU), which can be easily installed in or swapped out of system 10 in a modular fashion.
A cache has many memory blocks which individually store the various instructions and data values. The blocks in any cache are divided into groups of blocks called sets or congruence classes. A set is the collection of cache blocks that a given memory block can reside in. For any given memory block, there is a unique set in the cache that the block can be mapped into, according to preset mapping functions. The number of blocks in a set is referred to as the associativity of the cache, e.g., 2-way set associative means that for any given memory block there are two blocks in the cache that the memory block can be mapped into; however, several different blocks in main memory can be mapped to any given set. A 1-way set associative cache is direct mapped, that is, there is only one cache block that can contain a particular memory block. A cache is said to be fully associative if a memory block can occupy any cache block, i.e., there is one congruence class, and the address tag is the full address of the memory block.
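By way of illustration only, the mapping of a memory block to a congruence class may be sketched as follows. The modulo mapping function, the function names, and the block size and set count are assumptions chosen for the example; they are not taken from the system of FIG. 1.

```python
def split_address(addr, block_size=128, num_sets=512):
    """Split a byte address into (tag, set_index, offset).

    Assumes a simple modulo mapping function; block_size and num_sets
    are illustrative values, not figures from the text.
    """
    offset = addr % block_size          # byte position inside the block
    block_number = addr // block_size   # which memory block is addressed
    set_index = block_number % num_sets # the unique congruence class
    tag = block_number // num_sets      # remainder of the address, kept as the tag
    return tag, set_index, offset


def num_sets_for(capacity_bytes, block_size, ways):
    """Number of congruence classes for a given capacity and associativity."""
    return capacity_bytes // (block_size * ways)
```

With these assumptions, a direct mapped cache (one way) has as many sets as it has blocks, while a fully associative cache, whose associativity equals its block count, has exactly one set.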
An exemplary cache line (block) includes an address field, a state bit field, an inclusivity bit field, and a value field for storing the actual program instruction or operand data. The state bit field and inclusivity bit fields are used to maintain cache coherency in a multiprocessor computer system (to indicate the validity of the value stored in the cache). The address field is a subset of the full address of the corresponding memory block. A compare match of an incoming address with one of the address fields (when the state field bits designate this line as currently valid in the cache) indicates a cache “hit.” The collection of all of the address fields in a cache (and sometimes the state bit and inclusivity bit fields) is referred to as a directory, and the collection of all of the value fields is the cache entry array.
When all of the blocks in a congruence class for a given cache are full and that cache receives a request, whether a read or write operation, to a memory location that maps into the full congruence class, the cache must “evict” one of the blocks currently in that class. The cache chooses a block by one of a number of means known to those skilled in the art (least recently used (LRU), random, pseudo-LRU, etc.) to be evicted. If the data in the chosen block is modified, that data is written to the next lowest level in the memory hierarchy which may be another cache (in the case of the L2 or on-board cache) or main memory (in the case of an L3 cache, as depicted in the three-level architecture of FIG. 1). If the data in the chosen block is not modified, the block can optionally be abandoned and not written to the next lowest level in the memory hierarchy, i.e., if the next lower level is system memory the non-modified line is abandoned; if the next level in the hierarchy is another cache, the shared copy can be moved. At the end of this process, the cache no longer holds a copy of the evicted block.
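The hit check and eviction behavior described above may be sketched for a single congruence class as follows. This is a minimal model assuming a true LRU policy and a write-back cache; the class name `CacheSet` and its method are hypothetical and serve only the example.

```python
from collections import OrderedDict


class CacheSet:
    """One congruence class with LRU replacement (illustrative model)."""

    def __init__(self, ways):
        self.ways = ways
        # Maps tag -> modified flag; least recently used entry comes first.
        self.lines = OrderedDict()

    def access(self, tag, write=False):
        """Return (hit, written_back_tag).

        written_back_tag is the tag of an evicted modified block that
        must be written to the next lower level, or None otherwise.
        """
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            if write:
                self.lines[tag] = True       # line is now modified
            return True, None
        written_back = None
        if len(self.lines) >= self.ways:     # class is full: evict the LRU block
            victim_tag, modified = self.lines.popitem(last=False)
            if modified:
                written_back = victim_tag    # modified data must be written back
        self.lines[tag] = write
        return False, written_back
```

In this model a non-modified victim is simply abandoned, which corresponds to the case where the next lower level is system memory.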
The control logic for a cache memory, and in particular a cache directory, may include error correction code (ECC) circuits to handle errors that arise in a cache line. A bit in a given cache block may contain an incorrect value either due to a soft error (caused by, e.g., stray radiation or electrostatic discharge) or to a hard error (a defective cell). ECCs can be used to reconstruct the proper data stream. Some ECCs can only be used to detect double-bit errors and correct single-bit errors, i.e., if two bits in a particular block are invalid, then the ECC will not be able to determine what the proper data stream should actually be, but at least the failure can be detected. Other ECCs are more sophisticated and even allow detection of triple-bit errors and correction of double-bit errors. Double-bit errors are costly to correct, however, so the usual design tradeoff is to halt the machine when double-bit (uncorrectable) errors occur.
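A code that corrects single-bit errors and detects double-bit errors may be illustrated with a small extended Hamming code. The sketch below protects only four data bits, which is an assumption made to keep the example short; practical cache ECCs protect much wider words, and the function names are hypothetical.

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (list of bits).

    Index 0 holds the overall parity bit; indices 1..7 are the usual
    Hamming(7,4) positions, with check bits at positions 1, 2 and 4.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2              # makes the total parity even
    return code


def decode(code):
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                  # XOR of the positions holding a 1
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd parity means a single-bit error; the syndrome gives its position
        # (a syndrome of 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Non-zero syndrome with even parity: two bits are wrong.
        return "uncorrectable", None
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return status, nibble
```

A double-bit error is reported but no data is returned, which mirrors the case where the failure can be detected but the proper data stream cannot be determined.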
These ECC circuits are one way to deal with soft errors arising in memory cells. Another approach used for dealing with hard errors is to provide redundancy within the arrays (directory, LRU, cache). When a cache chip is fabricated, it can be tested to determine if there are any defective row or column lines in each of the arrays (row and column lines are tested for the entire cache, directory, and LRU). If an array is defective, a fuse can be permanently blown to indicate its defective nature. A comparison is then made inside the array for each accessed address to see if it matches with a defective address. If so, appropriate logic re-routes the address to one of many extra row and column lines formed on the chip, i.e., from redundant bit lines (columns) and word lines (rows). The number of extra bit and word lines may vary depending upon the defect rate and desired chip yield. For a low-defect (larger physical size) cache, two extra lines might be provided for every 256 regular lines, while in a high-defect (smaller physical size) cache, two extra lines might be provided for every eight regular lines.
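The address comparison and re-routing described above may be sketched as follows. The class `RowRepair`, its dictionary standing in for the blown fuses, and the row counts are all assumptions made for illustration; actual repair logic is implemented in hardware within the array.

```python
class RowRepair:
    """Re-routes accesses from defective rows to spare rows (illustrative)."""

    def __init__(self, regular_rows, spare_rows):
        self.regular_rows = regular_rows
        # Spare rows are numbered after the regular rows in this model.
        self.free_spares = list(range(regular_rows, regular_rows + spare_rows))
        self.remap = {}  # defective row -> spare row; stands in for blown fuses

    def mark_defective(self, row):
        """Assign a spare to a defective row; return False if none remain."""
        if row in self.remap:
            return True
        if not self.free_spares:
            return False
        self.remap[row] = self.free_spares.pop(0)
        return True

    def physical_row(self, row):
        """Compare the accessed row against the defective rows and re-route."""
        return self.remap.get(row, row)
```

With two spares per 256 regular rows, a third defective row cannot be repaired in this model, which reflects the limited ability of redundancy to absorb a large number of defects.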
With advancements in chip fabrication and computer configurations, L2 and L3 caches are increasing in size, requiring larger on-chip directories and on-chip (or off-chip) data cache entry arrays. These larger, dense arrays decrease the reliability of the overall chip/system due to the increased chance of defects that occur in manufacturing or in the field. Various means have traditionally been employed to increase the reliability of these larger directory/data caches, such as in-line parity or ECC detection/correction, but there are several disadvantages and limitations with the foregoing approaches. While soft errors (i.e., intermittent faults) are correctable using ECC circuits that repair and re-write the data in the directory, this technique does not solve hard faults where a cache directory bit is stuck either high or low. This situation is particularly problematic when the stuck bit is one of the coherency (state) bits that are supposed to indicate the validity of the line. In-line ECC correction can be used to correct stuck faults, but this approach penalizes access time to the array, since correction is needed with each access, and repeatedly consumes part of the error correction capability. The use of redundant cache lines can partially overcome hard faults, but these redundant structures are wasteful as they take up valuable space on the chip or system board and generally require the machine to be re-booted for them to take effect. Redundancy is also limited in its ability to correct a large number of defects. Moreover, hard errors that arise after testing may not be correctable using redundant lines. When these types of hard faults occur, conventional ECC circuits that try to repair and re-write the data will lead to a situation wherein the system repetitively attempts to correct the error without success. In this situation, the machine cannot recover and must be brought down and repaired, costing customers time and money, if full error correction and detection resources are to be maintained.
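The repetitive, unsuccessful correction of a stuck bit may be illustrated with the following simplified model. The class `StuckWord` and the function `scrub` are hypothetical; the argument `correct_value` stands in for the value that an ECC circuit would reconstruct, and no actual ECC is computed here.

```python
class StuckWord:
    """A stored word in which one bit may be stuck high or low (illustrative)."""

    def __init__(self, stuck_bit=None, stuck_value=0):
        self.stuck_bit = stuck_bit
        self.stuck_value = stuck_value
        self.value = 0

    def write(self, value):
        if self.stuck_bit is not None:
            mask = 1 << self.stuck_bit
            # The defective cell ignores the written bit and keeps its stuck value.
            value = (value | mask) if self.stuck_value else (value & ~mask)
        self.value = value

    def read(self):
        return self.value


def scrub(word, correct_value, max_attempts=3):
    """Re-write the corrected value until it reads back clean.

    Returns the number of attempts used, or None if the error persists
    after max_attempts (the hard-fault case).
    """
    for attempt in range(1, max_attempts + 1):
        word.write(correct_value)
        if word.read() == correct_value:
            return attempt
    return None
```

For a healthy word the first re-write succeeds. For a word whose stuck bit disagrees with the correct value, every re-write reads back wrong, so the loop only ends because of the attempt limit assumed in this sketch.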
In light of the foregoing, it would be desirable to devise an improved method of handling hard errors that arise in a cache directory. It would be further advantageous if the method could be implemented without requiring wasteful redundant circuitry or in-line correction which penalizes directory access time and consequently degrades system performance.