1. Field of the Invention
The present invention relates to mechanisms for providing fault-tolerance within computing systems. More specifically, the present invention relates to a method and an apparatus for fixing bit errors encountered during cache references without blocking concurrently executing threads and/or processes.
2. Related Art
Rapid advances in semiconductor technology presently make it possible to incorporate large caches onto a microprocessor chip. For example, some microprocessors include multiple processors and associated level one (L1) caches that access a large level two (L2) cache, wherein all of these structures reside on the same microprocessor chip. Locating the L2 cache on the microprocessor chip dramatically decreases the time required to access the L2 cache, and can thereby increase performance of the microprocessor system.
However, large on-chip caches are susceptible to random bit errors. One solution to this problem is to use error-correcting codes to detect and correct these errors. Semiconductor memories located outside a microprocessor chip often include additional space for storing an error-correcting code for each data word. When a data word is first stored to the memory, an error-correcting code is calculated from the data word, and this error-correcting code is stored along with the data word in the memory. When the data word is subsequently retrieved from the memory, the error-correcting code is also retrieved. At the same time, a new error-correcting code is calculated for the retrieved data word. If the new error-correcting code differs from the retrieved error-correcting code, a bit error has occurred in either the data word or the error-correcting code. In this case, the error-correcting code can be used to correct the bit error.
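The store, recompute, compare, and correct sequence described above can be illustrated with a minimal sketch. It assumes a 4-bit data word protected by a 3-bit Hamming check code, which is chosen here only for brevity and is not stated in the text; the names `compute_ecc`, `read_and_correct`, and `DATA_POSITIONS` are hypothetical.

```python
# Assumption: Hamming(7,4) layout. Codeword positions are numbered 1..7;
# check bits occupy positions 1, 2, 4 and data bits 0..3 occupy 3, 5, 6, 7.
DATA_POSITIONS = (3, 5, 6, 7)


def compute_ecc(data: int) -> int:
    """Return the 3-bit check code for a 4-bit data word."""
    ecc = 0
    for i, pos in enumerate(DATA_POSITIONS):
        if (data >> i) & 1:
            ecc ^= pos  # XOR together the positions of the set data bits
    return ecc


def read_and_correct(data: int, ecc: int):
    """Recompute the code for the retrieved word and fix a single-bit error.

    Returns (data, ecc, error_found).
    """
    syndrome = compute_ecc(data) ^ ecc  # new code compared with stored code
    if syndrome == 0:
        return data, ecc, False  # codes match: no error detected
    if syndrome in DATA_POSITIONS:
        # The error is in the data word: flip the data bit at that position.
        data ^= 1 << DATA_POSITIONS.index(syndrome)
    else:
        # Syndrome is 1, 2, or 4: the error is in the stored check code.
        ecc ^= syndrome
    return data, ecc, True


# Store a word with its code, inject one bit error, then retrieve it.
stored_data = 0b1011
stored_ecc = compute_ecc(stored_data)
corrupted = stored_data ^ 0b0100
fixed_data, fixed_ecc, found = read_and_correct(corrupted, stored_ecc)
```

A code of this kind corrects only single-bit errors; memories commonly add an extra parity bit so that double-bit errors can at least be detected, which this sketch omits.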
This process of reading data, detecting an error and correcting the error involves a read-modify-write (RMW) operation. Implementing an RMW operation to correct errors introduces additional delay into a cache access, which can greatly reduce computer system performance, and can require additional circuitry that consumes valuable on-chip real estate. Consequently, large on-chip caches presently do not support an RMW operation to detect and correct bit errors. An existing on-chip cache simply checks for data errors during read operations, and if a data error is detected, the entire system stops, thereby preventing other requests from accessing the cache. Alternatively, a trap can be generated and overflow buffers can be used to pile up outstanding transactions.
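The detect-only behavior of an existing on-chip cache can be sketched as follows. This is a toy model under stated assumptions: a single parity bit per word stands in for whatever check the hardware actually uses, and `DetectOnlyCache`, `CacheDataError`, and `parity` are hypothetical names. The raised exception stands in for the system stopping; no correction or write-back is attempted.

```python
class CacheDataError(Exception):
    """Raised when a read detects a parity mismatch (hypothetical name)."""


def parity(word: int) -> int:
    """Return 1 if the word has an odd number of set bits, else 0."""
    return bin(word).count("1") & 1


class DetectOnlyCache:
    """Toy cache that checks for errors on reads but never corrects them."""

    def __init__(self):
        self.lines = {}  # address -> (data word, stored parity bit)

    def write(self, addr: int, word: int) -> None:
        self.lines[addr] = (word, parity(word))

    def read(self, addr: int) -> int:
        word, stored = self.lines[addr]
        if parity(word) != stored:
            # Assumption: detection halts all further accesses.
            raise CacheDataError(f"parity mismatch at address {addr:#x}")
        return word


cache = DetectOnlyCache()
cache.write(0x40, 0b1011)
clean_value = cache.read(0x40)

# Inject a single-bit error directly into the stored word.
word, p = cache.lines[0x40]
cache.lines[0x40] = (word ^ 0b0001, p)
try:
    cache.read(0x40)
    halted = False
except CacheDataError:
    halted = True
```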
In single-chip multiprocessor systems, a large number of transactions may be outstanding at any given time from multiple processors and threads. Hence, providing a mechanism to stop all transactions, or to pile up outstanding requests, introduces a significant amount of complexity and consumes valuable on-chip real estate.
Hence, what is needed is a method and an apparatus for fixing bit errors encountered during references to an on-chip cache without significantly complicating design of the on-chip cache and without stopping outstanding transactions from progressing through the memory subsystem.