1. Field of the Invention
This invention relates in general to cache operations and more particularly to a method, apparatus and program storage device for performing a self-healing cache process.
2. Description of the Prior Art
A computer system typically includes a processor coupled to a hierarchical storage system. The hardware can dynamically allocate parts of memory within the hierarchy for addresses deemed most likely to be accessed soon. The type of storage employed in each staging location relative to the processor is normally determined by balancing requirements for speed, capacity, and costs.
Computer processes continually refer to this storage over their executing lifetimes, both reading from and writing to the staged storage system. These references include self-referencing as well as references to every type of other process, overlay or data. It is well-known in the art that data storage devices using high-speed random access memories (RAM) can be referenced orders of magnitude faster than high volume direct-access storage devices (DASDs) using rotating magnetic media. Such electronic RAM storage relies upon high-speed transfer of electrical charges over small distances, while DASDs typically operate mechanically by rotating a data storage position on a magnetic disk with respect to read-write heads. The relative cost of a bit of storage for DASD and RAM makes it necessary to use DASD for bulk storage and electronic RAM for processor internal memory and caching.
A commonly employed memory hierarchy includes a special, high-speed memory known as cache, in addition to the conventional memory which includes main memory and bulk memory. Cache memory may be arranged in a variety of configurations.
For example, in a cache that uses direct mapping, a data item in main memory is always stored at the same location in the cache. Its address is split into three parts: the first is used to address the line that the data can be found in, the second is stored in the tag RAM to uniquely identify the data in the cache, and the third is the offset of the data in the line.
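By way of illustration only, the three-part address split used by a direct mapped cache may be sketched as follows. The geometry (64-byte lines, 256 lines) and the names LINE_SIZE, NUM_LINES and split_address are assumptions chosen for the example and are not taken from any particular cache.

```python
# Hypothetical geometry: 64-byte lines, 256 lines (a 16 KiB direct mapped cache).
LINE_SIZE = 64
NUM_LINES = 256

OFFSET_BITS = LINE_SIZE.bit_length() - 1   # 6 bits select a byte within a line
INDEX_BITS = NUM_LINES.bit_length() - 1    # 8 bits select the cache line


def split_address(addr):
    """Return (tag, index, offset) for a direct mapped cache."""
    offset = addr & (LINE_SIZE - 1)                   # position of the data in the line
    index = (addr >> OFFSET_BITS) & (NUM_LINES - 1)   # which cache line to use
    tag = addr >> (OFFSET_BITS + INDEX_BITS)          # stored in the tag RAM
    return tag, index, offset


print(split_address(0x12345))           # (4, 141, 5)
print(split_address(0x12345 + 0x4000))  # (5, 141, 5): same line, different tag
```

Two addresses that differ only in the tag, such as the pair above, compete for the same cache line in this arrangement.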
Fully associative caches are configured to allow any line in memory to be stored at any location in the cache. Associative caches use fewer bits to address the cache line than would be necessary to uniquely identify it. Thus data items can map to multiple lines. Most caches are n-way associative caches, meaning that a data item can be mapped to any of n distinct locations. In order to identify which data is in the cache, more bits have to be used in the tag RAM. Due to the multiple lines in which a data item may be stored, accessing the cache requires a search process. In order to keep the speed of the lookup operation at the level of a direct mapped cache, the search process must be performed in parallel, requiring n comparators for an n-way associative cache. An n-way associative cache has several advantages: data items that would map to the same line in a direct mapped cache can be stored in the cache at the same time, thrashing is reduced, and utilization is improved. Cache memory can be dedicated for instructions or for data. Caches can also exist for address translation entries and other types of data.
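The effect of associativity on conflicting addresses may be sketched, for illustration only, with the toy model below. The class SetAssociativeCache, the helper run, and the least-recently-used replacement policy are assumptions made for the example; the dictionary lookup stands in for the n parallel tag comparators of a hardware implementation.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy n-way associative tag store with LRU replacement (an assumption)."""

    def __init__(self, num_sets, ways, line_size=64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One OrderedDict per set; keys are tags, ordered oldest to newest.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the line is then filled)."""
        line_number = addr // self.line_size
        index = line_number % self.num_sets
        tag = line_number // self.num_sets
        ways = self.sets[index]
        if tag in ways:                 # hardware would compare n tags in parallel
            ways.move_to_end(tag)       # mark as most recently used
            return True
        if len(ways) >= self.ways:
            ways.popitem(last=False)    # evict the least recently used line
        ways[tag] = True
        return False


def run(cache, addrs):
    """Access each address in turn and collect the hit/miss results."""
    return [cache.access(a) for a in addrs]


# Addresses 0 and 256 collide in both caches; each cache holds four lines.
pattern = [0, 256, 0, 256]
print(run(SetAssociativeCache(num_sets=4, ways=1), pattern))  # direct mapped
print(run(SetAssociativeCache(num_sets=2, ways=2), pattern))  # 2-way
```

With one way per set the alternating pattern misses on every access, whereas with two ways both lines remain resident after their first fill.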
Cache memory reduces the apparent access times of the slower memories by holding the words that the CPU is most likely to access. For example, a computer may use a cache memory that resides between the external devices and main memory, called a disk cache, or between main memory and the CPU, called a CPU cache.
A high-speed CPU cache enables relatively fast access to a subset of data and instructions that were previously transferred from main storage to the cache, and thus improves the speed of operation of the data processing system. Cache memory may also be used to store recently accessed blocks from secondary storage media such as disks. This cache memory could be processor buffers contained in main memory or a separate disk cache memory located between secondary and main storage.
A disk cache is a memory device using a semiconductor RAM or SRAM and is designed to eliminate an access gap between a high-speed main memory and low-speed large-capacity secondary memories such as magnetic disk units. The disk cache is typically in a magnetic disk controller arranged between the main memory and a magnetic disk unit, and serves as a data buffer.
The principle of a disk cache is the same as that of a central processing unit (CPU) cache. When the CPU accesses data on disk, the necessary blocks are transferred from the disk to the main memory. At the same time, they are written to the disk cache. If the CPU subsequently accesses the same blocks, they are transferred from the disk cache and not from the disk, resulting in substantially faster accesses.
Other levels of cache are available separate from disk cache and CPU cache. For example, L1 cache, short for Level 1 cache, is known as the primary cache and is built into a microprocessor. L1 cache is the smallest and fastest cache level. L2 cache, short for Level 2 cache, is a second level of cache that is larger and slower compared to L1 cache. L2 cache, also called the secondary cache, may be found on a separate chip from the microprocessor chip or may be incorporated into a microprocessor chip's architecture. Other layers of cache may also be implemented. Disk cache, in relation to L1 and L2 caches, is much slower and larger. The CPU cache, comparatively speaking, is very fast and very small.
When a memory access request can be satisfied from the cache without using the main memory, the cache controller behaves differently depending on the type of access. For a read operation, the controller selects the data from the cache line and transfers it into a CPU register; the RAM is not accessed and the CPU efficiency increases. For a write operation, the controller may implement one of two basic strategies called write-through and write-back. In a write-through operation, the controller always writes into both RAM and the cache line, effectively switching off the cache for write operations. In a write-back or copyback operation, only the cache line is updated, and the contents of the RAM are left unchanged. After a write-back or copyback operation the RAM must eventually be updated. When a cache miss occurs, the cache line is written to memory, if necessary, and the correct line is fetched from RAM into the cache entry.
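The two write strategies and the handling of a miss may be sketched, for illustration only, as follows. The classes CacheLine and TinyCache are hypothetical, a dictionary stands in for RAM, each line holds a single word, and allocating a line on a write miss is an assumption of the example rather than a requirement of either strategy.

```python
class CacheLine:
    def __init__(self):
        self.valid = False
        self.dirty = False
        self.tag = None
        self.data = None


class TinyCache:
    """One-word-per-line direct mapped cache over a dict standing in for RAM."""

    def __init__(self, ram, num_lines=4, write_back=True):
        self.ram = ram
        self.lines = [CacheLine() for _ in range(num_lines)]
        self.write_back = write_back

    def _lookup(self, addr):
        index, tag = addr % len(self.lines), addr // len(self.lines)
        line = self.lines[index]
        if not (line.valid and line.tag == tag):
            # Miss: write the old line to RAM if necessary, then fetch the new one.
            if line.valid and line.dirty:
                old_addr = line.tag * len(self.lines) + index
                self.ram[old_addr] = line.data
            line.valid, line.dirty, line.tag = True, False, tag
            line.data = self.ram.get(addr, 0)
        return line

    def read(self, addr):
        return self._lookup(addr).data

    def write(self, addr, value):
        line = self._lookup(addr)
        line.data = value
        if self.write_back:
            line.dirty = True        # RAM is updated later, when the line is evicted
        else:
            self.ram[addr] = value   # write-through: RAM is updated immediately


ram = {}
cache = TinyCache(ram, write_back=True)
cache.write(1, 10)
print(1 in ram)   # False: only the cache line holds the new value
cache.read(5)     # address 5 maps to the same line and evicts address 1
print(ram[1])     # 10: the dirty line was written to RAM on eviction
```

In the write-back case RAM is stale until the dirty line is evicted, which is why the RAM must eventually be updated.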
With performance being very important and semiconductor geometries shrinking, it is becoming common to implement large caches. L2 caches of 512 KB to 1 MB are becoming ordinary. However, these caches can experience many types of failures, some of them being transient and some permanent. Caches may have built-in failure checks and may use either parity or error correction code (ECC) methods for detecting errors. For example, parity checks require an extra bit for every 8 bits of data and check for memory errors using even parity or odd parity checks. For even parity, when the 8 bits in a byte receive data, the chip adds up the total number of 1s. If the total number of 1s is odd, the parity bit is set to 1. If the total is even, the parity bit is set to 0. When the data is read back out of the bits, the total is added up again and compared to the parity bit. If the total is odd and the parity bit is 1, then the data is assumed to be valid and is sent to the CPU. But if the total is odd and the parity bit is 0, the chip knows that there is an error somewhere in the 8 bits and dumps the data. Odd parity works the same way, but the parity bit is set to 1 when the total number of 1s in the byte is even.
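The even and odd parity rules described above may be sketched, for illustration only, as follows. The function names parity_bit and check are hypothetical.

```python
def parity_bit(byte, even=True):
    """Return the parity bit stored alongside an 8-bit value."""
    ones = bin(byte & 0xFF).count("1")
    bit = ones % 2                   # 1 when the number of 1s is odd
    return bit if even else bit ^ 1  # odd parity stores the opposite bit


def check(byte, stored_parity, even=True):
    """Return True when the byte read back agrees with its stored parity bit."""
    return parity_bit(byte, even) == stored_parity


data = 0b10110010                    # four 1s
p = parity_bit(data)                 # even parity bit is 0
print(p)                             # 0
print(check(data ^ 0b00000100, p))   # False: a single flipped bit is detected
print(check(data ^ 0b00000110, p))   # True: two flipped bits go unnoticed
```

The last line shows the limitation of a single parity bit: any even number of flipped bits leaves the parity unchanged.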
Parity checking can detect all single bit errors and is 50% effective against random corruption. However, parity does nothing to correct them. If a byte of data does not match its parity bit, then the data are discarded and the system must recover. This problem can reduce cache efficiency and performance.
Some memory caches use a form of error checking known as error-correction code (ECC). Like parity, ECC uses additional bits to monitor the data in each byte. The difference is that ECC uses several bits for error checking instead of one. ECC memory uses a special algorithm not only to detect single bit errors, but actually correct them as well. For example, many memory systems and caches use some type of Hamming code to perform ECC. ECC memory will also detect instances when more than one bit of data in a word fails. Such failures are not correctable, even with ECC, and software is left with little choice but to reset, start over, and determine whether the problem recurs. When permanent errors are present in any type of cache, the same error is repeatedly detected, resulting in decreased cache efficiency and performance.
Recovery from uncorrectable errors may involve a lengthy process of resetting and reloading all the code to the computer system. During this time, the customer may not be able to perform useful work, or if it is a redundant system, there exists the increased risk that a code bug during the recovery process could bring down the entire system. Error recovery code is very difficult to test and tends to have higher error rates than code that is executed more frequently.
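A Hamming-type code of the kind mentioned above may be sketched, for illustration only, as follows. The example uses an extended Hamming (8,4) code protecting four data bits, chosen for brevity; practical memories protect much wider words, and the function names encode and decode and the bit layout are assumptions of the example.

```python
def encode(nibble):
    """Encode 4 data bits into an 8-bit extended Hamming codeword (list of bits)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8                   # code[1..7]: Hamming(7,4); code[0]: overall parity
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the whole word even parity
    return code


def decode(code):
    """Return (nibble, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos          # XOR of the positions holding a 1
    overall = sum(code) % 2          # 0 when the overall parity still holds
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        code[syndrome] ^= 1          # single-bit error: the syndrome is its position
        status = "corrected"
    else:
        return None, "uncorrectable" # two bits flipped: detected, not correctable
    nibble = code[3] | (code[5] << 1) | (code[6] << 2) | (code[7] << 3)
    return nibble, status


word = encode(0b1011)
word[6] ^= 1                         # single-bit failure
print(decode(word))                  # (11, 'corrected')
word[2] ^= 1                         # second failure in the same word
print(decode(word))                  # (None, 'uncorrectable')
```

The uncorrectable result corresponds to the multi-bit case in which software must fall back on recovery.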
It can be seen that there is a need for a method, apparatus and program storage device for increasing cache performance, efficiency and recoverability.