The present invention is directed to computing systems, and more specifically to an apparatus and method of identifying and logically deleting a defective subdivision of a cache of a computing system.
Increased business reliance on computers has made it vital for many computers, particularly server computers, to remain operative continuously, twenty-four hours a day. Inevitably, however, defects occur that disrupt the service a computer provides. A service interruption, or even a slow-down in service, can interrupt the business itself, potentially costing the business owner much more than the cost of repairing the computer. Memory elements, perhaps more than other elements of a computer, become defective during use. For some defects, the impact on a business can be significant if the defect renders the system inoperative and the memory element must be replaced before the system can be operated again.
One way that industry has addressed this concern is to strive for more reliable memory design and production. However, as the density and scale of integration of memory elements increase, it is inevitable that some defects can be repaired only by replacing the memory element. It is the system downtime for repairs of these types of defects that still needs to be addressed. These concerns are felt particularly strongly with respect to cache memory utilized by a processor. Cache memory is used to provide quick access to frequently referenced and manipulated data and instructions. A level one (L1) cache memory (hereinafter, “cache memory”) typically is integrated in a processor element of a computing system. A processor requires a cache memory having some minimum number of memory locations in order to achieve best processing performance.
When the cache memory becomes defective for one reason or another, conventional approaches have permitted individual storage elements of a cache memory, such as wordlines and columns, to be internally deleted and/or replaced by redundancy elements, in order to permit the processor to be operated again after a permanent defect is detected. In recent years, improved testing and internal self-repair mechanisms have permitted this type of repair operation to be performed by the computer system itself. However, self-repair is generally not available to replace large portions of a cache memory. In addition, self-repair cannot remedy a condition in which a normally repairable portion of a cache memory fails but cannot be repaired because all available repair actions have already been used. When a portion of a cache memory becomes defective in a manner that cannot be repaired by internal mechanisms, and the defect is discovered during the self-test step of powering on the computer, the conventional response is to declare the entire cache memory defective. This usually requires that the entire processor that utilizes the cache memory be taken offline, i.e., removed from the system configuration. In some instances, the response requires that the entire computing system, having multiple processors, be powered down, and not merely the processor that has the failing cache memory. The computing system then awaits repair by physical removal of the part of the system containing the failing processor and replacement thereof by a field replaceable unit (FRU). Clearly, such an outcome is undesirable, as it causes reduced availability of the system to the customer, or even complete unavailability.
A far more desirable outcome would be to permit the computing system to retain the processor having the failing subsection of the cache memory in the configuration and continue operating, and logically (but not physically) remove the failing subsection of the cache memory from the configuration instead.
In view of the foregoing, it would be desirable to provide a mechanism by which an unrepairable subsection of a cache memory is identified and logically deleted from the configuration of the computing system, to permit the computing system to continue operating with greater system availability than heretofore.
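The idea of logically deleting a failing subsection while the rest of the cache keeps operating can be illustrated with a minimal sketch. It is not the mechanism of the invention. It assumes, purely for illustration, that a "subsection" corresponds to one way of a set-associative cache and that replacement is round-robin; the names `Cache`, `delete_way`, `usable_ways`, and `access` are hypothetical.

```python
class Cache:
    """Toy set-associative cache; each way stands in for one 'subsection'.

    Assumption for illustration only: the source does not say that a
    subsection is a way, nor which replacement policy is used.
    """

    def __init__(self, num_sets=4, num_ways=4):
        self.num_sets = num_sets
        self.num_ways = num_ways
        # lines[s][w] holds the tag stored in set s, way w (None = empty).
        self.lines = [[None] * num_ways for _ in range(num_sets)]
        # deleted[w] is True once way w has been logically removed.
        self.deleted = [False] * num_ways
        # Round-robin victim counter, one per set.
        self._next = [0] * num_sets

    def delete_way(self, way):
        """Logically remove a way: mark it unusable and drop its contents."""
        self.deleted[way] = True
        for s in range(self.num_sets):
            self.lines[s][way] = None

    def usable_ways(self):
        """Return the indices of ways still in the configuration."""
        return [w for w in range(self.num_ways) if not self.deleted[w]]

    def access(self, address):
        """Return True on a hit, False on a miss (the line is then filled)."""
        s = address % self.num_sets
        tag = address // self.num_sets
        ways = self.usable_ways()
        if not ways:
            raise RuntimeError("no usable ways remain in the cache")
        # Lookup considers only ways that have not been deleted.
        for w in ways:
            if self.lines[s][w] == tag:
                return True
        # Miss: allocate into a usable way, never into a deleted one.
        victim = ways[self._next[s] % len(ways)]
        self._next[s] += 1
        self.lines[s][victim] = tag
        return False
```

In this sketch, calling `delete_way(0)` on a two-way cache leaves the cache operating with reduced capacity: subsequent accesses hit and allocate only in the remaining way, and the deleted way is never consulted again.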