Reliability and availability have become major concerns for businesses and individuals who are increasingly dependent on computers to sustain production or development productivity in various environments. Extended outages cannot be tolerated, and these businesses and individuals are therefore placing heavy weight on RAS (reliability, availability, and serviceability) performance statistics in their purchase considerations.
Future system designs must incorporate techniques that increase RAS performance, not only to satisfy customer demands but also to reduce field repair actions and scrap and rework costs.
Previous systems, having no reconfiguration capability, require FRU (Field Replaceable Unit) replacement whenever a failure occurs. This may cause extensive system outages due to customer engineer arrival delays, troubleshooting time, and the procurement and installation of parts. Reconfiguring the hardware to use spare components in place of failing components prevents extensive outages: in the best case it permits continued operation, and for catastrophic failures it requires a program RESTART or RE-IPL (initial program load).
Several techniques have been applied to increase system reliability and availability by reconfiguration of hardware after a component or unit failure. Specifically, since failure rates have been significantly higher for bipolar arrays than for logic chips in a processor, system design techniques such as reconfiguration have been used to improve the reliability and availability of these bipolar arrays. Such an improvement would make the most significant contribution toward increasing the overall reliability and availability of the processor. These bipolar arrays are generally used in local storage (i.e., register files), cache, control storage RAMs, and channel buffers.
A first type of reconfiguration that has been employed with bipolar arrays uses half of the array depth for functional purposes and reconfigures portions of the arrays to the unused half as hard errors are encountered. This method is effective for single-cell or multi-cell failures confined to one half of the chip, but is ineffective for failures affecting both halves, such as catastrophic chip failures or address-related failures.
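The half-depth scheme can be illustrated with a small software model. This is only a sketch under stated assumptions, not a description of any actual hardware: the class name `HalfDepthArray`, its methods, and the per-address remapping granularity are all hypothetical, chosen to show why the method handles cell failures in one half but not failures that reach both halves.

```python
class HalfDepthArray:
    """Hypothetical model of an array that uses only half of its depth.

    Functional addresses are 0 .. half-1. A functional address with a hard
    error is redirected to the matching location in the unused half.
    Per-address remapping is an assumption made for illustration.
    """

    def __init__(self, depth):
        if depth <= 0 or depth % 2:
            raise ValueError("depth must be a positive even number")
        self.half = depth // 2
        self.cells = [0] * depth
        self.bad = set()        # physical locations with hard errors (fault model)
        self.remapped = set()   # functional addresses redirected to the spare half

    def inject_fault(self, physical_addr):
        """Mark a physical location as having a hard error (test aid)."""
        self.bad.add(physical_addr)

    def physical(self, addr):
        """Return the physical location currently backing a functional address."""
        if not 0 <= addr < self.half:
            raise IndexError("functional address out of range")
        return addr + self.half if addr in self.remapped else addr

    def report_hard_error(self, addr):
        """Reconfigure addr to the spare half.

        Returns False when the spare location is also failing, which is the
        case for failures that affect both halves of the chip.
        """
        spare = addr + self.half
        if spare in self.bad:
            return False
        self.remapped.add(addr)
        return True

    def write(self, addr, value):
        self.cells[self.physical(addr)] = value

    def read(self, addr):
        return self.cells[self.physical(addr)]
```

In this model, a fault injected at physical location 2 of an 8-deep array is recovered by redirecting functional address 2 to location 6. If both location 3 and location 7 are faulty, `report_hard_error(3)` returns `False`, mirroring the limitation described above.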
A second type of reconfiguration that has been used translates addresses so that failing chip addresses are mapped to spare chip addresses in spare arrays. This method is successful for all chip failure modes, but adds significant delay (several nanoseconds) to the critical addressing nets. Such delay generally cannot be tolerated in high-performance processor applications, since the cycle time is largely based upon array access times.
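The address-translation scheme can likewise be sketched in software. The names `SpareChipTranslator`, `mark_failed`, and `translate` are hypothetical, and the dictionary lookup merely stands in for the translation logic; in hardware, that logic is what lies on the critical addressing path and contributes the added delay.

```python
class SpareChipTranslator:
    """Hypothetical model of address translation from failing to spare chips.

    Chips 0 .. num_chips-1 are functional; chips num_chips and above are
    spares. All names and the table-based design are illustrative assumptions.
    """

    def __init__(self, num_chips, num_spares):
        self.num_chips = num_chips
        self.free_spares = list(range(num_chips, num_chips + num_spares))
        self.translation = {}   # failing chip index -> spare chip index

    def mark_failed(self, chip):
        """Assign a spare to a failing chip and return the spare's index."""
        if not 0 <= chip < self.num_chips:
            raise ValueError("not a functional chip index")
        if chip in self.translation:
            return self.translation[chip]
        if not self.free_spares:
            raise RuntimeError("no spare chips remain")
        spare = self.free_spares.pop(0)
        self.translation[chip] = spare
        return spare

    def translate(self, chip, offset):
        """Map a (chip, offset) address to the chip that actually serves it.

        This lookup runs on every access, failed chip or not, which is the
        software analogue of the delay added to the addressing nets.
        """
        return self.translation.get(chip, chip), offset
```

Because the whole chip is replaced, the failure mode within the chip does not matter, which is consistent with the method working for all chip failure modes. The cost is that every access, including accesses to healthy chips, passes through `translate`.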