Computer system downtime for maintenance reasons is very undesirable. This is especially true in large-scale computer systems such as the type designed by Cray Research, Inc., the assignee of the present application. Downtime has detrimental effects in all applications of computers. For example, computer system downtime may delay results of lengthy and complex calculations, and this delay could affect product development which relies upon those calculations. In a competitive marketplace, a delay in product delivery could be devastating for a company. If time on a computer is leased, downtime could reduce the amount of time leased and hence decrease the corresponding revenue. Other applications of computers likewise suffer from the delay caused by downtime.
The detrimental effects of computer system downtime may be minimized by increasing the reliability and mean time between failures of the system. One critical factor creating system downtime is memory failure. Following fabrication of memory circuits, some cells may be defective. During system performance, cells may become defective due to operating or environmental conditions. Minimizing these memory failures will help to reduce or prevent system downtime.
One method of compensating for cells that are defective as fabricated is to include dummy cells within the memory array. When the defective cells are discovered during testing of the memory array following fabrication, the dummy cells may be electronically substituted for the defective cells. This process, however, can be used only before the computer system is fully assembled and is not effective during system operation. The substitution is also permanent and cannot be reversed.
Some computer systems use error correction codes to detect and correct memory errors in hardware. The error correction codes may become complex, however, and there is a practical limit to the number of bits which may be corrected by this method. This limit may be determined by the number of additional bits required to implement error correction codes. Furthermore, these additional bits require changes in the capacity of the memory banks, busses, and related circuitry, all of which will severely affect an original memory design.
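The cost in additional bits can be made concrete with a short Python sketch. It assumes a Hamming single-error-correcting code, optionally extended with one overall parity bit for double-error detection; the text above does not name a particular code, and the function names are illustrative only.

```python
def hamming_check_bits(data_bits: int) -> int:
    """Smallest r satisfying 2**r >= data_bits + r + 1.

    This is the standard bound for a Hamming code that corrects
    any single-bit error in a word of `data_bits` data bits.
    """
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r


def secded_check_bits(data_bits: int) -> int:
    """Check bits for single-error correction plus double-error detection.

    Assumes the common construction: a Hamming code plus one overall parity bit.
    """
    return hamming_check_bits(data_bits) + 1


# Stored word width for a few data widths (data bits + check bits).
stored_widths = {n: n + secded_check_bits(n) for n in (8, 32, 64)}
```

Under these assumptions a 64-bit data word needs 8 check bits and is stored as 72 bits, so every memory bank and bus carrying the word must be widened accordingly. Correcting more than one bit per word requires stronger codes with still more check bits.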
Some systems use shifting techniques in order to reconfigure memory. These systems shift data around a defective or bad chip on the inputs and outputs to memory. A spare chip effectively replaces the defective or bad chip. When these systems initiate reconfiguration of memory, identical shifting occurs on both the inputs and outputs. Therefore, previously stored data cannot be read from memory in its state as originally stored, because any read operation will occur subject to the shifting of data on the outputs.
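This limitation can be illustrated with a minimal Python sketch. It assumes a hypothetical word of four data bits spread over five single-bit chips, the last of which is the spare; the names `shift_in` and `shift_out` and the chip layout are illustrative assumptions and are not taken from any particular system.

```python
def _chip_for_bit(bit_index, bad_chip):
    """Chip that holds a given data bit; bits at or past the bad chip move up by one."""
    if bad_chip is None or bit_index < bad_chip:
        return bit_index
    return bit_index + 1


def shift_in(data_bits, bad_chip):
    """Input-side shift: place data bits on the chips, skipping `bad_chip`."""
    chips = [0] * (len(data_bits) + 1)  # one extra position for the spare chip
    for i, bit in enumerate(data_bits):
        chips[_chip_for_bit(i, bad_chip)] = bit
    return chips


def shift_out(chips, bad_chip):
    """Output-side shift: collect data bits from the chips, skipping `bad_chip`."""
    width = len(chips) - 1
    return [chips[_chip_for_bit(i, bad_chip)] for i in range(width)]


word = [1, 0, 1, 1]

# Written and read before reconfiguration: no shift on either side.
stored_before = shift_in(word, bad_chip=None)
read_before = shift_out(stored_before, bad_chip=None)

# Chip 1 is then declared bad and the same shift is applied to inputs and outputs.
read_new_data = shift_out(shift_in(word, bad_chip=1), bad_chip=1)
read_old_data = shift_out(stored_before, bad_chip=1)
```

In this model `read_new_data` equals `word`, but `read_old_data` does not: the word stored before reconfiguration is read through the output shift even though it was written without one, so it cannot be recovered in its state as originally stored.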
A need thus exists for an apparatus for reconfiguring a memory during system operation in order to avoid time-consuming and undesirable system maintenance downtime. A need further exists for a memory reconfiguration apparatus which allows independent shifting on inputs and outputs to memory so that, for example, stored data may be read from memory in its state as stored while data is written to memory in a reconfigured state.