Digital computers have been used for many years to perform many useful tasks. All digital computers today have memories which are used to store both data and computer instructions. Such memories are virtually all solid state devices, which have the advantage of being inexpensive to manufacture while providing high speed operation.
These memories today comprise a plurality of chips, each of which typically has 1 million or more addressable single bit positions. By configuring a plurality of such chips in rows and columns so as to respond to an address, a data word can be stored at or retrieved from the addressed location. In small computer systems, each addressable location comprises a byte, or 8 data bits, while in large computer systems an addressable location typically comprises one, two or even four 32 bit words. In all such systems, each addressable location may also have parity bits for elementary checking of the memory.
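The row and column arrangement can be pictured with a minimal sketch. It assumes chips of exactly 2**20 single bit positions, one chip per bit of the word, and a simple linear address split. The names `CHIP_BITS` and `locate` are hypothetical and are not taken from any particular memory design.

```python
# Assumption: each chip holds 2**20 single-bit positions ("1 million or more").
CHIP_BITS = 1 << 20


def locate(address, word_width=8):
    """Map an address to (chip_row, position_in_chip, chips_in_row).

    Every chip in the selected row supplies one bit of the addressed word,
    so a row of `word_width` chips yields a `word_width`-bit location.
    """
    chip_row, position = divmod(address, CHIP_BITS)
    # Hypothetical chip numbering: chips are numbered row by row.
    chips = [chip_row * word_width + column for column in range(word_width)]
    return chip_row, position, chips
```

For a byte-wide memory, address `CHIP_BITS + 5` falls in the second row of chips, at position 5 within each of the eight chips of that row.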
With the very large number of memory chips that are required in large computer system memories, the probability increases that one or more memory chips may fail in use, thereby making the memory output unreliable unless some measure of fault tolerance is built into the system. As a result, memories have been developed which are capable of operating with at least some defective storage locations. Examples of such memory systems are found in U.S. Pat. Nos. 3,331,058 and 3,436,734.
In some prior art memories, when a given location is determined to be defective, the defective location is bypassed. In other memories, an auxiliary memory is employed to store the data which would otherwise be stored at a defective location. Suitable control circuitry is provided to ensure that, if a defective location is addressed, the data is either read from or stored into the auxiliary memory. An example of such a system is found in U.S. Pat. No. 4,450,559.
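The auxiliary memory approach can be illustrated with a minimal software sketch. It assumes that the set of defective addresses is already known and models the control circuitry as a simple lookup. The class `RemappedMemory` and its fields are hypothetical and do not reproduce the circuitry of the cited patent.

```python
class RemappedMemory:
    """Main memory whose known-defective locations are served by an auxiliary store."""

    def __init__(self, size, defective):
        self.main = [0] * size
        self.defective = set(defective)  # addresses determined to be defective
        self.aux = {}                    # auxiliary memory: address -> data

    def write(self, address, value):
        # Steer the store to the auxiliary memory if the location is defective.
        if address in self.defective:
            self.aux[address] = value
        else:
            self.main[address] = value

    def read(self, address):
        # Steer the fetch the same way, so the defective cell is never used.
        if address in self.defective:
            return self.aux.get(address, 0)
        return self.main[address]
```

With `mem = RemappedMemory(8, defective=[3])`, a write to address 3 lands in `mem.aux` and leaves `mem.main[3]` untouched, while all other addresses behave as ordinary memory.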
Another approach to solving the problem of defective memory locations has been to utilize error detection and correction techniques. In using such an approach, the memory is designed so that each data word read from memory has both data bits and error detecting and correcting bits. The easiest and best known check is a parity check to isolate where in the data word an error has occurred. To correct the error once it has been identified and located, further error correction bits are needed. In implementing such a system, it is well known that the number of checking and correcting bits required for each memory word becomes larger as the number of detectable and correctable errors per memory word goes up. As such, there is an increased hardware cost penalty whenever an error detection and correction system is to be implemented in a computer system.
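The growth in check bits can be made concrete with the sphere-packing (Hamming) bound, a standard result from coding theory that is used here only as an illustration. The sketch below computes a lower bound on the number of check bits for a given number of correctable errors; practical codes may need more bits than this bound. The function name `min_check_bits` is hypothetical.

```python
from math import comb


def min_check_bits(data_bits, correctable):
    """Smallest r satisfying the sphere-packing bound.

    A code with r check bits has 2**r distinct syndromes, which must at
    least cover every pattern of up to `correctable` bit errors in a
    codeword of data_bits + r bits (including the no-error pattern).
    """
    r = 0
    while True:
        n = data_bits + r
        patterns = sum(comb(n, i) for i in range(correctable + 1))
        if 2 ** r >= patterns:
            return r
        r += 1
```

For a 64 bit data word the bound gives 7 check bits for single error correction and 12 for double error correction. Adding one overall parity bit to the 7 bit case, to also detect double errors, gives 8 check bits per 64 bit word.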
In typical memory systems having error detection and correction capability, each data fetch results in receiving a plurality of data bits and some checking bits. For example, in some large contemporary computer memories such as that of the IBM 3090, the system will fetch 2 data words of 64 bits each plus 16 error checking and correcting (ECC) bits on each memory fetch operation. The system is designed so that 8 ECC bits are used to detect and correct errors in one of the two data words, and the other data word is checked and corrected by a second set of 8 ECC bits. Given the number of ECC bits used in this scheme, one bit errors in each 64 bit data word can be detected and corrected. The scheme also permits detection of 2 bit errors in each 64 bit word, although such errors cannot be corrected.
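The behavior described for 8 ECC bits per 64 bit word can be reproduced with a textbook extended Hamming code, sketched below. This is an assumption made for illustration only: the text does not state which code the IBM 3090 uses, and the bit layout and the names `encode` and `decode` are hypothetical.

```python
N = 72                                # 64 data bits + 8 check bits
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]  # Hamming check bits; position 0 is overall parity
DATA_POS = [p for p in range(1, N) if p not in PARITY_POS]  # 64 data positions


def encode(data):
    """Return a 72-element bit list for a 64-bit integer."""
    bits = [0] * N
    for i, p in enumerate(DATA_POS):
        bits[p] = (data >> i) & 1
    # Each check bit makes the parity of all positions sharing its index bit even.
    for pp in PARITY_POS:
        bits[pp] = sum(bits[p] for p in range(1, N) if p & pp) % 2
    bits[0] = sum(bits) % 2  # overall parity over the whole codeword
    return bits


def decode(bits):
    """Return (data, status); status is 'ok', 'corrected' or 'uncorrectable'."""
    bits = list(bits)
    syndrome = 0
    for p in range(1, N):
        if bits[p]:
            syndrome ^= p  # XOR of set positions is 0 for a valid codeword
    overall = sum(bits) % 2
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1 and syndrome < N:
        bits[syndrome] ^= 1  # single error; syndrome 0 means the overall parity bit
        status = "corrected"
    else:
        status = "uncorrectable"  # e.g. a 2 bit error: detected, not corrected
    data = sum(bits[p] << i for i, p in enumerate(DATA_POS))
    return data, status
```

Encoding each of the two 64 bit words separately yields 2 x 72 = 144 bits, which matches the 128 data bits plus 16 ECC bits received on each fetch.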
To further enhance such memory systems, spare chips have been included for replacing chips which have been detected to have failed. When a failing bit is identified in a given data word, a spare chip is assigned to replace the failing chip. For the exemplary memory system described above, spare chips are assigned to each row of chips that supplies each pair of 64 bit data words. Hence, for a quad word fetch, two words of 64 data bits plus 16 ECC bits plus the available spare bits are fetched.
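Spare chip assignment can be sketched as a steering table that maps each bit of the fetched word to the physical chip that supplies it. The sketch assumes one bit per chip, a stuck-at fault model, and two spares per row. These choices, and the class `ChipRow`, are hypothetical and serve only to illustrate the idea.

```python
class ChipRow:
    """One row of single-bit chips plus spares, viewed at one address."""

    def __init__(self, width, spares):
        self.cells = [0] * (width + spares)  # the bit held by each physical chip
        self.stuck = {}                      # fault model: chip index -> stuck-at value
        self.steer = list(range(width))      # logical bit -> physical chip
        self.free_spares = list(range(width, width + spares))

    def assign_spare(self, bit):
        """Replace the chip supplying `bit` with the next unused spare chip."""
        if not self.free_spares:
            raise RuntimeError("no spare chips left in this row")
        self.steer[bit] = self.free_spares.pop(0)

    def write(self, word):
        for bit, chip in enumerate(self.steer):
            self.cells[chip] = (word >> bit) & 1

    def read(self):
        word = 0
        for bit, chip in enumerate(self.steer):
            # A failed chip returns its stuck value regardless of what was stored.
            word |= self.stuck.get(chip, self.cells[chip]) << bit
        return word
```

For the row described above, `ChipRow(144, 2)` models 128 data bits plus 16 ECC bits with two assumed spares. After `assign_spare(10)`, bit 10 is supplied by a spare chip, and the data must be rewritten so that the spare holds the correct value.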