Most digital computer systems contain built-in hardware and software routines that can detect and, sometimes, correct internal hardware failures. Generally, error correction is performed by means of error-correcting codes. In accordance with conventional error correction schemes, a first error-correcting code word is computed, by means of a predetermined mathematical function, from the data which is to be stored or transmitted. The first error-correcting code is then stored or transmitted along with the data. When the data is later retrieved or reaches its destination, a second error-correcting code is computed from the retrieved or transmitted data using the same mathematical function as was originally used to compute the first error-correcting code. The first code is compared to the second code and, if the codes are equal, it is assumed that no errors in transmission or storage have occurred. If the codes are not equal, it is assumed that an error has occurred. In many cases, by an inspection of the differences between the codes, the location of the error can be determined and, in some cases, the error can be corrected.
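By way of illustration only, the compute, store, recompute and compare sequence described above may be sketched in Python as follows. The sketch assumes a Hamming-style check function in which each data bit is assigned a codeword position that is not a power of two and the check code is the exclusive-OR of the positions of all set bits. The function names and the eight-bit example word are hypothetical and are not taken from any particular machine.

```python
def data_positions(count):
    """Return the first `count` codeword positions that are not powers of two."""
    positions = []
    position = 1
    while len(positions) < count:
        if position & (position - 1):  # non-zero only when not a power of two
            positions.append(position)
        position += 1
    return positions


def check_code(data_bits):
    """Compute the check code: XOR of the positions of all data bits that are set."""
    code = 0
    for bit, position in zip(data_bits, data_positions(len(data_bits))):
        if bit:
            code ^= position
    return code


def locate_error(stored_code, retrieved_bits):
    """Compare the stored code with a code recomputed from the retrieved data.

    Returns None when the codes are equal. Otherwise returns the index of the
    data bit implied by the difference, assuming a single data bit is in error.
    """
    difference = stored_code ^ check_code(retrieved_bits)
    if difference == 0:
        return None
    positions = data_positions(len(retrieved_bits))
    if difference not in positions:
        raise ValueError("difference does not correspond to a single data-bit error")
    return positions.index(difference)


# Store: compute the first code from the original data.
original = [1, 0, 1, 1, 0, 0, 1, 0]
first_code = check_code(original)

# Retrieve: one bit has been flipped in storage.
retrieved = list(original)
retrieved[5] ^= 1

# Compare: the difference between the codes points at the failed bit.
failed_index = locate_error(first_code, retrieved)
retrieved[failed_index] ^= 1  # correct the single-bit error
```

When the codes are equal the function reports no error; when they differ, the difference itself identifies the failed bit, which is the inspection of the differences referred to above.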
Generally, when an error occurs in hardware, hardware diagnostic circuits generate error-correcting codes and place them in a portion of the computer's memory known as an "error log". For example, a common error correcting code is known as a "Hamming code". This code can be used not only to detect an error but also to correct single bit errors. In the case of a single bit error, diagnostic software records, in the error log, the Hamming code and the component address which was present on the address bus when the error occurred.
Failures can occur in memory boards, input devices, output devices, disk operating systems and even the CPU. If the failure is in a memory component, then in addition to the address of the failed component, the bit position in the memory word which caused the failure is normally also recorded in the error log. This latter information is necessary due to the conventional construction of computer random-access memories (RAMs). More particularly, each memory array is typically comprised of a number of integrated circuit chips. Each chip contains a large number of memory locations and memory address circuitry which allows it to select and access one of the internal memory locations. However, each internal chip location generally holds only a single bit of information. Thus, a typical RAM integrated circuit chip may have 256,000 locations, each holding a single bit (designated as a 256K X 1 RAM chip). In order to construct a memory array of multiple-bit words (for example, a 40 bit word), a plurality of RAM chips are stacked together to form an array. Thus, a 40 bit by 256K memory array would require forty 256K X 1 RAM chips. The address decoding circuitry on each chip is arranged so that corresponding memory locations are accessed in each chip to form a complete multiple-bit word. With this construction, each bit of the memory word is located in a different RAM memory chip, so that the bit position of the faulty bit within the memory word must be recorded in the error log along with the actual memory address.
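The bit-slice construction described above may be modeled, purely as an illustrative sketch, in Python. The class and function names are hypothetical, 256K is taken here to mean 262,144 locations, and the example uses a much smaller array so that it runs quickly.

```python
WORD_BITS = 40                # bits per memory word, as in the example above
CHIP_LOCATIONS = 256 * 1024   # assumed number of locations in a 256K X 1 chip


class BitSliceMemory:
    """Word-wide memory built from one single-bit-wide chip per bit position."""

    def __init__(self, word_bits=WORD_BITS, locations=CHIP_LOCATIONS):
        self.word_bits = word_bits
        # chips[b] models the chip that holds bit b of every word.
        self.chips = [bytearray(locations) for _ in range(word_bits)]

    def write(self, address, value):
        # The same address is presented to every chip; each stores one bit.
        for bit, chip in enumerate(self.chips):
            chip[address] = (value >> bit) & 1

    def read(self, address):
        # Gather one bit from the corresponding location in each chip.
        value = 0
        for bit, chip in enumerate(self.chips):
            value |= chip[address] << bit
        return value


def differing_bit_positions(expected, actual, word_bits=WORD_BITS):
    """Return the bit positions at which two words differ."""
    difference = expected ^ actual
    return [bit for bit in range(word_bits) if (difference >> bit) & 1]


# Small example array: 40 chips of 1024 locations each.
memory = BitSliceMemory(word_bits=40, locations=1024)
memory.write(0x2A, 0x123456789A)

# Simulate a failure in the chip holding bit 17 at the addressed location.
memory.chips[17][0x2A] ^= 1

faulty_bits = differing_bit_positions(0x123456789A, memory.read(0x2A))
```

Because every chip sees the same address, the address alone cannot distinguish among the forty chips; in this model only the bit position (here, 17) identifies the chip that failed.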
If a single memory bit is in error, it is then corrected so that the data is correct. For multiple-bit errors, the Hamming code and failure address are recorded in the error log, but generally multiple-bit errors cannot be corrected.
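The distinction between correctable and uncorrectable errors may be sketched as follows. This sketch assumes an extended Hamming arrangement in which an overall parity bit accompanies the Hamming code, so that a parity mismatch indicates an odd number of failed bits; that arrangement, the record layout and all names are assumptions made for illustration and are not stated above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ErrorLogEntry:
    address: int                             # failure address from the address bus
    syndrome: int                            # difference between stored and recomputed codes
    corrected: bool                          # True when the error was a correctable single bit
    codeword_position: Optional[int] = None  # position implied by the syndrome, if correctable


def record_error(error_log, address, syndrome, overall_parity_mismatch):
    """Classify an error and append an entry to the error log.

    Assumed convention: a parity mismatch means an odd number of bits failed
    and is treated as a single-bit error; a non-zero syndrome with matching
    parity means an even number of bits failed and cannot be corrected.
    """
    if syndrome == 0 and not overall_parity_mismatch:
        return None  # codes agree; nothing to log
    if overall_parity_mismatch:
        entry = ErrorLogEntry(address, syndrome, True, syndrome)
    else:
        entry = ErrorLogEntry(address, syndrome, False)
    error_log.append(entry)
    return entry


error_log = []
single = record_error(error_log, 0x1000, 10, True)    # single-bit error, corrected
multiple = record_error(error_log, 0x2000, 6, False)  # multiple-bit error, logged only
```

In both cases the code and the failure address are logged; only the single-bit case carries a position that can be used to correct the data.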
Although the error log contains information regarding the failure address, bit position and error code, a problem often arises when a technician attempts to physically locate the printed circuit board on which the failed component resides. In order to do this the technician must know the actual address decoding scheme used on each printed circuit wiring board for the particular model of the computer under repair. Since address decoding schemes are often quite complicated, it is usually necessary for the technician to carry decoding tables for each computer system which he is expected to repair. The tables contain address ranges and the actual printed circuit boards on which each address range lies.
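A decoding table of the kind described above amounts to a lookup from an address range to a board. A minimal sketch follows; the table contents and board labels are invented for illustration and do not describe any actual machine.

```python
from bisect import bisect_right

# Hypothetical decoding table for one machine model:
# (first address, last address, board label), sorted by first address.
DECODING_TABLE = [
    (0x000000, 0x0FFFFF, "board A"),
    (0x100000, 0x1FFFFF, "board B"),
    (0x200000, 0x3FFFFF, "board C"),
]


def board_for_address(address, table=DECODING_TABLE):
    """Return the board whose address range contains `address`, or None."""
    starts = [first for first, _, _ in table]
    index = bisect_right(starts, address) - 1  # last range starting at or below address
    if index >= 0:
        first, last, board = table[index]
        if first <= address <= last:
            return board
    return None


failed_board = board_for_address(0x123456)
```

A separate table of this kind would be needed for each computer model, which is the burden on the technician noted above.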
The decoding problem becomes further complicated if a memory component fails and there are several sizes of memory arrays in a single computer system. For example, different-sized memory boards may be used in order to precisely tailor the memory size to the customer's needs. In order for a technician to use decoding tables, it may be necessary for him to carry a set of tables for each model of machine which he is expected to repair.
In other systems, the memory decoding scheme may be established under control of the operating software at the time when the system is configured. Applicant's co-pending application Ser. No. 046457, filed May 4, 1987 is an example of such a system. In some software-configured systems, the decoding scheme is determined by the actual size of the memory array located on each board. However, other computers decode memory addresses on each board based on the board which contains the largest memory array. Consequently, it may not be possible for a technician to ascertain that a faulty memory chip is located on a particular board even if he knows the range of memory addresses in which the faulty memory location resides. For such a system, memory tables cannot be used.
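The difference between the two decoding approaches may be illustrated by the following sketch. It assumes one plausible reading of the second approach, namely that every board is allotted an address slot equal in size to the largest board; the actual schemes used by particular computers are not specified above, and the board sizes and function names are hypothetical.

```python
MB = 1 << 20  # one megabyte of address space


def decode_by_actual_size(address, board_sizes):
    """Boards occupy contiguous ranges sized by their own memory arrays."""
    base = 0
    for board, size in enumerate(board_sizes):
        if base <= address < base + size:
            return board
        base += size
    return None


def decode_by_largest_board(address, board_sizes):
    """Every board is allotted a slot as large as the largest board (assumed)."""
    slot = max(board_sizes)
    board, offset = divmod(address, slot)
    if 0 <= board < len(board_sizes) and offset < board_sizes[board]:
        return board
    return None  # address falls in an unpopulated part of a slot, or beyond the last board


# Hypothetical configuration: three boards of 1, 4 and 2 megabytes.
sizes = [1 * MB, 4 * MB, 2 * MB]
failure_address = 6 * MB

board_if_actual_size = decode_by_actual_size(failure_address, sizes)
board_if_largest = decode_by_largest_board(failure_address, sizes)
```

Under these assumptions the same failure address resolves to the third board in one scheme and to the second board in the other, so a fixed table of address ranges cannot identify the board without knowledge of the scheme and the installed board sizes.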
Thus, it is often difficult and time-consuming to physically locate the printed wiring board which contains the faulty component in order to replace the board and restore the system to normal operation.