1. Field of the Invention
The present invention generally relates to general purpose digital data processing systems and more particularly relates to such systems which provide fault detection therein.
2. Description of the Prior Art
A key design element of high reliability computer systems is that of error detection and correction. It has long been recognized that the integrity of the data bits within the computer system is critical to ensure the accuracy of operations performed in the data processing system. The alteration of a single data bit in a data word can dramatically affect arithmetic calculations or can change the meaning of a data word as interpreted by other sub-systems within the computer system.
One method for performing error detection is to associate an additional bit, called a "parity bit", along with the binary bits comprising a data word. The data word may comprise data, an instruction, an address, etc. Parity involves summing without carry the bits representing a "one" within a data word and providing an additional "parity bit" so that the total number of "ones" across the data word, including the added parity bit, is either odd or even. The term "Even Parity" refers to a parity mechanism which provides an even number of ones across the data word including the parity bit. Similarly, the term "Odd Parity" refers to a parity mechanism which provides an odd number of ones across the data word including the parity bit.
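The parity computation described above can be sketched in a few lines. This is a minimal illustration only; the function name `parity_bit` and its parameters are hypothetical and do not come from the described system.

```python
def parity_bit(word: int, width: int, even: bool = True) -> int:
    """Return the parity bit for the low `width` bits of `word`.

    Hypothetical helper for illustration. With even parity the returned
    bit makes the total number of ones (data plus parity) even; with odd
    parity it makes that total odd.
    """
    data = word & ((1 << width) - 1)
    # Summing the one-bits without carry is the XOR of all data bits,
    # which equals the count of ones modulo 2.
    ones_mod_2 = bin(data).count("1") & 1
    return ones_mod_2 if even else ones_mod_2 ^ 1


# 0b1011 holds three ones, so even parity adds a one (four ones in total)
# and odd parity adds a zero (three ones in total).
even_bit = parity_bit(0b1011, 4)
odd_bit = parity_bit(0b1011, 4, even=False)
```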
A typical system which uses parity as an error detection mechanism has a parity generation circuit for generating the parity bit. For example, when the system stores a data word into a memory, the parity generation circuit generates a parity bit from the data word and the system stores both the data word and the corresponding parity bit into an address location in the memory. When the system reads the address location where the data word is stored, both the data word and the corresponding parity bit are read from the memory. The parity generation circuit then regenerates the parity bit from the data bits read from the memory device and compares the regenerated parity bit with the parity bit that was stored in memory. If the regenerated parity bit and the original parity bit do not match, an error is detected and the system is notified.
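The store and read sequence above may be modeled in software as follows. This is a sketch under stated assumptions: the class `ParityMemory`, the exception `ParityError`, and the fault-injection method `flip_bit` are hypothetical names chosen for illustration, and even parity is assumed.

```python
class ParityError(Exception):
    """Raised when the regenerated parity bit differs from the stored one."""


class ParityMemory:
    """Hypothetical memory model storing one even-parity bit per data word."""

    def __init__(self, width=8):
        self.width = width
        self.cells = {}  # address -> (data word, stored parity bit)

    def _generate(self, word):
        # Parity generation circuit: count of ones modulo 2 (even parity).
        return bin(word & ((1 << self.width) - 1)).count("1") & 1

    def write(self, addr, word):
        word &= (1 << self.width) - 1
        self.cells[addr] = (word, self._generate(word))

    def read(self, addr):
        word, stored = self.cells[addr]
        # Regenerate the parity bit from the data read and compare.
        if self._generate(word) != stored:
            raise ParityError(f"parity mismatch at address {addr:#x}")
        return word

    def flip_bit(self, addr, bit):
        """Fault injection for the sketch: corrupt one stored data bit."""
        word, stored = self.cells[addr]
        self.cells[addr] = (word ^ (1 << bit), stored)


mem = ParityMemory()
mem.write(0x10, 0b10110010)
clean = mem.read(0x10)      # parity matches, the word is returned
mem.flip_bit(0x10, 3)       # simulate a single bit error in storage
try:
    mem.read(0x10)
    detected = False
except ParityError:
    detected = True         # the single bit error is reported
```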
It is readily known that a single parity bit in conjunction with a multiple bit data word can detect a single bit error within the data word. However, it is also readily known that a single parity bit in conjunction with a multiple bit data word can be defeated by multiple errors within the data word. As calculation rates increase, circuit sizes decrease, and voltage levels of internal signals decrease, the likelihood of multiple errors within a data word increases. Therefore, methods to detect multiple errors within a data word are essential. In response thereto, system designers have developed methods for detecting multiple errors within multiple bit data words by providing multiple parity bits for each multiple bit data word.
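The following sketch illustrates how two bit errors can defeat a single parity bit, and how multiple parity bits can catch some such cases. The grouping used here (one even-parity bit per 4-bit slice) is only one assumed arrangement chosen for illustration; the text does not specify how the multiple parity bits are assigned, and the function names are hypothetical.

```python
def word_parity(word, width=8):
    """Single even-parity bit over the whole word."""
    return bin(word & ((1 << width) - 1)).count("1") & 1


def group_parities(word, width=8, group=4):
    """One even-parity bit per `group`-bit slice, lowest slice first."""
    mask = (1 << group) - 1
    return [bin((word >> shift) & mask).count("1") & 1
            for shift in range(0, width, group)]


original = 0b10110010
two_errors = original ^ 0b10000001   # flip bit 7 and bit 0

# A single parity bit cannot see the double error ...
missed_by_single = word_parity(two_errors) == word_parity(original)
# ... but per-slice parity can, because each slice has one flipped bit.
caught_by_groups = group_parities(two_errors) != group_parities(original)

# Two errors inside the same slice still escape this particular grouping.
same_slice = original ^ 0b00000011   # flip bit 1 and bit 0
still_missed = group_parities(same_slice) == group_parities(original)
```

As the last case shows, adding parity bits narrows the set of undetected error patterns without eliminating it.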
The above referenced error detection schemes may be used on a number of internal nodes within a computer system. However, because additional hardware must be provided for each node protected by the error detection scheme, it may not be practical to provide error detection to a large number of internal nodes within the computer system. That is, the additional cost and power required to provide error detection to a large number of internal nodes of a computer system may be prohibitive.
To combat this limitation, a number of critical nodes within a computer system may be provided with the above referenced error detection schemes. The critical nodes may provide a mechanism for detecting if an error exists within the computer system. However, because the percentage of nodes that typically are provided with an error detection capability may be relatively low when compared to the total number of nodes within the computer system, the actual hardware source of the error may not be isolated with any certainty using this technique.
In a typical prior art system, the system may have a number of circuit boards. Each circuit board may have a number of integrated circuit devices. Each integrated circuit device may have an error register therein. A number of error detection units may be provided on a number of critical nodes within each integrated circuit, as described above. Each of the number of error detection units may assert an error bit within the corresponding error register when an error is detected. That is, the error register may have a number of bits wherein each bit may correspond to one or more error detection units. After an error is detected, the corresponding integrated circuit device may freeze the state of the corresponding error register. Freezing the state of the error register may allow the system to perform accurate error analysis later. The corresponding error register may then assert an error bit within a board level error register. The board level error register may have an error bit for each integrated circuit device located on the board. The corresponding board may then freeze the state of the board level error register and provide an error signal to a support controller.
Using this architecture, the support controller may read the contents of the board level error register and determine which of the integrated circuits may have caused the error. Thereafter, the support controller may read the error register in the corresponding integrated circuit device to determine what portion of the integrated circuit may have caused the error.
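The hierarchical reporting and lookup described above might be modeled as shown below. This is a simplified sketch, not the prior art implementation: the names `ErrorRegister`, `Board`, and `locate_fault` are hypothetical, each register bit is assumed to map to exactly one source, and each register is assumed to freeze on the first error it latches.

```python
class ErrorRegister:
    """Hypothetical register with one bit per error source."""

    def __init__(self, width):
        self.width = width
        self.bits = 0
        self.frozen = False

    def latch(self, index):
        """Set bit `index` and freeze; return False if already frozen."""
        if self.frozen:
            return False
        self.bits |= 1 << index
        self.frozen = True   # preserve state for later error analysis
        return True

    def set_bits(self):
        return [i for i in range(self.width) if (self.bits >> i) & 1]


class Board:
    """One board level register plus one error register per device."""

    def __init__(self, num_chips, units_per_chip):
        self.chip_regs = [ErrorRegister(units_per_chip)
                          for _ in range(num_chips)]
        self.board_reg = ErrorRegister(num_chips)

    def detect(self, chip, unit):
        """Detection unit `unit` on device `chip` reports an error.

        Returns True when an error signal reaches the support controller.
        """
        if self.chip_regs[chip].latch(unit):
            return self.board_reg.latch(chip)
        return False


def locate_fault(board):
    """Support controller: read the board register, then device registers."""
    return [(chip, board.chip_regs[chip].set_bits())
            for chip in board.board_reg.set_bits()]


board = Board(num_chips=3, units_per_chip=4)
signaled = board.detect(chip=1, unit=2)
suspects = locate_fault(board)   # device 1, detection unit 2
```

Note that the lookup resolves only to a detection unit, which matches the limitation discussed in the text: nodes without a detection unit are not distinguished.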
In the above described system, a hierarchical approach is used to help isolate the portion of the circuitry which may have caused the error. However, this approach may only isolate an error to a particular error detection unit. That is, the hierarchical approach may have the same limitations as described above with reference to the error detection schemes. Namely, because additional hardware must be provided for each node protected by the error detection scheme, it may not be practical to provide error detection to a large number of internal nodes within the computer system. That is, the additional cost and power required to provide error detection to a large number of internal nodes of a computer system may be prohibitive. Because the percentage of nodes that typically are provided with an error detection capability may be relatively low when compared to the total number of nodes within the computer system, the actual hardware source of the error may not be isolated with any certainty.
One way to further isolate the source of the error is to use built-in-self-test (BIST) techniques. It is known that the complexity of computer systems has increased exponentially over the past several decades. Because of this increased complexity, many of the internal nodes within modern computer systems are not controllable or observable from the external I/O pins. BIST design techniques have been developed to combat this growing problem. BIST can be used to make the internal nodes of a complex computer system both controllable and observable and therefore testable. This is the only method of ensuring hardware integrity in many modern computer systems.
One method for providing BIST is to replace the functional registers within a design with serial scan shift registers. The serial scan shift registers can operate in both a functional mode and a test mode. During normal operations, the serial scan shift registers are placed in functional mode and operate like any other flip-flop. In test mode, the serial scan shift registers are configured into a scan path which allows test data to be "serially shifted" through the registers within the design. Typically, a support controller scans in computer generated serial scan test vectors through the serial scan shift registers within the design. Once these vectors are fully shifted into the design, the data residing in the serial scan shift registers then travels through the logic gates and eventually arrives at either an I/O pin or another serial scan shift register. The serial scan shift registers are then switched into functional mode and the clock is pulsed once. The functional clock causes the serial scan shift registers to capture the data that has traveled through the logic gates. The serial scan shift registers are then switched back into test mode and the results are shifted out and compared to an expected value. This process may be repeated until the source of an error is identified.
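The shift, capture, and compare sequence described above can be illustrated with a small software model. This is a sketch under assumptions: the class `ScanChain`, the helper functions, the four-register chain, the XOR logic, and the stuck-at-0 fault are all hypothetical and chosen only to make the sequence concrete.

```python
class ScanChain:
    """Hypothetical model of registers joined into one serial scan path."""

    def __init__(self, length, logic):
        self.regs = [0] * length
        self.logic = logic   # combinational logic: register state -> next state

    def shift(self, bits_in):
        """Test mode: shift one bit per clock; return the bits shifted out."""
        out = []
        for b in bits_in:
            out.append(self.regs[-1])
            self.regs = [b] + self.regs[:-1]
        return out

    def capture(self):
        """Functional mode: one clock pulse captures the logic outputs."""
        self.regs = self.logic(self.regs)


def good_logic(r):
    n = len(r)
    return [r[i] ^ r[(i + 1) % n] for i in range(n)]


def faulty_logic(r):
    out = good_logic(r)
    out[2] = 0               # assumed fault: input of register 2 stuck at 0
    return out


def run_test(logic, vector):
    chain = ScanChain(len(vector), logic)
    chain.shift(vector)                      # scan the test vector in
    chain.capture()                          # pulse the functional clock once
    return chain.shift([0] * len(vector))    # scan the results out


def failing_registers(expected, observed):
    """Map mismatching scan-out positions back to register indices."""
    n = len(expected)
    return [n - 1 - i for i in range(n) if expected[i] != observed[i]]


vector = [1, 0, 1, 1]
expected = run_test(good_logic, vector)      # stands in for the expected value
observed = run_test(faulty_logic, vector)
suspects = failing_registers(expected, observed)
```

Because the last register is shifted out first, position `i` in the scan-out stream corresponds to register `n - 1 - i`, which is why the helper reverses the index.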
A technique that may be used for automating the latter part of this process is to provide a signature analysis register within the design. The signature analysis register is coupled to predetermined nodes within the design. As a predefined pattern is shifted through the design, the signature analysis register is updated periodically. At the end of the test, the contents of the signature analysis register are compared to an expected "signature". If there is a match, the system is deemed to be fully functional. This eliminates the need to compare the results of the serial scan vectors with an expected result and therefore may reduce the complexity of the support controller. However, the signature analysis approach described above typically only provides a pass or fail result and generally does not identify the faulty hardware element. This is a drawback of the signature analysis approach.
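One common way to build a signature analysis register is a multiple-input signature register based on a linear feedback shift register. The text does not say which structure is used, so the sketch below is an assumption for illustration only; the function name, the 8-bit width, and the feedback taps are hypothetical choices.

```python
def misr_signature(responses, width=8, taps=0x1D):
    """Compress a sequence of `width`-bit responses into one signature.

    Assumed structure: a Galois-style shift register whose state is
    shifted each cycle, XORed with `taps` when a one is shifted out, and
    then XORed with the values observed on the monitored nodes.
    """
    mask = (1 << width) - 1
    sig = 0
    for r in responses:
        msb = (sig >> (width - 1)) & 1
        sig = (sig << 1) & mask
        if msb:
            sig ^= taps
        sig ^= r & mask
    return sig


# Hypothetical node values observed on four successive updates.
good_run = [0x3A, 0xC5, 0x0F, 0x91]
bad_run = [0x3A, 0xC4, 0x0F, 0x91]      # one bit differs in the second update

expected_signature = misr_signature(good_run)
passed = misr_signature(good_run) == expected_signature
failed = misr_signature(bad_run) != expected_signature
```

The comparison yields only a pass or fail indication. The signature does not reveal which update or which node produced the discrepancy, which reflects the drawback noted above.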
Modern computer systems may utilize the above referenced error detection techniques on a number of critical nodes to identify when an error exists therein. A support controller or the like may then be notified of the error, whereupon BIST techniques may be used to isolate the source of the error. As indicated above, the signature analysis approach is generally not used to isolate the source of the error. Rather, a support controller may shift test vectors through a corresponding scan path as described above.
A limitation of the above referenced BIST approach is that the corresponding computer system, or at least a portion thereof, must be taken out of functional mode and placed in a test mode. This may require the computer system to be interrupted during the execution of a transaction. In some high reliability computing applications, this may not be acceptable. For example, in the banking industry, high reliability computer systems may be used to process a large number of banking transactions. It may not be acceptable to interrupt the banking computer system whenever a fault is detected, unless the fault is a critical fault which could corrupt a corresponding data base. As can readily be seen, an interruption of the banking computer may cause a transaction to be lost. Similarly, in the airline industry, high reliability computer systems may be used to process a large number of seat reservations. It may not be acceptable to interrupt the airline reservation computer system whenever a fault is detected, unless the fault is a critical fault which could corrupt a corresponding data base. As can readily be seen, an interruption of the airline reservation computer may cause a seat assignment to be lost, thereby allowing a single seat to be assigned to multiple passengers or the like.