The present invention relates in general to fault detection in digital systems, and more specifically to fault detection based on signature analysis during operational conditions.
In complex electronic systems, such as telephone switches and mainframe computers, fault detection and fault localization have become important for ensuring problem-free operation. Fault detection and localization can be employed at different stages in the “life” of a digital system. First, components, chips and boards are checked after production or before installation in a complex system, in order to remove defective units. An example of such a solution can be found in U.S. Pat. No. 5,544,174. After installation, the whole unit or system may also be thoroughly checked, in order to eliminate, for example, erroneous connections. Such solutions are disclosed in, for example, U.S. Pat. Nos. 5,600,788, 5,671,233, 5,442,643 and EP 0 733 910 A1.
After the unit or system has been taken into operation, faults may occur at any instant, and to ensure safe operation, the system has to be checked for faults intermittently. In many complex digital systems, reliable fault detection and localization of the fault to a specific replaceable plugin unit are required. The defective part may then easily be replaced, and the system may come back into operation within a very short period of time. The defective part may subsequently be checked in more detail, in order to determine whether it can be repaired and reused or has to be discarded. There is thus a need for background fault detection tests that run continuously or intermittently during the normal operation of a complex digital system.
According to the state of the art, hardware fault detection can be employed in four fundamentally different manners. Firstly, full hardware redundancy can be used. This means that there are two or more sets of hardware doing the same job, and their outputs are compared or voted on. This approach is used, for example, for the logic parts of fault tolerant computers, i.e. for the processors. The approach is very efficient in finding faults, but it involves high costs for the duplicated hardware and is therefore not economically viable in general applications.
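By way of illustration only, the comparing or voting of redundant outputs might be sketched as follows. The function name `vote` and the sample output words are hypothetical and are not taken from any cited system; the sketch assumes that each redundant unit delivers one comparable output word per cycle.

```python
from collections import Counter


def vote(outputs):
    """Return (majority_value, indices_of_disagreeing_units).

    Raises ValueError when no value is delivered by a strict majority,
    which is always the case when two units disagree with each other.
    """
    value, count = Counter(outputs).most_common(1)[0]
    if count * 2 <= len(outputs):
        raise ValueError("no majority among redundant units")
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty


# Three hypothetical redundant units; unit 1 delivers a deviating word.
result, suspects = vote([0x5A, 0x5B, 0x5A])
```

With only two units a mismatch can be detected but not attributed to either unit, which is why the sketch raises an error in that case; at least three units are needed for the vote to point out the deviating one.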
Secondly, one set of hardware is used, but it includes some redundant information that can be used for determining that the unit is faulty. This can be achieved by, for example, parity or checksums. This approach is typically used for memories in computers, but is not well suited for logic parts.
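A minimal sketch of the parity variant is given below, assuming a single even-parity bit stored alongside each data word. The function names and the sample word are hypothetical.

```python
def even_parity_bit(word):
    """Return the bit that makes the total number of 1-bits even."""
    return bin(word).count("1") % 2


def store(word):
    """Model a memory write: keep the word together with its parity bit."""
    return word, even_parity_bit(word)


def check(word, parity):
    """Model a memory read: True when the word still matches its parity."""
    return even_parity_bit(word) == parity


stored_word, stored_parity = store(0b10110010)
single_flip = stored_word ^ 0b00001000   # one bit corrupted
double_flip = stored_word ^ 0b00011000   # two bits corrupted
```

In this sketch `check(single_flip, stored_parity)` fails, whereas `check(double_flip, stored_parity)` passes, which shows the known limitation that a single parity bit only reveals an odd number of bit errors.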
The third approach is based on hardware built-in self tests (BIST). A hardware BIST implementation is based on three parts: a test controller, a test pattern generator and an output response analyser. Usually, BIST tests are destructive and are thus not possible to use as background tests. BIST tests can only be performed when the present state of the unit to be checked can be discarded, i.e. they can generally only be used when the system is shut down temporarily. Furthermore, the possible tests are determined by the BIST configuration, and modified tests or part tests that are not implemented from the start are difficult to add. Integrating BIST at board level gives excellent fault detection, and it can be done using very limited hardware resources. However, in most cases this is not possible, since this type of BIST is not supported by many standard components. Generally, BIST has good observability, but rather poor controllability for running part tests on a chip. Also, BIST is limited to functioning within one circuit, and tests of the communication between different circuits or replaceable plugin units may be difficult.
In the fourth approach, fault detection is implemented as software self-tests. The processor executes a program that exercises the hardware, reads information from registers and compares it with an expected result that is coded into the program. The extra hardware that is needed is very limited. Generally, only the extra memory space for storing the program is needed. However, fault detection of high quality can be very hard to achieve, in particular since the development of fault detection software is extensive, as it has to be specially designed for each circuit. The number and location of nodes where the result of the testing can be checked are normally quite limited. The fault location is therefore often difficult to find. Additionally, it is not possible to locate faults when signals are passed to other replaceable plugin units without special hardware support. Generally, software self-tests are easily controllable, but the observability is normally limited.
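The principle of a software self-test may be sketched as follows. The class `FakeDevice`, its register addresses and the test vectors are hypothetical stand-ins for real hardware registers; the stuck-at-zero mask is merely an assumed way of modelling a fault.

```python
class FakeDevice:
    """Stand-in for memory-mapped hardware registers (hypothetical)."""

    def __init__(self, stuck_at_zero_mask=0):
        self._regs = {}
        self._stuck = stuck_at_zero_mask  # bits that can never be set

    def write(self, addr, value):
        self._regs[addr] = value & ~self._stuck & 0xFF

    def read(self, addr):
        return self._regs.get(addr, 0)


# (address, stimulus, expected read-back) triples coded into the program.
TEST_VECTORS = [(0x10, 0xAA, 0xAA), (0x10, 0x55, 0x55), (0x11, 0xFF, 0xFF)]


def self_test(dev):
    """Exercise the device and return a list of (addr, expected, got)."""
    failures = []
    for addr, stimulus, expected in TEST_VECTORS:
        dev.write(addr, stimulus)
        got = dev.read(addr)
        if got != expected:
            failures.append((addr, expected, got))
    return failures


healthy_report = self_test(FakeDevice())
faulty_report = self_test(FakeDevice(stuck_at_zero_mask=0x01))
```

The sketch also illustrates the limited observability mentioned above: the program only sees what can be read back through registers, so a fault on an internal node that never reaches a readable register would pass unnoticed.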
Common to all of the above methods is that they normally only check the final result of a test run. Errors consisting of signal delays are therefore not very likely to be detected by most of the above methods.
Many complex digital systems, such as telephone switches, are sensitive to time delays. In telephone control systems a time delay of 0.2 is easily recognized by the users and is experienced as a severe disturbance. Such systems thus have to operate more or less continuously, and shut-down periods available for testing have to be limited to typically less than 20 ms. Fault detection tests operating as background tests therefore have to be performed within one of these shut-down periods. Either the speed demands on such fault detection tests are high, or the test has to be divided into part tests. None of the above-mentioned types of fault detection methods is suitable for such applications.
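As a rough illustration of dividing a test into part tests, the following sketch packs part tests greedily, in their given order, into successive shut-down periods. The 20 ms default mirrors the typical figure stated above; the part test names and their durations are purely hypothetical, and the greedy strategy is an assumption made for the example only.

```python
def plan_windows(part_durations_ms, window_ms=20.0):
    """Pack part tests, in order, into successive shut-down periods.

    part_durations_ms is a list of (name, duration_in_ms) pairs.
    Returns a list of windows, each a list of part test names.
    """
    windows, current, used = [], [], 0.0
    for name, duration in part_durations_ms:
        if duration > window_ms:
            raise ValueError(f"{name} does not fit in one period; split it further")
        if used + duration > window_ms:
            windows.append(current)       # close the period that is full
            current, used = [], 0.0
        current.append(name)
        used += duration
    if current:
        windows.append(current)
    return windows


# Hypothetical part tests with estimated run times in milliseconds.
parts = [("alu", 8.0), ("bus", 7.0), ("timer", 9.0), ("dma", 12.0), ("uart", 5.0)]
schedule = plan_windows(parts)
```

For the durations assumed here, the five part tests are spread over three shut-down periods, so that no single period is exceeded.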
An object of the present invention is thus to provide digital system devices and a method of performing fault detection in digital systems, which exhibit both excellent controllability and excellent observability. A further object of the present invention is to provide digital system devices and a method of performing fault detection in digital systems, which are fast and can be run as background procedures during normal operation, i.e. are non-destructive.
The above objects are achieved by digital system units according to the attached claims. The digital system units are equipped with a processor, comprising processor availability means, means for setting the logic units to be tested to a predetermined state, means for executing a stimuli generation and means for activating an output response analyser. The output response analyser comprises means for collecting responses from different nodes in the system, and means for creating signatures of the response signals. The system further comprises means for verifying the signatures and means for performing error signalling. A preferred embodiment also comprises means for storing the present state of the processor during the fault detection test.
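The creation of a signature from collected response signals may, purely as an example, be sketched as a shift register with feedback that compacts a sequence of response samples into one 16-bit word. The feedback polynomial, the seed, the 16-bit width and the sample values are all assumptions made for this sketch and are not specified in the present text.

```python
POLY = 0x1021  # assumed feedback polynomial (as in CRC-16/CCITT)


def update_signature(sig, sample):
    """Shift one 16-bit response sample into the signature register."""
    sig ^= sample & 0xFFFF
    for _ in range(16):
        if sig & 0x8000:
            sig = ((sig << 1) ^ POLY) & 0xFFFF
        else:
            sig = (sig << 1) & 0xFFFF
    return sig


def signature(samples, seed=0xFFFF):
    """Compact a sequence of response samples into one signature word."""
    sig = seed
    for sample in samples:
        sig = update_signature(sig, sample)
    return sig


# Hypothetical responses observed at one node during a test run.
good_run = [0x0001, 0x8000, 0x1234, 0xBEEF]
bad_run = [0x0001, 0x8000, 0x1235, 0xBEEF]   # one bit differs in one sample

expected = signature(good_run)               # reference signature
fault_detected = signature(bad_run) != expected
```

Since every sample is folded into the register, the signature depends on the whole sequence of responses and not only on the final value. Keeping one such signature per observed node would additionally indicate at which node a deviation first appears.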
A method for performing fault detection is set forth in the attached claims. According to the method, the processor is made available by suspending other ongoing activities, the logic units to be tested are set to a predetermined state, the output response analyser is activated and a stimuli generation is executed. The controllability of the system is thus concentrated in the processor unit. The output response analyser collects the responses to the stimuli and creates signatures of the collected responses. These observability-related steps are performed in the output response analyser. Furthermore, the signatures are verified, and if a fault is detected, the error is signalled. Preferably, the present state of the processor is stored prior to the test procedure and reloaded after the procedure is finished, whereby the original interrupted process can be resumed. Preferably, the test procedure can also be divided into parts, so that each part can be run separately during different shut-down periods.
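The sequence of method steps may be sketched, under strongly simplifying assumptions, as follows. The class `Unit` is a hypothetical toy model of a logic unit, its stored state stands in for the state that is saved and reloaded, and `zlib.crc32` is used merely as a convenient stand-in for the signature creation; none of these choices is taken from the claims.

```python
import zlib


class Unit:
    """Toy logic unit: an accumulator observable at one node."""

    def __init__(self, fault_mask=0):
        self.acc = 0
        self.fault_mask = fault_mask  # nonzero models stuck-at-1 bits

    def apply(self, stimulus):
        self.acc = ((self.acc + stimulus) & 0xFF) | self.fault_mask
        return self.acc


STIMULI = [0x01, 0x7F, 0x80, 0x33]  # hypothetical stimuli sequence


def run_stimuli(unit):
    unit.acc = 0                                        # predetermined state
    responses = bytes(unit.apply(s) for s in STIMULI)   # stimuli and collection
    return zlib.crc32(responses)                        # signature creation


GOLDEN = run_stimuli(Unit())  # reference signature from a fault-free unit


def background_test(unit):
    """Return True when the signature matches; always restore the state."""
    saved = unit.acc                  # store the present state
    try:
        return run_stimuli(unit) == GOLDEN
    finally:
        unit.acc = saved              # reload so interrupted work can resume


working = Unit()
working.acc = 0x42                    # state of the interrupted process
passed = background_test(working)
```

Restoring the state in a `finally` clause reflects the non-destructive character of the test: the interrupted state is reloaded whether or not a fault was found.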