1. Field of Invention
This invention relates to error recovery systems, and in particular to characterizing and repairing intelligent systems using historical behavior of the systems.
2. Description of Related Art
Intelligent systems such as programmable robots and distributed networks, and even more abstract products such as software programs, are built according to manufacturing tolerances. For example, a machine is generally built to within certain design tolerances for component size and fit, although it may function within broader specifications. However, as the machine interacts with its environment, the machine's performance may degrade. For example, physical parts will wear out over time so that the machine will react differently to the same stimuli at different times.
In principle, software programs should behave the same way every time they run because they have no "moving parts" to degrade. However, in intelligent networks, as more components are added to the network or as existing components are upgraded, interactions among the components may become more complex. Thus, control software may react differently over time. For example, in a new computer system the task of downloading a file may complete with no problems. However, if some software or hardware components are upgraded, for example with a new operating system or new storage media, a download of the same file may fail because of the changes in the system. Further, as the physical machines on which the software runs begin to age, electronic errors may occur in hardware components, with a corresponding effect on the operation of the software and the overall system.
While eventual system failures can therefore be expected in a variety of intelligent systems, identifying which hardware component or software module failed can be very difficult and time-consuming. The conventional approach to repairing intelligent systems is essentially to tear down the piece of equipment suspected to be faulty. That is, the network or physical component is taken offline, and its components are analyzed piece by piece until the defective part, and thus the source of the error, is identified. This method of error detection and recovery is very time-consuming and, because it is intrusive, can introduce further errors into the machine or network during the attempted diagnosis, making recovery even more difficult.