1. Field of Invention
This invention relates to error recovery systems, and in particular to characterizing and repairing intelligent systems using historical behavior of the systems.
2. Description of Related Art
Intelligent systems such as programmable robots and distributed networks, and even more abstract products such as software programs, are built according to manufacturing tolerances. For example, a machine is generally built to within certain design tolerances for component size and fit, although it may function within broader specifications. However, as the machine interacts with its environment, the machine's performance may degrade. For example, physical parts will wear out over time so that the machine will react differently to the same stimuli at different times.
Software programs should behave the same way all the time because they have no "moving parts" to degrade. However, in intelligent networks, as more components are added to the network or as existing components are upgraded, interactions of the components may become more complex. Thus, there is a possibility that control software may react differently over time. For example, in a new computer system the task of downloading a file may complete with no problems. However, if some components of software or hardware are upgraded, such as with a new operating system or storage media, a download of the same file may not complete because of the changes in the system. Further, as the physical machines on which the software runs begin to age, electronic errors may occur in hardware components with a corresponding effect on the operation of the software and overall system.
While eventual system failures can therefore be expected in a variety of intelligent systems, when they occur the process of identifying which hardware component or which software module failed can be very difficult and time-consuming. The conventional approach for repairing intelligent systems is essentially to tear down a piece of equipment suspected to be faulty. That is, the network or physical component is taken offline, and its components are analyzed piece by piece until the defective part and source of error are identified. This method of error detection and recovery is very time-consuming and, because it is intrusive, can lead to further errors in the machine or network during the attempted diagnosis, making recovery even more difficult.
The invention relates to techniques for controlling and characterizing systems that can create self-learned error recovery plans for the systems.
In various exemplary embodiments, techniques for performing actions in a system include detecting a particular state of the system after the system performs an action transitioning from a previous state to the particular state, then comparing the detected state to an expected state of the system. If the detected state differs from the expected state, then one or more actions are performed to cause the system to transition from the detected state to a desired, or recovery, state.
In operation, the recovery actions performed are determined in part by a trigger, which describes the actions of the system leading to the detected state, and further by one or more experience nodes that store recovery information, including at least information relating to triggers.
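By way of illustration only, the detect, compare, and recover sequence described above might be sketched as follows. The names `Trigger`, `ExperienceNode`, `step`, the state and action labels, and the dictionary-based transition table are all assumptions introduced for this sketch; the specification does not prescribe any particular data layout.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Trigger:
    """Actions of the system leading to the detected state (hypothetical shape)."""
    actions: tuple


@dataclass
class ExperienceNode:
    """Recovery information keyed by trigger (hypothetical layout)."""
    recoveries: dict = field(default_factory=dict)  # Trigger -> list of action names


def step(state, action, expected, transitions, node, history=()):
    """Perform one action, compare detected vs. expected state, recover if they differ."""
    detected = transitions[(state, action)]      # state reached after the action
    if detected == expected:
        return detected, []                      # no recovery needed
    trigger = Trigger(history + (action,))       # what led to the detected state
    plan = node.recoveries.get(trigger, [])      # stored recovery actions, if any
    for recovery_action in plan:
        detected = transitions[(detected, recovery_action)]
    return detected, plan


# Illustrative, made-up system: a download stalls and is recovered.
transitions = {
    ("idle", "download"): "stalled",
    ("stalled", "reset_link"): "idle",
    ("idle", "retry_download"): "done",
}
node = ExperienceNode({Trigger(("download",)): ["reset_link", "retry_download"]})
final_state, plan = step("idle", "download", "done", transitions, node)
```

In this sketch an unknown trigger yields an empty plan, leaving the system in the detected state; how a new plan would then be learned is outside the scope of the example.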
In view of the above limitations for maintaining intelligent systems, the invention relates to a system and method for characterizing and repairing intelligent systems which creates a self-learned error recovery plan for the network. As the system evolves or interacts with the environment, and encounters errors that force it to take action to overcome them, the error recovery plan is updated and stored in an experience node. The sum of the experience nodes becomes the intelligent system's experience map.
The invention also allows the error recovery plan to be stored compactly to minimize memory requirements in network nodes, for instance on a local hard drive or other media. The experience nodes can then be easily searched, and the results from a failed network A, for example, can be compared to those of a second network B or of subsequent networks built with similar components.
In terms of the invention's general environment, the generation of the experience map is based in part on the fact that each intelligent machine in a network environment is capable of executing a finite set of atomic actions, that is, actions that cannot be decomposed into other actions. During the execution of the atomic actions, an error may occur. That is, the intelligent machine may proceed from an error-free state to an error state. Once the intelligent machine arrives in the error state, it must execute one or more atomic actions to return to the error-free state. Therefore, the intelligent machine will traverse a selected path through the space of atomic actions to recover to the error-free state. The path and the specific atomic actions along the path make up an error recovery plan.
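By way of illustration only, an error recovery plan can be viewed as a path of atomic actions from an error state back to an error-free state. The sketch below uses breadth-first search purely as a stand-in for however the machine selects its path; the function name `find_recovery_plan`, the state and action labels, and the transition table are assumptions made for the example.

```python
from collections import deque


def find_recovery_plan(error_state, error_free_states, transitions):
    """Return a list of atomic actions leading to an error-free state, or None.

    transitions maps (state, atomic_action) -> next_state (hypothetical encoding).
    """
    queue = deque([(error_state, [])])
    seen = {error_state}
    while queue:
        state, path = queue.popleft()
        if state in error_free_states:
            return path                          # the path is the recovery plan
        for (src, action), dst in transitions.items():
            if src == state and dst not in seen:
                seen.add(dst)
                queue.append((dst, path + [action]))
    return None                                  # no recovery path exists


# Illustrative, made-up robot states and atomic actions.
robot_transitions = {
    ("jammed", "open_gripper"): "released",
    ("released", "home_arm"): "ready",
    ("jammed", "power_cycle"): "booting",
    ("booting", "self_test"): "calibrating",
    ("calibrating", "home_arm"): "ready",
}
recovery_plan = find_recovery_plan("jammed", {"ready"}, robot_transitions)
```

Breadth-first search returns a path with the fewest atomic actions; a machine that learns its plans from experience could equally settle on a longer path.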
Since each atomic action when executed can lead to an error, each atomic action can be designated as a starting point in an experience node. Then, the error recovery plan will become a part of that experience node. However, each experience node can have more than one error recovery plan, because more than one error can occur during the execution of each atomic action.
The experience node can be represented as a series of interconnected nodes, with the atomic action as the starting node. The paths returning to an error-free state are reflected in the connections between the remaining internal nodes of the experience node. Which particular path the intelligent machine will take to recover from an error is determined by a routing key. The routing key is based on conditions existing at the start node prior to the error.
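By way of illustration only, an experience node holding several error recovery plans, with a routing key selecting among them, might be sketched as follows. The class and function names, the encoding of the routing key as a sorted tuple of conditions, and the example conditions are assumptions made for this sketch and are not taken from the specification.

```python
from dataclasses import dataclass, field


def make_routing_key(conditions):
    """Build a hashable key from conditions at the start node (hypothetical encoding)."""
    return tuple(sorted(conditions.items()))


@dataclass
class ExperienceNode:
    start_action: str                            # atomic action at the starting node
    plans: dict = field(default_factory=dict)    # routing key -> list of atomic actions

    def learn(self, conditions, plan):
        """Store (or update) the recovery plan used under these conditions."""
        self.plans[make_routing_key(conditions)] = list(plan)

    def recover(self, conditions):
        """Return the plan selected by the routing key, or None if none is stored."""
        return self.plans.get(make_routing_key(conditions))


# Illustrative, made-up conditions: one atomic action, two different errors.
download_node = ExperienceNode("download_file")
download_node.learn({"os": "v2", "disk": "full"}, ["free_space", "retry"])
download_node.learn({"os": "v2", "disk": "ok"}, ["reset_link", "retry"])
chosen_plan = download_node.recover({"disk": "full", "os": "v2"})
```

Sorting the conditions makes the key independent of the order in which they are supplied, so the same conditions always route to the same plan.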
The experience map thus contains the set of possible errors and the corresponding error recovery plans, including the paths of atomic actions taken to return to an error-free state. Graphically, the experience map can be represented by a three-dimensional topographical graph with axes for frequency of execution of the error recovery plan, the set of error recovery plans, and the routing keys.
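By way of illustration only, the three axes of such a graph can be held in a simple frequency table keyed by error recovery plan and routing key. The class name `ExperienceMap`, its methods, and the string routing keys below are assumptions made for the sketch.

```python
from collections import Counter


class ExperienceMap:
    """Counts how often each (recovery plan, routing key) pair is executed."""

    def __init__(self):
        self.frequency = Counter()               # (plan, routing_key) -> execution count

    def record(self, plan, routing_key):
        """Note one execution of a recovery plan under a given routing key."""
        self.frequency[(tuple(plan), routing_key)] += 1

    def surface(self):
        """Return sorted (plan, routing_key, count) triples: the points of the graph."""
        return sorted((p, k, n) for (p, k), n in self.frequency.items())


# Illustrative, made-up recovery history.
experience_map = ExperienceMap()
experience_map.record(["reset", "retry"], "disk=full")
experience_map.record(["reset", "retry"], "disk=full")
experience_map.record(["reboot"], "disk=ok")
points = experience_map.surface()
```

Each triple corresponds to one point of the topographical graph, with the count giving its height.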
Each intelligent system will develop a unique experience map because each will react differently to the errors it encounters during operation and execution of its atomic actions. Therefore, the experience map can be used to provide diagnostic information and a unique electronic "fingerprint" for a particular intelligent system. For example, telecommunications networks may contain several routers that are used to route data packets from an originator to a destination. The data packets are routed according to information contained in a header of the data packet and the programming of the routers. If 70% of the time the data packets are routed correctly according to the data packets and the programming, and 30% of the time the data packets are rerouted, the network may be experiencing one or more errors. By reviewing the experience map for each router in the network, the specific router encountering the error, or the component of that router encountering the error, can be identified.
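By way of illustration only, reviewing per-router experience maps to locate the source of rerouting might be sketched as follows. The router names, the packet counts, the 10% threshold, and the reduction of each experience map to two counters are all made-up assumptions for the example and are not taken from the specification.

```python
def reroute_fraction(counts):
    """Fraction of packets a router rerouted instead of routing as programmed."""
    total = counts["routed"] + counts["rerouted"]
    return counts["rerouted"] / total if total else 0.0


def suspect_routers(maps, threshold):
    """Return the names of routers whose reroute fraction exceeds the threshold."""
    return sorted(
        name for name, counts in maps.items()
        if reroute_fraction(counts) > threshold
    )


# Illustrative, made-up summaries of three routers' experience maps.
router_maps = {
    "router_a": {"routed": 98, "rerouted": 2},
    "router_b": {"routed": 40, "rerouted": 60},
    "router_c": {"routed": 99, "rerouted": 1},
}
suspects = suspect_routers(router_maps, threshold=0.10)
```

A fuller review would also inspect which recovery plans and routing keys dominate the suspect router's map in order to narrow the error to a component of that router.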